golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.29k stars 17.58k forks source link

x/text/collate: Compare and CompareString broken in "ka-shifted" collation mode #68166

Open danderson opened 3 months ago

danderson commented 3 months ago

Go version

go version go1.22.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/dave/.cache/go-build'
GOENV='/home/dave/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dave/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dave/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/nix/store/00mg4vlhzmm7gi9bd5v5ydjlgrywpc3n-go-1.22.3/share/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.3'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/dave/hack/psl/tools/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1869424909=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Tried to compare strings with the "shifted" variable weighting option, which you can select with e.g. language tag en-u-ka-shifted.

Demo: https://go.dev/play/p/afAOLzqh3Oe

What did you see happen?

99% of Compare() and CompareString() calls on different strings return 0.

What did you expect to see?

Comparison results that match the configured collation, or at least a courtesy panic letting me know it's known broken (spoiler alert...)

gabyhelp commented 3 months ago

Similar Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

danderson commented 3 months ago

Pretty simple bug: https://cs.opensource.google/go/x/text/+/refs/tags/v0.16.0:collate/collate.go;l=166 . When the collator is in the shifted alternate mode, compare() silently skips the entire primary level, and thus the comparisons are == 0 always when comparing ascii. They're occasionally != 0 if you get lucky and the secondary/tertiary levels have tie-breakers.

Looking at the history, x/text/collate seems unmaintained (last change that wasn't fmts or Unicode upgrades was 2016 :( ). Given this, my suggestion tactically would be to rip out the custom codepaths for Compare/CompareString, and have them generate and bytes.Compare full collation keys. It's less efficient for one-off compares, but I think it would be correct, and a future maintainer can always make it faster again since it's all internal package gubbins.

ianlancetaylor commented 3 months ago

CC @mpvl

danderson commented 3 months ago

I hacked up a more comprehensive debugging tool: https://go.dev/play/p/GcsuboreVW8 . For a bunch of comparisons and collations, it prints:

There's also main() in there suitable for running from the CLI as well.

What I reported above can be seen in the FULL vs. ITER of collations en and en-u-ka-shifted: the latter reports all inputs as equivalent.

Further weirdness I discovered while writing the tool:

Summary:

I have no idea where these new behaviors come from, I haven't paged in enough of the code to understand what's happening.

danderson commented 3 months ago

Looking more closely at UTS 35, the multi-key language tags above are not well formed because multiple -u- extensions is ill-formed (I would have hoped that language.Parse would tell me :( ), although based on the lang.String() output it looks like the parser is graciously accepting the ill-formed inputs.

I tried collations en-u-ka-shifted-kn-true, en-u-ka-shifted-kn, en-u-ka-shifted-u-kn-true, en-u-ka-shifted-u-kn, en-u-kn-ka-shifted, en-u-kn-true-ka-shifted, en-u-kn-true-u-ka-shifted, all of which are trying to request the same behavior (ka=shifted, kn=true) in various degrees of ill-formed or non-canonicality (the canonical tag is en-u-ka-shifted-kn). All of them break both iterative comparison and numeric sorting, except for the ill-formed and non-canonical en-u-kn-true-u-ka-shifted, where both comparison styles are correct and both use numeric sorting.

danderson commented 3 months ago

Correction: en-u-kn-true-u-ka-shifted is also broken, it produces a numeric sort but fails to ignore punctuation. It's hard to see until you start sorting whole slices with these comparators, but afaict there is just no way to enable kn=true and ka=shifted at the same time, even if you avoid Compare/CompareString and use collation keys explicitly, you can only get kn=true xor ka=shifted no matter how you phrase your request.

Given that numeric sorting is a wrapper around the standard weight, I'm assuming that it lacks awareness of ka=shifted and so breaks the weight adjustments that need to happen, but again I haven't paged in enough of the code to be sure.