Open danderson opened 3 months ago
Similar Issues
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
Pretty simple bug: https://cs.opensource.google/go/x/text/+/refs/tags/v0.16.0:collate/collate.go;l=166 . When the collator is in the shifted alternate mode, compare()
silently skips the entire primary level, and thus the comparisons are == 0 always when comparing ascii. They're occasionally != 0 if you get lucky and the secondary/tertiary levels have tie-breakers.
Looking at the history, x/text/collate seems unmaintained (last change that wasn't fmts or Unicode upgrades was 2016 :( ). Given this, my suggestion tactically would be to rip out the custom codepaths for Compare/CompareString, and have them generate and bytes.Compare full collation keys. It's less efficient for one-off compares, but I think it would be correct, and a future maintainer can always make it faster again since it's all internal package gubbins.
CC @mpvl
I hacked up a more comprehensive debugging tool: https://go.dev/play/p/GcsuboreVW8 . For a bunch of comparisons and collations, it prints:
bytes.Compare()
over the collation keys obtained from coll.KeyFromString()
coll.CompareString()
There's also main() in there suitable for running from the CLI as well.
What I reported above can be seen in the FULL
vs. ITER
of collations en
and en-u-ka-shifted
: the latter reports all inputs as equivalent.
Further weirdness I discovered while writing the tool:
en-u-kn-true-u-ka-shifted
, this somehow fixes iterative collation to match key compares??en-u-ka-shifted-u-kn-true
, iterative collation continues to be wrong.ka=shifted
before kn=true
this breaks numeric sorting for both the sort keys and iterative compare.Summary:
en
and en-u-kn-true-u-ka-shifted
seem to work as specified by Unicode.en-u-ka-shifted
and en-u-ka-shifted-u-kn-true
silently breaks iterative comparisonen-u-ka-shifted-u-kn-true
additionally silently doesn't do numeric sortingI have no idea where these new behaviors come from, I haven't paged in enough of the code to understand what's happening.
Looking more closely at UTS 35, the multi-key language tags above are not well formed because multiple -u-
extensions is ill-formed (I would have hoped that language.Parse would tell me :( ), although based on the lang.String()
output it looks like the parser is graciously accepting the ill-formed inputs.
I tried collations en-u-ka-shifted-kn-true
, en-u-ka-shifted-kn
, en-u-ka-shifted-u-kn-true
, en-u-ka-shifted-u-kn
, en-u-kn-ka-shifted
, en-u-kn-true-ka-shifted
, en-u-kn-true-u-ka-shifted
, all of which are trying to request the same behavior (ka=shifted, kn=true) in various degrees of ill-formed or non-canonicality (the canonical tag is en-u-ka-shifted-kn
). All of them break both iterative comparison and numeric sorting, except for the ill-formed and non-canonical en-u-kn-true-u-ka-shifted
, where both comparison styles are correct and both use numeric sorting.
Correction: en-u-kn-true-u-ka-shifted
is also broken, it produces a numeric sort but fails to ignore punctuation. It's hard to see until you start sorting whole slices with these comparators, but afaict there is just no way to enable kn=true
and ka=shifted
at the same time, even if you avoid Compare/CompareString and use collation keys explicitly, you can only get kn=true
xor ka=shifted
no matter how you phrase your request.
Given that numeric sorting is a wrapper around the standard weight, I'm assuming that it lacks awareness of ka=shifted
and so breaks the weight adjustments that need to happen, but again I haven't paged in enough of the code to be sure.
Go version
go version go1.22.3 linux/amd64
Output of
go env
in your module/workspace:What did you do?
Tried to compare strings with the "shifted" variable weighting option, which you can select with e.g. language tag
en-u-ka-shifted
.Demo: https://go.dev/play/p/afAOLzqh3Oe
What did you see happen?
99% of Compare() and CompareString() calls on different strings return 0.
What did you expect to see?
Comparison results that match the configured collation, or at least a courtesy panic letting me know it's known broken (spoiler alert...)