Open markusicu opened 1 year ago
More generally, are the diffs in lines related to QU intentional?
No. Blame the author of #456.
It looks like SegmenterDefault is correct, and SegmenterCldr is wrong. This should be straightforward to fix, but I am going to try not to do that while on vacation.
Ping @macchiati @eggrobin this was meant “for Unicode 15.1” and CLDR 44...
@eggrobin Do we need a CLDR version of LineBreakTest.* files, or will UTC and CLDR line break rules be the same starting with Unicode 16?
@echeran FYI
will UTC and CLDR line break rules be the same starting with Unicode 16?
They probably will be. But even if they were not,
Do we need a CLDR version of LineBreakTest.* files[?]
Evidently we do not, since:
- we seemingly do not push it to CLDR, let alone consume it in ICU;
- what we do have in this repository is untested garbage, since, as you noted,
it seems like the CLDR expression
$QU_Pf=($QU_Pi $X)
has a typo
which it still does, because I apparently forgot to do anything about it when I came back from vacation.
Maybe it could be useful for CLDR to have a way to publish tailored rules; but what we have isn’t a solution to that problem, it is a major maintenance burden. We have too many independent copies of these rules all over the place already, most of which we need. Let’s get rid of this one, which we don’t.
Do we need a CLDR version of LineBreakTest.* files[?]
Evidently we do not, since:
- we seemingly do not push it to CLDR, let alone consume it in ICU;
We do have that test file in ICU: https://github.com/unicode-org/icu/blob/main/icu4c/source/test/testdata/LineBreakTest.txt
And ICU's icu4c/source/test/intltest/rbbitst.cpp has exceptions for some of its test cases, which suggests that we are using it for testing.
We have an open/unassigned ICU ticket for figuring out why we need these exceptions: https://unicode-org.atlassian.net/browse/ICU-21097 “Investigate rbbi tests for closed ICU-8151/12017 that still fail”
I am tempted to assign that to you :-P
We also have WordBreakTest exceptions there for https://unicode-org.atlassian.net/browse/ICU-22127 “Update test skips if colon tailoring per ICU-22112 moves into UAX #29”
I think this means that whenever CLDR segmentation differs from TUS segmentation, we should have the Unicode Tools segmenter rules for CLDR reflect CLDR behavior, generate CLDR-specific test files, and have ICU test with those files -- rather than hack exceptions into the ICU test code.
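For illustration, here is a rough sketch of what consuming a tailored test file directly could look like, instead of listing exceptions in the test code. It relies only on the published test file syntax (÷ for a break opportunity, × for no break, hexadecimal code points, `#` starting a comment). The names `parse_test_line` and `mismatches` and the `find_breaks` callback are made up for this sketch and are not ICU or Unicode Tools APIs.

```python
BREAK = "\u00f7"     # ÷ : a break is allowed at this position
NO_BREAK = "\u00d7"  # × : no break is allowed at this position


def parse_test_line(line):
    """Return (text, break_offsets) for one test line, or None if it has no data.

    Offsets are counted in code points from the start of the text.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    text = ""
    breaks = []
    for token in data.split():
        if token == BREAK:
            breaks.append(len(text))
        elif token == NO_BREAK:
            continue
        else:
            text += chr(int(token, 16))  # a hexadecimal code point
    return text, breaks


def mismatches(lines, find_breaks):
    """Yield (line_number, text, expected, actual) for every failing test line.

    find_breaks is a hypothetical callback: it takes a string and returns the
    break offsets computed by the segmenter under test.
    """
    for number, line in enumerate(lines, start=1):
        parsed = parse_test_line(line)
        if parsed is None:
            continue
        text, expected = parsed
        actual = sorted(find_breaks(text))
        if actual != expected:
            yield number, text, expected, actual
```

With a CLDR-specific file, a test driver along these lines would need no skip list: any mismatch would be a real bug in either the rules or the implementation.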
We do have that test file in ICU: https://github.com/unicode-org/icu/blob/main/icu4c/source/test/testdata/LineBreakTest.txt
That is the file from the UCD, not a CLDR version.
This one, generated by yours truly in the early afternoon of the 8th of August: https://github.com/unicode-org/unicodetools/blob/final-15.1-20230908/unicodetools/data/ucd/dev/auxiliary/LineBreakTest.txt.
I am tempted to assign that to you :-P
Sure.
We also have WordBreakTest exceptions there for https://unicode-org.atlassian.net/browse/ICU-22127 “Update test skips if colon tailoring per ICU-22112 moves into UAX #29”
That probably wants to be assigned to me as well.
I think this means that whenever CLDR segmentation differs from TUS segmentation, we should have the Unicode Tools segmenter rules for CLDR reflect CLDR behavior, generate CLDR-specific test files, and have ICU test with those files -- rather than hack exceptions into the ICU test code.
That sounds somewhat reasonable; for ICU4X which has 15.0 without the numbers tailoring, I generated a LineBreakTest.txt from my fork of the tools (see https://github.com/unicode-org/icu4x/blob/7397d7d85ab4d87fef1d42760d3dfe6e88cd1391/components/segmenter/tests/testdata/LineBreakTest.txt), which seems to be the same approach.
But let’s cross that bridge when we get to it; for now we are aiming for convergence in 16.0.
Right now the 15.1 line breaking rules in SegmenterCldr.txt are used by no-one, and this is a good thing since they are wrong and untested (and we went through a release with these wrong rules!). Let’s get rid of that quasi-copy before someone gets hurt.
If we get back to a point where we need a CLDR version of the segmenter rules, they should be maintained as a diff instead of having incorrect copies of most of the rules (similar considerations apply to the many CSS variants in ICU; at least there the process calls for updating them by applying diffs of the root file, but that is quite painful).
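To make "maintained as a diff" a bit more concrete, here is a minimal sketch, assuming a simplified rule file in which each line is either a variable definition (`$NAME=...`) or a numbered rule (`9.0) ...`). The functions `rule_key` and `apply_tailoring` and the overrides format are hypothetical, not something that exists in the tools today. The point is that a tailoring naming a rule that no longer exists in the base fails loudly instead of silently drifting.

```python
def rule_key(line):
    """Return the variable name or rule number that identifies a line, or None."""
    stripped = line.strip()
    if stripped.startswith("$") and "=" in stripped:
        return stripped.split("=", 1)[0].strip()
    if ")" in stripped:
        head = stripped.split(")", 1)[0].strip()
        if head.replace(".", "", 1).isdigit():  # e.g. "9.0" or "21.02"
            return head
    return None


def apply_tailoring(base_lines, overrides):
    """Return the base rules with the overridden lines replaced.

    overrides maps a key (as returned by rule_key) to a full replacement line.
    Only the first base line with a given key is replaced (assumption: keys
    are unique in this simplified format).
    """
    remaining = dict(overrides)
    result = []
    for line in base_lines:
        key = rule_key(line)
        if key in remaining:
            result.append(remaining.pop(key))
        else:
            result.append(line)
    if remaining:
        raise KeyError(
            f"tailoring refers to rules missing from the base: {sorted(remaining)}"
        )
    return result
```

For example, with made-up rules, `apply_tailoring(["$A=a", "$B=b", "1.0) $A × $B"], {"$B": "$B=[bc]"})` replaces only the `$B` line and leaves the rest of the base untouched.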
I would propose a bit of a policy change to help manage this.
That way we don't have to deal with the complexities of rule changes. It would require agreement by CLDR to never make rule changes, and by the UTC to be open to NOOP rule additions.
This seems reasonable in principle; but I will note that we do not publish the unicodetools version of the segmenter rules as part of the UCD, so a no-op there does not exist as far as the UTC is concerned; thus there is no need for anything to be decided at the UTC or even the PAG level; only the maintainers of the tools need to be OK with it.
Which means we can cross this bridge, too, when we get to it.
even easier, then.
Since Unicode 15.1 and consensus UTC-175-C26 (PR #477), the default and CLDR grapheme break rules (SegmenterDefault.txt vs. SegmenterCldr.txt) are the same. Therefore, we might want to stop generating UCD/(version)/cldr/GraphemeBreakTest.* files. We could also continue to generate them just to keep exercising the relevant code paths.
However, the line break rules are still different, and we have not been generating UCD/(version)/cldr/LineBreakTest.* files. Why not? Shouldn't we generate those, check them into the CLDR repo, and use them for testing in ICU?
Also, it seems like the CLDR expression
$QU_Pf=($QU_Pi $X)
has a typo. More generally, are the diffs in lines related to QU intentional?
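A check along these lines might catch this class of mistake mechanically. It is only a sketch, and it assumes that the suspected typo is that the variable on the right-hand side differs from the one being redefined, i.e. that lines of the form `$NAME=($NAME $X)` are expected to repeat the same name. The function `suspicious_definitions` is hypothetical and not part of the tools.

```python
import re

# Matches lines of the form "$LEFT=($RIGHT $X)", allowing optional spaces.
ABSORB = re.compile(r"^\s*\$(\w+)\s*=\s*\(\s*\$(\w+)\s+\$X\s*\)\s*$")


def suspicious_definitions(lines):
    """Yield (line_number, line) where LEFT and RIGHT differ in "$LEFT=($RIGHT $X)"."""
    for number, line in enumerate(lines, start=1):
        match = ABSORB.match(line)
        if match and match.group(1) != match.group(2):
            yield number, line.strip()
```

Run over both rule files, it would flag `$QU_Pf=($QU_Pi $X)` but not a line such as `$QU_Pi=($QU_Pi $X)`.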
@eggrobin @macchiati @echeran
Diffs: