adjust cldr/*BreakTest generation for Unicode 15.1

markusicu commented 1 year ago

Since Unicode 15.1 and consensus UTC-175-C26 (PR #477), the default and CLDR grapheme break rules (SegmenterDefault.txt vs. SegmenterCldr.txt) are the same. Therefore, we might want to stop generating UCD/(version)/cldr/GraphemeBreakTest.* files. We could also continue to generate them just to keep exercising the relevant code paths.

However, the line break rules are still different, and we have not been generating UCD/(version)/cldr/LineBreakTest.* files. Why not? Shouldn't we generate those, check them into the CLDR repo, and use them for testing in ICU?

Also, it seems like the CLDR expression $QU_Pf=($QU_Pi $X) has a typo.
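
A typo of this kind can be caught mechanically. The following is a minimal sketch, not part of the Unicode Tools: it assumes that every redefinition of the form `$NAME=($OTHER $X)` is meant to have `NAME` equal to `OTHER`, as in the neighbouring lines of the diff below. The function name `find_mismatched_extensions` is hypothetical.

```python
import re

# Matches lines of the form "$NAME=($OTHER $X)", the redefinition pattern
# that appears in the rule files quoted in this issue.
EXTEND_RE = re.compile(r"^\s*\$(\w+)\s*=\s*\(\s*\$(\w+)\s+\$X\s*\)\s*$")


def find_mismatched_extensions(lines):
    """Return (line_number, defined, referenced) for each line where the
    variable being redefined differs from the variable it wraps."""
    problems = []
    for number, line in enumerate(lines, start=1):
        match = EXTEND_RE.match(line)
        if match and match.group(1) != match.group(2):
            problems.append((number, match.group(1), match.group(2)))
    return problems


# Lines copied from the SegmenterCldr.txt side of the diff.
sample = [
    "$ZWJ=($ZWJ $X)",
    "",
    "$QU_Pi=($QU_Pi $X)",
    "$QU_Pf=($QU_Pi $X)",
    "",
    "$DottedCircle=($DottedCircle $X)",
]

for number, defined, referenced in find_mismatched_extensions(sample):
    print(f"line {number}: ${defined} is defined in terms of ${referenced}")
```

This only flags name mismatches; it would not notice the other QU difference in the diff, where the initial `[$QU & \p{gc=Pi}]` definitions were replaced by the `($QU_Pi $X)` form.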

More generally, are the diffs in lines related to QU intentional?

@eggrobin @macchiati @echeran

Diffs:

$ diff -u unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt unicodetools/src/main/resources/org/unicode/tools/SegmenterCldr.txt

-->

--- unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt  2023-06-01 10:47:21.084375072 -0700
+++ unicodetools/src/main/resources/org/unicode/tools/SegmenterCldr.txt 2023-06-01 10:47:21.080374729 -0700
@@ -108,8 +108,8 @@
 $ZWJ_O=\p{Line_Break=ZWJ}
 $ZWJ=\p{Line_Break=ZWJ}

-$QU_Pi=[$QU & \p{gc=Pi}]
-$QU_Pf=[$QU & \p{gc=Pf}]
+$QU_Pi=($QU_Pi $X)
+$QU_Pf=($QU_Pf $X)

 $DottedCircle = ◌

@@ -147,7 +147,7 @@
 $Spec3a_=[^ $SP $BA $HY $CM]
 $Spec3b_=[^ $BA $HY $CM]
 $Spec4_=[^ $NU $CM]
-##CLDR: $Spec5_=[$BK $CB $CR $LF $NL $SP $ZW]
+$Spec5_=[$BK $CB $CR $LF $NL $SP $ZW]

 # SPECIAL EXTENSIONS

@@ -195,7 +195,7 @@
 $ZWJ=($ZWJ $X)

 $QU_Pi=($QU_Pi $X)
-$QU_Pf=($QU_Pf $X)
+$QU_Pf=($QU_Pi $X)

 $DottedCircle=($DottedCircle $X)

@@ -267,11 +267,11 @@
 # LB 20  Break before and after unresolved CB.
 20.01)  ÷ $CB
 20.02) $CB ÷
-##CLDR: LB 20.9  Don't break between Hyphens and Letters when there is a break preceding the hyphen.
-##CLDR: Originally added as a Finnish tailoring, now promoted to default CLDR behavior.
-##CLDR: Must be before LB 21. Note: this is not default UAX #14 behaviour. See ICU issue ICU-8151.
-##CLDR: (Unlike in ICU, here we just check a limited set of known breaks, ignoring some cases like LB 14).
-##CLDR: 20.09) $Spec5_ $HY × $AL
+# LB 20.9  Don't break between Hyphens and Letters when there is a break preceding the hyphen.
+# Originally added as a Finnish tailoring, now promoted to default CLDR behavior.
+# Must be before LB 21. Note: this is not default UAX #14 behaviour. See ICU issue ICU-8151.
+# (Unlike in ICU, here we just check a limited set of known breaks, ignoring some cases like LB 14).
+20.09) $Spec5_ $HY × $AL
 # LB 21  Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents.
 21.01) × $BA
 21.02) × $HY

eggrobin commented 1 year ago

> More generally, are the diffs in lines related to QU intentional?

No. Blame the author of #456.

It looks like SegmenterDefault is correct, and SegmenterCldr is wrong. This should be straightforward to fix, but I am going to try not to do that while on vacation.

markusicu commented 8 months ago

Ping @macchiati @eggrobin this was meant “for Unicode 15.1” and CLDR 44...

markusicu commented 4 months ago

@eggrobin Do we need a CLDR version of LineBreakTest.* files, or will UTC and CLDR line break rules be the same starting with Unicode 16?

@echeran FYI

eggrobin commented 4 months ago

> will UTC and CLDR line break rules be the same starting with Unicode 16?

They probably will be. But even if they were not,

> Do we need a CLDR version of LineBreakTest.* files[?]

Evidently we do not, since:

- we seemingly do not push it to CLDR, let alone consume it in ICU;
- what we do have in this repository is untested garbage, since, as you noted,

  > it seems like the CLDR expression $QU_Pf=($QU_Pi $X) has a typo

which it still does, because I apparently forgot to do anything about it when I came back from vacation.

Maybe it could be useful for CLDR to have a way to publish tailored rules; but what we have isn’t a solution to that problem, it is a major maintenance burden. We have too many independent copies of these rules all over the place already, most of which we need. Let’s get rid of this one, which we don’t.

markusicu commented 4 months ago

> > Do we need a CLDR version of LineBreakTest.* files[?]
>
> Evidently we do not, since:
>
> - we seemingly do not push it to CLDR, let alone consume it in ICU;

We do have that test file in ICU: https://github.com/unicode-org/icu/blob/main/icu4c/source/test/testdata/LineBreakTest.txt

And ICU's icu4c/source/test/intltest/rbbitst.cpp has exceptions for some of its test cases, which suggests that we are using it for testing.

We have an open/unassigned ICU ticket for figuring out why we need these exceptions: https://unicode-org.atlassian.net/browse/ICU-21097 “Investigate rbbi tests for closed ICU-8151/12017 that still fail”

I am tempted to assign that to you :-P

We also have WordBreakTest exceptions there for https://unicode-org.atlassian.net/browse/ICU-22127 “Update test skips if colon tailoring per ICU-22112 moves into UAX #29”

I think this means that whenever CLDR segmentation differs from TUS segmentation, we should have the Unicode Tools segmenter rules for CLDR reflect CLDR behavior, generate CLDR-specific test files, and have ICU test with those files -- rather than hack exceptions into the ICU test code.
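
To illustrate what "test with those files" involves, here is a minimal sketch of reading one data line of a `*BreakTest.txt` file and checking a segmenter against it. It assumes the usual layout of those files (`÷` for a break, `×` for no break, hexadecimal code points in between, `#` starting a comment). The names `parse_break_test_line`, `failing_cases` and `find_breaks` are hypothetical and do not correspond to anything in ICU's rbbitst.cpp.

```python
BREAK = "\u00f7"     # ÷ marks a break opportunity
NO_BREAK = "\u00d7"  # × marks a position with no break


def parse_break_test_line(line):
    """Parse one data line into (text, breaks).

    breaks[i] is True when a break is expected before text[i], and
    breaks[len(text)] describes the end of the text.
    Returns None for blank or comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    tokens = data.split()
    # Tokens alternate: marker, code point, marker, ..., code point, marker.
    markers, code_points = tokens[0::2], tokens[1::2]
    if len(markers) != len(code_points) + 1:
        raise ValueError(f"malformed test line: {line!r}")
    breaks = []
    for marker in markers:
        if marker not in (BREAK, NO_BREAK):
            raise ValueError(f"unexpected marker {marker!r} in {line!r}")
        breaks.append(marker == BREAK)
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    return text, breaks


def failing_cases(lines, find_breaks):
    """Yield (text, expected, actual) for each case the segmenter gets wrong.

    find_breaks is a hypothetical callable mapping a string to a list of
    booleans laid out like the parsed expectations.
    """
    for line in lines:
        parsed = parse_break_test_line(line)
        if parsed is None:
            continue
        text, expected = parsed
        actual = find_breaks(text)
        if actual != expected:
            yield text, expected, actual
```

With a CLDR-specific test file, the list produced by `failing_cases` would be expected to be empty for a CLDR-conformant segmenter, so no per-case skip list would be needed in the test code.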

eggrobin commented 4 months ago

> We do have that test file in ICU: https://github.com/unicode-org/icu/blob/main/icu4c/source/test/testdata/LineBreakTest.txt

That is the file from the UCD, not a CLDR version.

This one, generated by yours truly in the early afternoon of the 8th of August: https://github.com/unicode-org/unicodetools/blob/final-15.1-20230908/unicodetools/data/ucd/dev/auxiliary/LineBreakTest.txt.

> I am tempted to assign that to you :-P

Sure.

> We also have WordBreakTest exceptions there for https://unicode-org.atlassian.net/browse/ICU-22127 “Update test skips if colon tailoring per ICU-22112 moves into UAX #29”

That probably wants to be assigned to me as well.

> I think this means that whenever CLDR segmentation differs from TUS segmentation, we should have the Unicode Tools segmenter rules for CLDR reflect CLDR behavior, generate CLDR-specific test files, and have ICU test with those files -- rather than hack exceptions into the ICU test code.

That sounds somewhat reasonable; for ICU4X which has 15.0 without the numbers tailoring, I generated a LineBreakTest.txt from my fork of the tools (see https://github.com/unicode-org/icu4x/blob/7397d7d85ab4d87fef1d42760d3dfe6e88cd1391/components/segmenter/tests/testdata/LineBreakTest.txt), which seems to be the same approach.

But let’s cross that bridge when we get to it; for now we are aiming for convergence in 16.0.

Right now the 15.1 line breaking rules in SegmenterCldr.txt are used by no-one, and this is a good thing since they are wrong and untested (and we went through a release with these wrong rules!). Let’s get rid of that quasi-copy before someone gets hurt.

If we get back to a point where we need a CLDR version of the segmenter rules, they should be maintained as a diff instead of having incorrect copies of most of the rules (similar considerations apply to the many CSS variants in ICU; at least there the process calls for updating them by applying diffs of the root file, but that is quite painful).
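
As one possible shape for "maintained as a diff", the `##CLDR:` lines visible in the diff at the top of this issue could serve as the single source for the CLDR differences. The sketch below is an assumption about how such markers could be expanded, not a description of what the tools do: it treats a marked line as active content when it looks like a variable assignment or a numbered rule, and as a comment otherwise. The name `derive_cldr_rules` is hypothetical.

```python
import re

MARKER = "##CLDR: "
# Lines that look like active rule-file content: a variable assignment
# ("$Name=...") or a numbered rule ("20.09) ...").
ACTIVE_RE = re.compile(r"^(\$\w+\s*=|\d+(\.\d+)?\))")


def derive_cldr_rules(default_lines):
    """Expand ##CLDR: markers in the default rules into a CLDR variant."""
    derived = []
    for line in default_lines:
        if line.startswith(MARKER):
            body = line[len(MARKER):]
            # Activate rules and assignments; keep explanatory text as comments.
            derived.append(body if ACTIVE_RE.match(body) else "# " + body)
        else:
            derived.append(line)
    return derived


# Lines copied from the SegmenterDefault.txt side of the diff.
default_lines = [
    "$Spec4_=[^ $NU $CM]",
    "##CLDR: $Spec5_=[$BK $CB $CR $LF $NL $SP $ZW]",
    "##CLDR: Must be before LB 21. Note: this is not default UAX #14 behaviour. See ICU issue ICU-8151.",
    "##CLDR: 20.09) $Spec5_ $HY × $AL",
]

for line in derive_cldr_rules(default_lines):
    print(line)
```

Deriving the CLDR variant this way means the shared rules exist in one place only, so an error like the `$QU_Pf` one cannot appear in just one of the two files.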

macchiati commented 4 months ago

I would propose a bit of a policy change to help manage this.

- CLDR confines itself to property overrides, not rule changes.
  - E.g., for the purpose of word break, characters X and Y don't have the property wb:Hebrew_Letter in locales A & B.
- If CLDR requires a rule change, it is made in the UTC.
  - The rule can be a NOOP at the UTC level (e.g., not triggered by any property changes in the UCD), and only triggered by a property override in CLDR.

That way we don't have to deal with the complexities of rule changes. It would require agreement by CLDR to never make rule changes, and by the UTC to be open to NOOP rule additions.
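
A minimal sketch of the proposed split, with a fixed rule set and per-locale property overrides, might look as follows. All data here is made up for illustration: the characters `X`, `Y`, `Z`, the locales `A` and `B`, and the function names are hypothetical stand-ins, not real UCD or CLDR values.

```python
from collections import ChainMap

# Hypothetical base data standing in for UCD word-break property values.
BASE_WORD_BREAK = {"X": "Hebrew_Letter", "Y": "Hebrew_Letter", "Z": "ALetter"}

# Hypothetical per-locale overrides; only properties change, never rules.
LOCALE_OVERRIDES = {
    "A": {"X": "Other", "Y": "Other"},
    "B": {"X": "Other", "Y": "Other"},
}


def word_break_property(char, locale=None):
    """Look up a property value, letting the locale override the base data."""
    overrides = LOCALE_OVERRIDES.get(locale, {})
    return ChainMap(overrides, BASE_WORD_BREAK).get(char, "Other")


def rule_can_fire(trigger_value, locale=None):
    """Report whether any known character carries the value a rule keys on."""
    return any(
        word_break_property(char, locale) == trigger_value
        for char in BASE_WORD_BREAK
    )


print(word_break_property("X"))       # base value
print(word_break_property("X", "A"))  # overridden in locale A
print(rule_can_fire("Other"))         # no base character carries this value
print(rule_can_fire("Other", "A"))    # the override makes the rule reachable
```

In this toy data, a rule keyed on `Other` is a NOOP with the base properties and only becomes reachable once a locale override assigns that value to some character, which is the behavior described in the second bullet above.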

eggrobin commented 4 months ago

This seems reasonable in principle; but I will note that we do not publish the unicodetools version of the segmenter rules as part of the UCD, so a no-op there does not exist as far as the UTC is concerned; thus there is no need for anything to be decided at the UTC or even the PAG level; only the maintainers of the tools need to be OK with it.

Which means we can cross this bridge, too, when we get to it.

macchiati commented 4 months ago

even easier, then.

unicode-org / unicodetools

adjust cldr/*BreakTest generation for Unicode 15.1 #492