unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

FYI: Proposals for changes to rules Unicode Standard Annexes 14 and 29 #3255

Closed eggrobin closed 1 day ago

eggrobin commented 1 year ago

The Properties and Algorithms Group plans to recommend the following proposals to Unicode Technical Committee #‌175 later this month. If they are accepted, the changes would be published as part of Unicode Version 15.1, in September.

UAX #‌14:

UAX #‌29:

sffc commented 1 year ago

@makotokato @aethanyc

sffc commented 1 year ago

@aethanyc or @makotokato can you take this issue? Probably for 1.x Priority.

sffc commented 1 year ago

Discussion: Longer term, we would like the upstreamed TOML files to be updated along with the specification, so that ICU4X needs to do nothing more than pull in updates from upstream.

eggrobin commented 1 year ago

Looking at the TOML files, my impression is that they define a state machine driven one code point at a time: a [[tables]] record defines a transition from its left state to its name state when the next code point has the class right. The break at each step is then determined by the [[rules]] entry whose left matches the current state and whose right matches the class of the next code point, that is, with a lookahead of one code point.

The following new line breaking rules require more lookahead than that:

These require looking at two code points to the right of the (non-)break, plus any intervening CM (since these are after LB9).
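
A minimal sketch of that reading of the data, in Rust, to make the lookahead limit concrete. The field names `left`, `right`, and `name` come from the description above; the `Machine` type, the three toy classes, the `"OP_SP"` state, and the choice to allow a break when no rule matches are assumptions made for illustration, not the actual ICU4X schema or code.

```rust
use std::collections::HashMap;

/// Toy subset of line break classes (assumption: the real data covers all of UAX #14).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Class {
    Al,
    Op,
    Sp,
}

/// States are identified by name; a state named after a class means "just saw that class".
type State = &'static str;

fn default_state(class: Class) -> State {
    match class {
        Class::Al => "AL",
        Class::Op => "OP",
        Class::Sp => "SP",
    }
}

struct Machine {
    /// (left state, right class) -> name of the resulting state.
    tables: HashMap<(State, Class), State>,
    /// (left state, right class) -> whether a break is allowed between them.
    rules: HashMap<(State, Class), bool>,
}

impl Machine {
    /// Returns every index i such that a break is allowed before classes[i].
    fn break_indices(&self, classes: &[Class]) -> Vec<usize> {
        let mut breaks = Vec::new();
        let Some((&first, rest)) = classes.split_first() else {
            return breaks;
        };
        let mut state = default_state(first);
        for (i, &right) in rest.iter().enumerate() {
            // Only the current state and ONE class of lookahead are consulted.
            let allowed = self.rules.get(&(state, right)).copied().unwrap_or(true);
            if allowed {
                breaks.push(i + 1);
            }
            // Without a [[tables]]-style entry, fall back to the state named after `right`.
            state = self
                .tables
                .get(&(state, right))
                .copied()
                .unwrap_or(default_state(right));
        }
        breaks
    }
}

/// Hypothetical data: an LB14-like "OP SP* ×" plus a few simplified pair rules.
fn toy_machine() -> Machine {
    let mut tables = HashMap::new();
    let mut rules = HashMap::new();
    // "OP_SP" remembers that an OP was seen before a run of spaces.
    tables.insert(("OP", Class::Sp), "OP_SP");
    tables.insert(("OP_SP", Class::Sp), "OP_SP");
    for right in [Class::Al, Class::Op, Class::Sp] {
        rules.insert(("OP", right), false);
        rules.insert(("OP_SP", right), false);
    }
    rules.insert(("AL", Class::Sp), false); // no break before a space
    rules.insert(("SP", Class::Sp), false);
    rules.insert(("AL", Class::Al), false); // no break between letters
    Machine { tables, rules }
}
```

In this shape, `break_indices` decides each position from `state` and a single `right` class. A rule that depends on the second code point to the right has nothing to match against unless the loop itself is changed or the decision is deferred by one step.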

hsivonen commented 10 months ago

Gecko bug

eggrobin commented 10 months ago

Henri, this is interesting.

In your comment you correctly identified what LB15a and LB15b are trying to do, and why they need to do that (instead of treating Pi as LB=OP and Pf as LB=CL: that would mess with German, Finnish, etc. usage of Pf initially or Pi finally).

However, these new rules do not help with the Chinese issue at hand, since there are no spaces (there may visually appear to be space, but that is because U+2018 etc. have ambiguous width; here they are wide). This has recently come to the attention of the Properties and Algorithms Group of the UTC; it may be possible to do something about it in the ID QU ID case. I will mention that issue in that discussion. Nothing will happen on that front before Unicode 16.0 in September 2024 though.

aethanyc commented 3 months ago

We still need to update the line segmenter to Unicode 15.1. @makotokato is working on it.

eggrobin commented 3 months ago

I am experimenting with moving LB8a and LB9 into the code of the line segmenter, as

  1. the combination of these rules makes the state table extraordinarily painful to maintain (and it makes it large), as every state needs to be replicated: X ZWJ is different from X for most X since there is no break after ZWJ per LB8a, but X ZWJ CM brings you back to the X state, so the X ZWJ states cannot be merged;
  2. these rules cannot be tailored (so there is no reason to allow for custom data to change their behaviour), and are in practice reasonably stable: they last changed in Unicode 11 (2018), following up on some earlier Unicode 9 (2016) changes for emoji ZWJ sequences; contrast the other rules that have been changing wildly every year.
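
A rough sketch of what handling LB8a and LB9 in code could look like, in Rust. This is not the ICU4X implementation: the `Class` enum is a toy subset (the hard-break classes BK, CR, LF, NL are left out), and `table_allows_break` is a hypothetical stand-in for the data-driven table once LB8a and LB9 no longer need to be encoded in it. The loop keeps one base class plus a single "previous was ZWJ" flag instead of a separate X ZWJ state for every X.

```rust
/// Toy subset of line break classes; BK, CR, LF, NL are omitted for brevity.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    Al,
    Id,
    Cm,
    Zwj,
    Sp,
    Zw,
}

/// Hypothetical stand-in for the data-driven table, with LB8a and LB9 removed from it.
/// It is only ever called with resolved classes, never with Cm or Zwj.
fn table_allows_break(left: Class, right: Class) -> bool {
    use Class::*;
    match (left, right) {
        (_, Sp) | (_, Zw) => false, // no break before SP or ZW
        (Al, Al) => false,          // no break between letters
        _ => true,                  // otherwise allow a break
    }
}

fn is_cm_or_zwj(class: Class) -> bool {
    matches!(class, Class::Cm | Class::Zwj)
}

/// Returns every index i such that a break is allowed before classes[i].
fn break_indices(classes: &[Class]) -> Vec<usize> {
    let mut breaks = Vec::new();
    let Some((&first, rest)) = classes.split_first() else {
        return breaks;
    };
    // LB10: a CM or ZWJ with no base to attach to is treated as AL.
    let mut base = if is_cm_or_zwj(first) { Class::Al } else { first };
    let mut prev_is_zwj = first == Class::Zwj;
    for (i, &cur) in rest.iter().enumerate() {
        // LB9 does not apply after SP or ZW (nor after the omitted hard-break classes).
        let absorbs = !matches!(base, Class::Sp | Class::Zw);
        if is_cm_or_zwj(cur) && absorbs {
            // LB9: X (CM | ZWJ)* is treated as X, so no break here and `base` is unchanged.
        } else {
            let right = if is_cm_or_zwj(cur) { Class::Al } else { cur };
            // LB8a: never break after ZWJ; otherwise defer to the table.
            if !prev_is_zwj && table_allows_break(base, right) {
                breaks.push(i + 1);
            }
            base = right;
        }
        // X ZWJ CM clears the flag again, which brings us back to plain X.
        prev_is_zwj = cur == Class::Zwj;
    }
    breaks
}
```

With this toy table, `[Id, Zwj, Id]` yields no break opportunity, while `[Id, Zwj, Cm, Id]` allows one before the final `Id`, which matches the observation that X ZWJ CM behaves like X again.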