unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Word segmentation is incorrect #5015

Open robertbastian opened 5 months ago

robertbastian commented 5 months ago

WB3c and WB4 interact in the same way LB8a and LB9 do. A correct implementation of that would require either duplicating every state as in https://github.com/unicode-org/icu4x/pull/4389, or hoisting the two rules into the logic as in https://github.com/unicode-org/icu4x/pull/5001.

The latter seems more attractive, both for data size and sanity of the maintainer; note that since rule_segmenter.rs is shared with extended grapheme cluster and sentence breaking, this will require passing a flag for that logic.
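To illustrate what hoisting the two rules into the logic could look like, here is a minimal sketch. It is not ICU4X code: the `Class` enum, `table_allows_break`, `break_offsets` and the `hoist_word_rules` flag are all hypothetical names, the class set is a tiny subset of UAX #29, and the "table" only encodes WB5. The point is only that a "previous code point was ZWJ" bit can live next to the table lookup instead of being multiplied into every state.

```rust
/// Toy word-break classes (hypothetical; UAX #29 defines many more).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    ALetter,
    ExtPict,
    Extend,
    Zwj,
    Other,
}

/// Stand-in for the data-driven rule table. It only knows WB5
/// (ALetter × ALetter); every other pair breaks (WB999).
fn table_allows_break(left: Class, right: Class) -> bool {
    !(left == Class::ALetter && right == Class::ALetter)
}

/// Returns the indices `i` such that there is a break before `text[i]`
/// (start and end of text are not reported).
///
/// `hoist_word_rules` is the hypothetical flag: a segmenter sharing this
/// loop for another break type would pass `false` and get plain
/// pairwise table lookups.
fn break_offsets(text: &[Class], hoist_word_rules: bool) -> Vec<usize> {
    let mut breaks = Vec::new();
    // Left-hand class as seen by the table, after WB4 has absorbed
    // any trailing Extend/ZWJ into it.
    let mut left: Option<Class> = None;
    // Extra bit kept outside the table: was the immediately preceding
    // code point a ZWJ? WB4 alone would throw this information away.
    let mut prev_is_zwj = false;

    for (i, &c) in text.iter().enumerate() {
        if let Some(l) = left {
            if hoist_word_rules && matches!(c, Class::Extend | Class::Zwj) {
                // WB4 (simplified): X (Extend | ZWJ)* is treated as X,
                // so no break here and `left` is unchanged.
            } else if hoist_word_rules && prev_is_zwj && c == Class::ExtPict {
                // WB3c: ZWJ × Extended_Pictographic, checked before the
                // table sees the WB4-collapsed left-hand side.
                left = Some(c);
            } else {
                if table_allows_break(l, c) {
                    breaks.push(i);
                }
                left = Some(c);
            }
        } else {
            // First code point: nothing to its left to compare against.
            left = Some(c);
        }
        prev_is_zwj = c == Class::Zwj;
    }
    breaks
}
```

With the flag set, `[ALetter, Zwj, ExtPict]` yields no breaks (WB3c wins even though WB4 has folded the ZWJ into the letter), while `[ALetter, Zwj, Extend, ExtPict]` breaks before the pictograph because the ZWJ is no longer immediately adjacent. The cost of this approach is one boolean of extra state in the shared loop rather than a second copy of every table state.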

sffc commented 2 months ago

@eggrobin What is left on this issue?

eggrobin commented 2 months ago

> What is left on this issue?

All of it? It was created to allow us to close the specific issue reported in https://github.com/unicode-org/icu4x/issues/4417, but word segmentation is still wrong and hasn’t changed since this was filed.