unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Incorrect definition of SpacingMark used #108

Closed syvb closed 3 months ago

syvb commented 2 years ago

UAX #29 defines SpacingMark as:

Grapheme_Cluster_Break ≠ Extend, and General_Category = Spacing_Mark, or any of the following (which have General_Category = Other_Letter):
U+0E33 ( ำ ) THAI CHARACTER SARA AM
U+0EB3 ( ຳ ) LAO VOWEL SIGN AM

Exceptions: The following (which have General_Category = Spacing_Mark and would otherwise be included) are specifically excluded: [24 exception characters]

In this crate's implementation of rule GB9a, only the "General_Category = Spacing_Mark" part is checked. The crate does not check that Grapheme_Cluster_Break ≠ Extend, and it implements neither the 24 exclusions nor the 2 inclusions. The impact is very minor, though, since it affects only a small set of characters, and only in extended mode.

(originally noted in #107)

Jules-Bertholet commented 3 months ago

There's no issue here. GC in the code does not stand for "General Category", but for GraphemeCat. The categories are read from https://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt, which already contains the right values, with the exclusions and inclusions from the UAX #29 definition applied. This can be closed.
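
To illustrate the point, here is a minimal sketch of reading Grapheme_Cluster_Break values from lines in the GraphemeBreakProperty.txt format. It is not this crate's actual table generator: the names `GraphemeCat`, `SAMPLE`, `parse_table`, and `grapheme_cat` are hypothetical, and the sample lines are a hand-copied excerpt (whitespace simplified) that is assumed to match the published file.

```rust
/// Grapheme_Cluster_Break values used in this sketch (a small subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GraphemeCat {
    Extend,
    SpacingMark,
    /// Default for code points not listed in the table.
    Any,
}

/// Excerpt-style lines in the GraphemeBreakProperty.txt format.
/// Assumption: copied by hand for illustration, not the full file.
pub const SAMPLE: &str = "\
0300..036F    ; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0903          ; SpacingMark # Mc       DEVANAGARI SIGN VISARGA
09BE          ; Extend # Mc       BENGALI VOWEL SIGN AA
0E33          ; SpacingMark # Lo       THAI CHARACTER SARA AM
0EB3          ; SpacingMark # Lo       LAO VOWEL SIGN AM
";

/// Parse `range ; property # comment` lines into (low, high, category) entries.
pub fn parse_table(data: &str) -> Vec<(u32, u32, GraphemeCat)> {
    let mut table = Vec::new();
    for line in data.lines() {
        // Drop the trailing comment, then skip blank lines.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let (Some(range), Some(prop)) = (fields.next(), fields.next()) else {
            continue;
        };
        // Only the two properties relevant to this discussion are kept.
        let cat = match prop {
            "Extend" => GraphemeCat::Extend,
            "SpacingMark" => GraphemeCat::SpacingMark,
            _ => continue,
        };
        // A field is either a single code point or `low..high`.
        let (lo, hi) = match range.split_once("..") {
            Some((lo, hi)) => (lo, hi),
            None => (range, range),
        };
        if let (Ok(lo), Ok(hi)) = (u32::from_str_radix(lo, 16), u32::from_str_radix(hi, 16)) {
            table.push((lo, hi, cat));
        }
    }
    table
}

/// Look up the category of `c`, falling back to `Any` when it is not listed.
pub fn grapheme_cat(table: &[(u32, u32, GraphemeCat)], c: char) -> GraphemeCat {
    let cp = c as u32;
    table
        .iter()
        .find(|&&(lo, hi, _)| lo <= cp && cp <= hi)
        .map(|&(_, _, cat)| cat)
        .unwrap_or(GraphemeCat::Any)
}
```

With this table, `grapheme_cat(&parse_table(SAMPLE), '\u{0E33}')` yields `GraphemeCat::SpacingMark` even though U+0E33 has General_Category = Other_Letter, and U+09BE comes out as `GraphemeCat::Extend` even though its General_Category is Spacing_Mark. No separate General_Category check is needed, because the data file has already applied both adjustments.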