unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Support Unicode 15.1 #124

Closed: syvb closed this 11 months ago

syvb commented 11 months ago

Adds Unicode 15.1 support.

Updating tests

It turns out scripts/unicode_gen_breaktests.py was last run for Unicode 11; every update since then skipped it. I updated the GitHub Action that checks that scripts/unicode.py was run so that it also checks that scripts/unicode_gen_breaktests.py was run.
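
For illustration, a "was the generator run?" check can be sketched as: snapshot the committed outputs, re-run the generator, and report any file whose contents changed. This is a minimal sketch and not the actual GitHub Action from this PR; the helper name `stale_files` and its parameters are hypothetical.

```python
import subprocess
from pathlib import Path


def _snapshot(root, paths):
    """Read each tracked file's bytes (None if the file does not exist)."""
    return {
        p: (root / p).read_bytes() if (root / p).exists() else None
        for p in paths
    }


def stale_files(generator_cmd, tracked_outputs, repo_root="."):
    """Hypothetical helper: re-run a generator and list outputs that changed.

    generator_cmd   -- argv list for the generator script (an assumption)
    tracked_outputs -- paths, relative to repo_root, that the generator writes
    """
    root = Path(repo_root)
    before = _snapshot(root, tracked_outputs)
    # check=True makes a failing generator raise instead of passing silently.
    subprocess.run(generator_cmd, cwd=root, check=True)
    after = _snapshot(root, tracked_outputs)
    # Any difference means the committed file was not regenerated.
    return [p for p in tracked_outputs if before[p] != after[p]]
```

A CI job could call this once per generator script and fail when the returned list is non-empty. Which files each script writes is left to the caller here, since the PR description does not name them.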

Devanagari mis-segmentation

There are a few cases where Devanagari grapheme segmentation fails after updating the test data from Unicode 11 to Unicode 15. I just skipped those failing tests for now.

syvb commented 11 months ago

I originally described a categorization issue with ۝. It turns out the Unicode data files are correct; I was just using outdated ones. Oops. I kept the tests that verify that ۝ (and the Syriac abbreviation mark) are categorized correctly.
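
For readers unfamiliar with the two characters, here is a small sketch that identifies them using Python's standard `unicodedata` module. It only looks up the character name and General_Category; it does not check the UAX #29 break properties that the kept tests cover, and the helper name `describe` is hypothetical.

```python
import unicodedata

# The two characters mentioned above: ۝ and the Syriac abbreviation mark.
MARKS = ["\u06dd", "\u070f"]


def describe(ch):
    """Return (code point, Unicode name, General_Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for mark in MARKS:
    print(describe(mark))
```

Both are format characters (General_Category `Cf`), which may be part of why they are easy to miscategorize when working from stale data files.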