unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Segmentation of combined emojis #42

Closed by RazrFalcon 6 years ago

RazrFalcon commented 6 years ago
use unicode_segmentation::UnicodeSegmentation;
fn main() {
    for c in UnicodeSegmentation::graphemes("🏳️‍🌈", true) { // true = extended grapheme clusters
        println!("{}", c);
    }
}

Outputs:

🏳️‍
🌈


But should output:

🏳️‍🌈


Another example: 👮‍♀.

Is this a bug in UnicodeSegmentation, or am I doing something wrong? For my current task, this should be a single "character".

Manishearth commented 6 years ago

We're operating off an old Unicode version (9) where that's not in the tables.

https://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakProperty.txt

Filed https://github.com/unicode-rs/unicode-segmentation/issues/43

That may take a while to fix, but it may be worth updating to Unicode 10 in the interim. That is an easier update than going from 10 to 11, and it would also fix your issue.
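
For context on why the tables matter here, it may help to look at what the string actually contains. The sketch below uses only the standard library (not this crate) and prints the scalar values of the rainbow flag. The helper name `code_points` is made up for illustration. The split reported above falls right after U+200D (ZWJ), which is consistent with U+1F308 not being listed as a character that glues to a preceding ZWJ in the Unicode 9 data.

```rust
/// Hypothetical helper: formats each Unicode scalar value of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Rainbow flag, written with escapes so the sequence is unambiguous:
    // white flag, variation selector-16, zero width joiner, rainbow.
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
    for cp in code_points(flag) {
        println!("{}", cp);
    }
}
```

This prints U+1F3F3, U+FE0F, U+200D and U+1F308, one per line. Whether those four scalar values form one grapheme cluster depends entirely on the break property data the segmenter was built from.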