unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Grapheme cluster iterator returns non graphical characters #116

Closed DanielBauman88 closed 1 year ago

DanielBauman88 commented 1 year ago

Here's a playground example with two Unicode control characters.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b41c4d54ad849f1c19b6743925ec96f8

The iterator has 2 elements, even though neither control character is a grapheme-cluster as I understand it.

Is the implementation supposed to fall back to iterating over code points when the code points are not part of a grapheme cluster?

Manishearth commented 1 year ago

This implementation follows the specification, and the specification inserts breaks between control characters.

> even though neither control character is a grapheme-cluster as I understand it

No, they are grapheme clusters at the Unicode level. They're not "user-perceived characters", but they do not become "part of" nearby user-perceived characters either, so they get breaks around them.

There are a ton of different ways one may define a user-perceived character; the Unicode spec picks something that gives a reasonable answer for most use cases. I recommend not relying too much on intuition as to what is and isn't a grapheme cluster unless you know the specification.
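
To make the "breaks around control characters" behavior concrete, here is a minimal std-only sketch of a few UAX #29 rules. It is a toy model, not this crate's implementation: `toy_clusters` and `is_extend` are hypothetical names, `char::is_control` only approximates the spec's Control property, and the Extend set is assumed to be just U+0300..=U+036F.

```rust
/// Toy segmenter modelling only a handful of UAX #29 rules.
/// Assumption: this is an illustration, not how the crate works internally.
fn toy_clusters(s: &str) -> Vec<&str> {
    // Hypothetical simplification: only combining diacritics count as "Extend".
    fn is_extend(c: char) -> bool {
        ('\u{0300}'..='\u{036F}').contains(&c)
    }

    let mut clusters = Vec::new();
    let mut start = 0; // byte offset where the current cluster begins
    let mut prev: Option<char> = None;

    for (idx, c) in s.char_indices() {
        if let Some(p) = prev {
            let keep_together = if p == '\r' && c == '\n' {
                true // GB3: never break between CR and LF
            } else if p.is_control() || c.is_control() {
                false // GB4 / GB5: break after and before a control
            } else {
                is_extend(c) // simplified GB9; anything else breaks (GB999)
            };
            if !keep_together {
                clusters.push(&s[start..idx]);
                start = idx;
            }
        }
        prev = Some(c);
    }
    if !s.is_empty() {
        clusters.push(&s[start..]);
    }
    clusters
}

fn main() {
    // Two control characters: each one ends up as its own cluster.
    println!("{:?}", toy_clusters("\u{1}\u{2}"));
    // A control between letters, with a combining accent attached to the 'e'.
    println!("{:?}", toy_clusters("a\u{1}e\u{301}"));
}
```

Under these assumptions the first input yields two clusters and the second yields three: the control character neither joins the preceding `a` nor absorbs the following `e`, while the combining accent stays attached to its base letter.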

DanielBauman88 commented 1 year ago

I see, thanks for the explanation!