orling / grapheme-splitter

A JavaScript library that breaks strings into their individual user-perceived characters.
MIT License

Thai character sara am (U+0E33) treated as a combining character #9

Closed: rideaddict closed this issue 6 years ago

rideaddict commented 6 years ago

When I split a Thai string that contains this character, it is combined with the previous character. According to the documentation, 'This character interacts typographically with the preceding consonant, but is not classed as a combining character.'

orling commented 6 years ago

According to UAX #29 (Unicode Text Segmentation), this is the correct behaviour. Rule GB9a states "Do not break before SpacingMarks": https://unicode.org/reports/tr29/#GB9a

and this character is categorized as SpacingMark:

https://codepoints.net/U+0E33?lang=en "The Grapheme Cluster Break is SpacingMark."
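
To make the rule concrete, here is a minimal sketch of GB9a (together with GB9, "do not break before Extend"). It is not grapheme-splitter's actual code: `splitClustersSketch` and the two hand-written property sets are hypothetical and cover only the code points used in the example, whereas a real implementation reads the full Grapheme_Cluster_Break data.

```javascript
// Sketch only: a tiny, hand-written subset of the Grapheme_Cluster_Break table.
// Assumption: these sets cover just the code points used in the examples below.
const SPACING_MARK = new Set([0x0e33, 0x0eb3]); // THAI SARA AM, LAO VOWEL SIGN AM
const EXTEND = new Set([0x0e48, 0x0e49]);       // THAI MAI EK, THAI MAI THO

// Hypothetical helper: applies only "no break before Extend / SpacingMark".
function splitClustersSketch(text) {
  const clusters = [];
  for (const ch of text) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const attaches = SPACING_MARK.has(cp) || EXTEND.has(cp);
    if (attaches && clusters.length > 0) {
      clusters[clusters.length - 1] += ch; // stay in the previous cluster
    } else {
      clusters.push(ch); // start a new cluster
    }
  }
  return clusters;
}

// NO NU + MAI THO + SARA AM: everything attaches to the consonant.
console.log(splitClustersSketch("\u0E19\u0E49\u0E33").length); // 1

// THO THAHAN + SARA AM, NGO NGU, SARA AA, NO NU: only SARA AM attaches.
console.log(splitClustersSketch("\u0E17\u0E33\u0E07\u0E32\u0E19").length); // 4
```

Under this rule, sara am never starts a cluster of its own when a character precedes it, which is the behaviour reported in this issue.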

Its rendering also shows that it is not really a stand-alone character: the dotted circle marks where the missing base consonant would go. Splitting it off would look just as bad as a combining mark without its letter. It may not be classified as a combining character because it is composed of two other characters.
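
The "composed of two other characters" part can be checked with the built-in `String.prototype.normalize`: U+0E33 has a compatibility decomposition into U+0E4D (NIKHAHIT) followed by U+0E32 (SARA AA). This is only a quick check of the character data and says nothing about how grapheme-splitter works internally; the `toHex` helper is introduced here for illustration.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function toHex(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const saraAm = "\u0E33";

// Compatibility decomposition splits it into NIKHAHIT + SARA AA.
console.log(toHex(saraAm.normalize("NFKD"))); // [ 'U+0E4D', 'U+0E32' ]

// Canonical decomposition leaves it as a single code point.
console.log(toHex(saraAm.normalize("NFD"))); // [ 'U+0E33' ]
```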