unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Apparent bug in word splitting with Hangul character #90

Closed Lucretiel closed 3 years ago

Lucretiel commented 3 years ago

Consider this string:

" abc를 "

According to Unicode's demo implementation of word segmentation, I'd expect this to be split into 4 words: " ", "abc", "를", and " ". However, the observed behavior (playground) is that it only splits into 3 words; the "abc를" is grouped together.

mbrubeck commented 3 years ago

This crate implements the Default Word Boundary Specification from UAX29, which states:

The following is a general specification for word boundaries—language-specific rules in [CLDR] should be used where available.

and

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

It appears that the breaks.jsp tool linked above uses CLDR for segmentation. (It's linked from the CLDR site here.) According to this page, it sounds like CLDR always uses dictionary-based breaking for words in CJK scripts, including Hangul. Unfortunately, I don't know of any equivalent implementation in Rust.

Lucretiel commented 3 years ago

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

mbrubeck commented 3 years ago

Huh, for some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.

I thought so too, and I'm not 100% sure what algorithm or settings that demo site is using.

However, it appears that the unicode-segmentation crate implements the default word boundary spec correctly. Both c and 를 have Word_Break = ALetter, and the spec says not to break between two ALetter characters.
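To illustrate why the default rules keep "abc를" together, here is a hedged, self-contained sketch of UAX #29 rule WB5 (do not break between two ALetter characters). This is not the crate's implementation; the classifier below is drastically simplified and only covers the characters in this example:

```rust
// Illustrative sketch of UAX #29 rule WB5, not the crate's implementation.
// ASCII letters and precomposed Hangul syllables (U+AC00..U+D7A3) both carry
// Word_Break = ALetter in WordBreakProperty.txt.
fn is_aletter(c: char) -> bool {
    c.is_ascii_alphabetic() || ('\u{AC00}'..='\u{D7A3}').contains(&c)
}

fn split_words_sketch(s: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    let mut prev: Option<char> = None;
    for c in s.chars() {
        if let Some(p) = prev {
            // WB5: keep ALetter x ALetter together; break elsewhere (WB999).
            if !(is_aletter(p) && is_aletter(c)) {
                out.push(std::mem::take(&mut cur));
            }
        }
        cur.push(c);
        prev = Some(c);
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}

fn main() {
    // "abc를" stays one segment because 'c' and '를' are both ALetter.
    assert_eq!(split_words_sketch(" abc를 "), vec![" ", "abc를", " "]);
    println!("{:?}", split_words_sketch(" abc를 "));
}
```

Since 'c' and '를' are both ALetter, WB5 forbids a boundary between them, which matches the crate's output.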

Lucretiel commented 3 years ago

Ah, that's what I was looking for; I was having trouble finding out conclusively if 를 is an ideograph (since ALetter explicitly rejects ideograph characters). That's otherwise consistent with my reading of the specification; thanks for your help!

Manishearth commented 3 years ago

Our implementation is correct here; what's happening is that UAX 29 allows implementations to diverge from the default algorithm in ways that affect degenerate cases like this one.

Hangul is not an ideographic or logographic writing system.

pickfire commented 3 years ago

Do we have something to support line breaks in UAX 14 https://www.unicode.org/reports/tr14/?

https://github.com/withoutboats/heck/pull/28#issuecomment-787462475

Where Hello, world. 你好,世界! becomes Hello, world. 你好, 世界!? The punctuation is not necessary. And even if a Han character is stuck to an English character, it would be separated, like 一a二 becomes 一 a 二? Or maybe this is out of the scope of this project? I thought since the name is "unicode-segmentation" it should have something like this.

mbrubeck commented 3 years ago

Do we have something to support line breaks in UAX 14 https://www.unicode.org/reports/tr14/?

Servo uses xi-unicode for UAX 14 line breaking. There is also a unicode-linebreak crate.

I don't think this is related to the current issue. Please open a new issue if there are more questions about this.

pickfire commented 3 years ago

@Lucretiel What you said seemed correct.

Normally word breaking does not require breaking between different scripts. However, adding that capability may be useful in combination with other extensions of word segmentation. For example, in Korean the sentence “I live in Chicago.” is written as three segments delimited by spaces:

나는  Chicago에  산다.

According to Korean standards, the grammatical suffixes, such as “에” meaning “in”, are considered separate words. Thus the above sentence would be broken into the following five words:

나,  는,  Chicago,  에, and  산다.

Separating the first two words requires a dictionary lookup, but for Latin text (“Chicago”) the separation is trivial based on the script boundary.
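The two halves of that observation can be sketched in code. The following is a hedged, self-contained illustration (not part of the crate): splitting at script runs alone is enough to separate "Chicago" from the suffix "에", because that is a script boundary, but it cannot split "나" from "는" (both Hangul), which needs the dictionary lookup the spec describes:

```rust
// Hedged sketch: script-run splitting only. The Script enum and ranges below
// are simplified for this example and are not a full Unicode Script classifier.
#[derive(PartialEq, Clone, Copy)]
enum Script {
    Hangul,
    Latin,
    Other,
}

fn script_of(c: char) -> Script {
    if ('\u{AC00}'..='\u{D7A3}').contains(&c) {
        Script::Hangul // precomposed Hangul syllables
    } else if c.is_ascii_alphabetic() {
        Script::Latin
    } else {
        Script::Other // spaces, punctuation, everything else
    }
}

fn split_script_runs(s: &str) -> Vec<String> {
    let mut runs: Vec<String> = Vec::new();
    let mut prev: Option<Script> = None;
    for c in s.chars() {
        let sc = script_of(c);
        if prev == Some(sc) {
            runs.last_mut().unwrap().push(c); // extend the current run
        } else {
            runs.push(c.to_string()); // script changed: start a new run
        }
        prev = Some(sc);
    }
    runs
}

fn main() {
    let runs = split_script_runs("나는 Chicago에 산다.");
    // "Chicago" and "에" separate at the script boundary; "나는" does not split.
    assert_eq!(runs, vec!["나는", " ", "Chicago", "에", " ", "산다", "."]);
    println!("{:?}", runs);
}
```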