Closed — Lucretiel closed this issue 3 years ago.
This crate implements the Default Word Boundary Specification from UAX29, which states:
The following is a general specification for word boundaries—language-specific rules in [CLDR] should be used where available.
and
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
It appears that the breaks.jsp tool linked above uses CLDR for segmentation. (It's linked from the CLDR site here.) According to this page, it sounds like CLDR always uses dictionary-based breaking for words in CJK scripts, including Hangul. Unfortunately, I don't know of an equivalent implementation in Rust.
Huh. For some reason I had understood that CLDR was primarily for locale-sensitive text handling, and the demo site included no prompt to select language-specific rules.
I thought so too, and I'm not 100% sure what algorithm or settings that demo site is using.
However, it appears that the unicode-segmentation crate implements the default word boundary spec correctly. Both c and 를 have Word_Break = ALetter, and the spec says not to break between two ALetter characters.
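To make the rule concrete, here is a minimal toy sketch (my own illustration, not the crate's implementation) that applies only rule WB5 of UAX #29 ("do not break between ALetter characters"), with a hand-rolled ALetter approximation covering just ASCII letters and precomposed Hangul syllables (U+AC00..=U+D7A3); the real crate uses the full Word_Break property tables:

```rust
// Toy ALetter check: ASCII letters plus precomposed Hangul syllables.
// (Approximation for illustration; the real property covers far more.)
fn is_aletter(c: char) -> bool {
    c.is_ascii_alphabetic() || ('\u{AC00}'..='\u{D7A3}').contains(&c)
}

// Break everywhere except between two ALetter characters (rule WB5 only).
fn toy_word_bounds(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in s.char_indices() {
        if let Some(p) = prev {
            if !(is_aletter(p) && is_aletter(c)) {
                out.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

fn main() {
    // "abc" and "를" are all ALetter, so WB5 keeps them in one segment:
    assert_eq!(toy_word_bounds(" abc를 "), [" ", "abc를", " "]);
    println!("{:?}", toy_word_bounds(" abc를 "));
}
```

Even this stripped-down version reproduces the behavior observed in the issue: `" abc를 "` yields three segments, with `"abc를"` kept together.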
Ah, that's what I was looking for; I was having trouble conclusively determining whether 를 is an ideograph (since ALetter explicitly excludes ideographic characters). That's otherwise consistent with my reading of the specification; thanks for your help!
Our implementation is correct here. What's happening is that UAX 29 allows implementations to diverge from the algorithm in ways that affect degenerate cases like this one.
Hangul is not an ideographic or logographic writing system.
Do we have something to support line breaks in UAX 14 https://www.unicode.org/reports/tr14/?
https://github.com/withoutboats/heck/pull/28#issuecomment-787462475
Where "Hello, world. 你好,世界!" becomes "Hello,", "world.", "你好,", "世界!"? The punctuation isn't necessary. And even if a Han character is adjacent to an English character, it would still be separated, so "一a二" becomes "一", "a", "二"? Or maybe this is out of scope for this project? I thought that since the name is "unicode-segmentation", it should include something like this.
Do we have something to support line breaks in UAX 14 https://www.unicode.org/reports/tr14/?
Servo uses xi-unicode for UAX 14 line breaking. There is also a unicode-linebreak crate.
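To show roughly what UAX #14 break opportunities look like for the strings in the question, here is a toy sketch (my own simplification, not how xi-unicode or unicode-linebreak are implemented): CJK ideographs (approximated as U+4E00..=U+9FFF, line-break class ID) permit a break before and after each one, a break is allowed after a space, and closing punctuation forbids a break before it. Note that, unlike the word-based grouping guessed at above, real UAX #14 also allows a break between the two ideographs of 你好:

```rust
// Toy ID-class check: CJK Unified Ideographs block only (approximation).
fn is_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

// Rough stand-in for "no break before closing punctuation".
fn no_break_before(c: char) -> bool {
    matches!(c, ',' | '.' | '!' | '?' | '，' | '。' | '！')
}

// Split at every toy break opportunity: after a space, or adjacent to an
// ideograph, unless the next character forbids a break before it.
fn toy_break_chunks(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in s.char_indices() {
        if let Some(p) = prev {
            let opportunity = p == ' ' || is_ideograph(p) || is_ideograph(c);
            if opportunity && !no_break_before(c) && c != ' ' {
                out.push(&s[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

fn main() {
    assert_eq!(toy_break_chunks("一a二"), ["一", "a", "二"]);
    // Spaces attach to the preceding chunk, and each ideograph is breakable:
    println!("{:?}", toy_break_chunks("Hello, world. 你好,世界!"));
}
```

For "一a二" this gives exactly the "一", "a", "二" split asked about; for the longer sentence it also breaks between 你 and 好, which is the correct UAX #14 behavior for ideographs.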
I don't think this is related to the current issue. Please open a new issue if there are more questions about this.
@Lucretiel What you said seems correct.
Normally word breaking does not require breaking between different scripts. However, adding that capability may be useful in combination with other extensions of word segmentation. For example, in Korean the sentence “I live in Chicago.” is written as three segments delimited by spaces:
나는 Chicago에 산다.
According to Korean standards, the grammatical suffixes, such as “에” meaning “in”, are considered separate words. Thus the above sentence would be broken into the following five words:
나, 는, Chicago, 에, and 산다.
Separating the first two words requires a dictionary lookup, but for Latin text (“Chicago”) the separation is trivial based on the script boundary.
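The quoted passage makes two distinct claims: splitting 나 from 는 requires a dictionary, while splitting "Chicago" from "에" is trivial at the script boundary. A minimal sketch of the trivial half (my own illustration, not part of any crate) splits wherever the script classification changes; note it necessarily keeps "나는" as one piece, since that split is the dictionary-dependent part:

```rust
// Coarse script classes for this example only: precomposed Hangul
// syllables, ASCII Latin letters, and everything else.
#[derive(Clone, Copy, PartialEq)]
enum Script {
    Hangul,
    Latin,
    Other,
}

fn script_of(c: char) -> Script {
    if ('\u{AC00}'..='\u{D7A3}').contains(&c) {
        Script::Hangul
    } else if c.is_ascii_alphabetic() {
        Script::Latin
    } else {
        Script::Other
    }
}

// Split the string wherever adjacent characters belong to different scripts.
fn split_on_script_change(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<Script> = None;
    for (i, c) in s.char_indices() {
        let sc = script_of(c);
        if prev.is_some() && prev != Some(sc) {
            out.push(&s[start..i]);
            start = i;
        }
        prev = Some(sc);
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

fn main() {
    // "Chicago" and "에" separate cleanly; "나는" stays together because
    // that split needs a dictionary, not a script boundary.
    assert_eq!(
        split_on_script_change("나는 Chicago에 산다."),
        ["나는", " ", "Chicago", "에", " ", "산다", "."]
    );
}
```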
Consider this string:
" abc를 "
According to Unicode's demo implementation of word segmentation, I'd expect this to be split into 4 words: " ", "abc", "를", and " ". However, the observed behavior (playground) is that it only splits into 3 words; "abc를" is grouped together.