Closed aethanyc closed 3 years ago
@makotokato Could you check whether these properties are sufficient for UAX29?
We need the following Unicode properties:
* Grapheme_Cluster_Break
* Sentence_Break
* Word_Break
* Extended_Pictographic for [GB11](https://www.unicode.org/reports/tr29/#GB11)
For UAX #29 as specified, these are sufficient. But to implement the word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.
For East Asian languages, we can recognize `SA` in the line break property, and `CJ` will be treated as `ID`. We will also need some properties from line break or elsewhere.
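To make the `SA` / `CJ` point concrete, here is a rough sketch of how a segmenter might consult line break classes. Everything in it is hypothetical: `LineBreak`, `toy_line_break`, `resolve_class`, and `needs_dictionary` are illustrative names, not ICU4X API, and the lookup covers only a handful of code points instead of reading real Unicode data.

```rust
/// A tiny subset of UAX #14 Line_Break classes, just enough for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LineBreak {
    /// SA: complex-context scripts such as Thai.
    ComplexContext,
    /// CJ: Conditional Japanese Starter (small kana).
    ConditionalJapaneseStarter,
    /// ID: Ideographic.
    Ideographic,
    /// Every other class, collapsed for brevity.
    Other,
}

/// Hypothetical toy lookup. A real implementation would read Line_Break
/// from Unicode data instead of hard-coding a few ranges.
fn toy_line_break(c: char) -> LineBreak {
    match c {
        // Thai consonants KO KAI..HO NOKHUK
        '\u{0E01}'..='\u{0E2E}' => LineBreak::ComplexContext,
        // Small hiragana a, i, u, e, o
        '\u{3041}' | '\u{3043}' | '\u{3045}' | '\u{3047}' | '\u{3049}' => {
            LineBreak::ConditionalJapaneseStarter
        }
        // CJK Unified Ideographs block
        '\u{4E00}'..='\u{9FFF}' => LineBreak::Ideographic,
        _ => LineBreak::Other,
    }
}

/// Fold CJ into ID, as suggested in the comment above.
fn resolve_class(lb: LineBreak) -> LineBreak {
    match lb {
        LineBreak::ConditionalJapaneseStarter => LineBreak::Ideographic,
        other => other,
    }
}

/// SA characters cannot be segmented by rules alone, so this sketch
/// flags them as needing a dictionary (or similar) engine.
fn needs_dictionary(lb: LineBreak) -> bool {
    lb == LineBreak::ComplexContext
}
```

The intent is only to show the shape of the check: look up the class, fold `CJ` into `ID`, and flag `SA` for special handling.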
I assume we need to query a code point's `Script` to switch to a different language break engine, like `ICULanguageBreakFactory::loadEngineFor()`.
Luckily, ICU4X has already implemented the `Script` property. Here is a test of the script CodePointTrie, but I think @echeran is documenting a nicer API, `icu_properties::maps::get_script()`, in #1204.
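A minimal sketch of that script-based dispatch, under loud assumptions: `Script`, `Engine`, `toy_script`, and `select_engine` are hypothetical names invented here, the ranges are a rough approximation of the real `Script` property, and the set of engines is illustrative, not what ICU or ICU4X actually ships. In real code the lookup would come from the ICU4X `Script` data instead of `toy_script`.

```rust
/// Hypothetical subset of Unicode scripts relevant to this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Thai,
    Lao,
    Khmer,
    Han,
    Hiragana,
    Katakana,
    Other,
}

/// Hypothetical break engines a segmenter could switch between.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine {
    ThaiDictionary,
    LaoDictionary,
    KhmerDictionary,
    CjDictionary,
    RuleBased,
}

/// Toy stand-in for a Script property lookup. The ranges cover only
/// some letters of each script and are an approximation.
fn toy_script(c: char) -> Script {
    match c as u32 {
        0x0E01..=0x0E2E => Script::Thai,     // Thai consonants
        0x0E81..=0x0E82 => Script::Lao,      // first two Lao letters
        0x1780..=0x17A2 => Script::Khmer,    // Khmer consonants
        0x3041..=0x3096 => Script::Hiragana, // Hiragana letters
        0x30A1..=0x30FA => Script::Katakana, // Katakana letters
        0x4E00..=0x9FFF => Script::Han,      // CJK Unified Ideographs
        _ => Script::Other,
    }
}

/// Pick an engine from the code point's script, falling back to the
/// rule-based UAX #29 segmenter for everything else.
fn select_engine(c: char) -> Engine {
    match toy_script(c) {
        Script::Thai => Engine::ThaiDictionary,
        Script::Lao => Engine::LaoDictionary,
        Script::Khmer => Engine::KhmerDictionary,
        Script::Han | Script::Hiragana | Script::Katakana => Engine::CjDictionary,
        Script::Other => Engine::RuleBased,
    }
}
```

For example, `select_engine('\u{0E01}')` would pick the Thai engine in this sketch, while plain Latin text stays on the rule-based path.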
@makotokato Here is the example in the documentation for using the `Word_Break` property. `Extended_Pictographic` and `Script` are also available. Note that the getters' return values may change pending the discussion in #1239.
Thanks a lot.