Implement Unicode properties required by UAX 29

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org

Implement Unicode properties required by UAX 29 #1214

Closed: aethanyc closed this issue 3 years ago

aethanyc commented 3 years ago

We need the following Unicode properties:

Grapheme_Cluster_Break
Sentence_Break
Word_Break
Extended_Pictographic for GB11
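
For context on why Extended_Pictographic is listed separately: rule GB11 in UAX #29 forbids a grapheme cluster break between a ZWJ and a following Extended_Pictographic character when the ZWJ is preceded by `Extended_Pictographic Extend*`. Below is a minimal Rust sketch of that single rule. The `Prop` enum and `gb11_no_break` function are hypothetical names made up for illustration, not ICU4X APIs, and the sketch assumes each code point has already been classified.

```rust
/// Hypothetical per-code-point classification, for illustration only.
/// A real segmenter would derive this from Unicode property data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Prop {
    ExtPict, // Extended_Pictographic = Yes
    Extend,  // Grapheme_Cluster_Break = Extend
    Zwj,     // Grapheme_Cluster_Break = ZWJ
    Other,
}

/// Returns true when GB11 forbids a break between the last element of
/// `before` and `after`:
///   \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}
fn gb11_no_break(before: &[Prop], after: Prop) -> bool {
    if after != Prop::ExtPict {
        return false;
    }
    let mut it = before.iter().rev();
    // The code point immediately before the boundary must be ZWJ.
    if it.next() != Some(&Prop::Zwj) {
        return false;
    }
    // Skip any run of Extend, then require Extended_Pictographic.
    for p in it {
        match p {
            Prop::Extend => continue,
            Prop::ExtPict => return true,
            _ => return false,
        }
    }
    false
}
```

For example, `gb11_no_break(&[Prop::ExtPict, Prop::Zwj], Prop::ExtPict)` reports that no break is allowed, which is what keeps emoji ZWJ sequences in one cluster.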

aethanyc commented 3 years ago

@makotokato Could you check whether these properties are sufficient for UAX29?

makotokato commented 3 years ago

> We need the following Unicode properties:
>
> * Grapheme_Cluster_Break
> * Sentence_Break
> * Word_Break
> * Extended_Pictographic for [GB11](https://www.unicode.org/reports/tr29/#GB11)

These cover what the UAX #29 spec requires. But to implement a word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.

For East Asian languages, we can recognize SA in the Line_Break property. CJ will be ID plus some other values from Line_Break or other properties.

aethanyc commented 3 years ago

> These cover what the UAX #29 spec requires. But to implement a word segmenter, we need more categories for Chinese, Japanese, and other East Asian languages.
>
> For East Asian languages, we can recognize SA in the Line_Break property. CJ will be ID plus some other values from Line_Break or other properties.

I assume we need to query a code point's `Script` to switch to a different language break engine, like ICU's `ICULanguageBreakFactory::loadEngineFor()`.
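
A rough sketch of that dispatch idea in Rust, under stated assumptions: `Script`, `BreakEngine`, `script_of`, and `engine_for` are hypothetical stand-ins written for this example, not ICU4X or ICU4C APIs. The lookup approximates the Script property with a few hard-coded block ranges, which is not exact (some code points in these blocks have Script=Common).

```rust
// Hypothetical stand-ins for illustration; not the ICU4X or ICU4C API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Thai,
    Lao,
    Myanmar,
    Khmer,
    Hiragana,
    Katakana,
    Han,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BreakEngine {
    RuleBased,  // plain UAX #29 rules
    Dictionary, // e.g. Thai, Lao, Myanmar, Khmer
    Cjk,        // Chinese and Japanese
}

/// Toy Script lookup that approximates scripts by Unicode block.
/// A real implementation would read the Script property from Unicode data.
fn script_of(c: char) -> Script {
    match c as u32 {
        0x0E00..=0x0E7F => Script::Thai,
        0x0E80..=0x0EFF => Script::Lao,
        0x1000..=0x109F => Script::Myanmar,
        0x1780..=0x17FF => Script::Khmer,
        0x3040..=0x309F => Script::Hiragana,
        0x30A0..=0x30FF => Script::Katakana,
        0x4E00..=0x9FFF => Script::Han,
        _ => Script::Other,
    }
}

/// Picks a break engine from the code point's (approximated) script.
fn engine_for(c: char) -> BreakEngine {
    match script_of(c) {
        Script::Thai | Script::Lao | Script::Myanmar | Script::Khmer => BreakEngine::Dictionary,
        Script::Han | Script::Hiragana | Script::Katakana => BreakEngine::Cjk,
        Script::Other => BreakEngine::RuleBased,
    }
}
```

With this sketch, `engine_for('ก')` selects the dictionary engine, `engine_for('日')` selects the CJK engine, and `engine_for('a')` falls back to the rule-based engine.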

Luckily, ICU4X has already implemented the Script property. Here is a test of the script CodePointTrie, but I think @echeran is documenting a nicer API, `icu_properties::maps::get_script()`, in #1204.

aethanyc commented 3 years ago

@makotokato Here is the example in the documentation for using the Word_Break property. The Extended_Pictographic and Script properties are also available. Note that the getter's return value may change depending on the discussion in #1239.

makotokato commented 3 years ago

Thanks a lot.