Han/Katakana/Hiragana property for word segmenter's datagen

unicode-org / icu4x

Han/Katakana/Hiragana property for word segmenter's datagen #2239

Open makotokato opened 2 years ago

makotokato commented 2 years ago

From https://github.com/unicode-org/icu4x/pull/2209

In the word segmenter's datagen, we assign a special property to East Asian languages so that they are segmented with the LSTM or the dictionary. To improve CJ (Chinese/Japanese) support, we have to assign Han, Katakana, and Hiragana either the same property as the existing East Asian languages or a dedicated property.

Also, UAX29 does have a rule for Katakana, but we might have to use a dictionary for Katakana instead of the UAX29 rule.
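
To make the idea concrete, here is a minimal sketch of what "assigning a property" could look like. It is not ICU4X's datagen code: the `WordProp` enum, the `assign` function, and the `katakana_via_dictionary` flag are hypothetical names, and the code point ranges are simplified (basic Hiragana and Katakana letters and the main CJK Unified Ideographs block only).

```rust
/// Hypothetical property values for illustration; not ICU4X's real table layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordProp {
    /// Segmented by the UAX29 rule tables (WB13: Katakana × Katakana).
    Katakana,
    /// Handed off to a dictionary or LSTM segmenter.
    Complex,
    /// Everything else in this sketch.
    Other,
}

/// Assign a property to one code point.
/// `katakana_via_dictionary` models the open question in this issue:
/// should Katakana stay rule-based or go to the dictionary?
pub fn assign(c: char, katakana_via_dictionary: bool) -> WordProp {
    match c as u32 {
        // Hiragana letters (simplified range).
        0x3041..=0x3096 => WordProp::Complex,
        // CJK Unified Ideographs, main block only (simplified).
        0x4E00..=0x9FFF => WordProp::Complex,
        // Katakana letters (simplified range).
        0x30A1..=0x30FA => {
            if katakana_via_dictionary {
                WordProp::Complex
            } else {
                WordProp::Katakana
            }
        }
        _ => WordProp::Other,
    }
}
```

With the flag off, `assign('ア', false)` yields `WordProp::Katakana`, so the rule tables keep handling Katakana. With it on, Katakana gets the same property as Han and Hiragana.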

sffc commented 2 years ago

@makotokato Is this an enhancement that can be done after 1.0, or does it affect the schema of the data?

makotokato commented 2 years ago

> @makotokato Is this an enhancement that can be done after 1.0, or does it affect the schema of the data?

Yes, this is an enhancement issue. Han/Kanji and Hiragana are already handled by the dictionary. But UAX29's word segmentation spec has the rule Katakana × Katakana (no break between two Katakana characters). If we use the dictionary for Katakana, we have to modify the spec or add a note to UAX29.
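
The difference between the two approaches can be sketched as follows. This is an illustration only, not ICU4X code: `is_katakana`, `rule_breaks`, and `dictionary_breaks` are hypothetical helpers, the Katakana test is simplified, every non-Katakana pair is treated as a break opportunity, and the dictionary lookup is a plain greedy longest match over a word list supplied by the caller.

```rust
/// Simplified Katakana test: Katakana letters plus the prolonged sound mark (U+30FC).
fn is_katakana(c: char) -> bool {
    matches!(c as u32, 0x30A1..=0x30FA | 0x30FC)
}

/// Rule-based sketch: return break offsets (byte indices), never breaking
/// between two Katakana characters (UAX29 WB13) and breaking everywhere else.
pub fn rule_breaks(text: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    let mut prev: Option<char> = None;
    for (i, c) in text.char_indices() {
        if let Some(p) = prev {
            if !(is_katakana(p) && is_katakana(c)) {
                breaks.push(i);
            }
        }
        prev = Some(c);
    }
    if !text.is_empty() {
        breaks.push(text.len());
    }
    breaks
}

/// Dictionary sketch: greedy longest match against `dict`,
/// falling back to a single character when no word matches.
pub fn dictionary_breaks(text: &str, dict: &[&str]) -> Vec<usize> {
    let mut breaks = vec![0];
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        let step = dict
            .iter()
            .filter(|w| !w.is_empty() && rest.starts_with(**w))
            .map(|w| w.len())
            .max()
            .unwrap_or_else(|| rest.chars().next().map_or(1, |c| c.len_utf8()));
        pos += step;
        breaks.push(pos);
    }
    breaks
}
```

For the compound "アイスクリーム", `rule_breaks` reports only the start and end of the string, because WB13 keeps the whole Katakana run together. `dictionary_breaks` with the words "アイス" and "クリーム" adds a break between them, which is the behavior that would diverge from UAX29.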

makotokato commented 2 years ago

Han and Hiragana were done in https://github.com/unicode-org/icu4x/commit/7215608a0984a33439e43835de17a729a521bd51

sffc commented 2 years ago

My understanding is that this is fully in the datagen crate (it changes the generated rule tables, not the code that reads from the rule tables). This is a good 1.x issue.

sffc commented 2 years ago

Good first issue for someone interested in coming up to speed on rule-based segmentation.

CC @younies