Open makotokato opened 2 years ago
@makotokato Is this an enhancement that can be done after 1.0, or does it affect the schema of the data?
Yes, this is an enhancement issue. Han/Kanji and Hiragana are already handled by the dictionary. But UAX29's word segmentation spec has the rule Katakana × Katakana (no break between two Katakana characters). If we use the dictionary for Katakana, we have to modify the spec or add a note to UAX29.
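To illustrate what Katakana × Katakana implies, here is a minimal sketch, not ICU4X code. `is_katakana_approx` and `segment_katakana_runs` are hypothetical helpers, and the character range is only a rough stand-in for the real Word_Break=Katakana property, which covers more code points. The sketch applies only this one rule and breaks everywhere else, so it is not a UAX29 implementation.

```rust
/// Rough approximation of Word_Break=Katakana (assumption: only the
/// main Katakana block, skipping U+30FB KATAKANA MIDDLE DOT).
fn is_katakana_approx(c: char) -> bool {
    matches!(c, '\u{30A1}'..='\u{30FA}' | '\u{30FC}'..='\u{30FF}')
}

/// Hypothetical helper: applies only the "Katakana × Katakana" rule.
/// Two adjacent Katakana characters are never split; every other
/// boundary is treated as a break for the sake of the illustration.
fn segment_katakana_runs(text: &str) -> Vec<String> {
    let mut segments: Vec<String> = Vec::new();
    let mut prev: Option<char> = None;
    for c in text.chars() {
        let no_break =
            matches!(prev, Some(p) if is_katakana_approx(p) && is_katakana_approx(c));
        if no_break {
            // Extend the current segment; it exists because `prev` is set.
            if let Some(last) = segments.last_mut() {
                last.push(c);
            }
        } else {
            segments.push(c.to_string());
        }
        prev = Some(c);
    }
    segments
}
```

With this rule alone, a compound such as コンピューターサイエンス stays one segment, whereas a dictionary-based segmenter could split it into コンピューター and サイエンス. That difference is why switching Katakana to the dictionary would diverge from the spec.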
Han and Hiragana are done by https://github.com/unicode-org/icu4x/commit/7215608a0984a33439e43835de17a729a521bd51
My understanding is that this is fully in the datagen crate (change the generated rule tables, not the code that reads from the rule tables). This is a good 1.x issue.
Good first issue for someone interested in coming up to speed on rule-based segmentation.
CC @younies
From https://github.com/unicode-org/icu4x/pull/2209
In the word segmenter's datagen, we assign a special property to East Asian languages so that they use the LSTM or the dictionary. To improve CJ support, we have to assign either the same property as EA or a special property.
Also, UAX29 does have Katakana rules, but we might have to use the dictionary for Katakana instead of the UAX29 rule.