Closed by aethanyc 2 years ago
@FrankYFTang can you share what you're doing on that for ICU4C?
I have already landed what I intended to do for ICU70 in the ICU tree: https://docs.google.com/document/d/1EVK2CwOmUamJwMOMbbdTz7tuaV0IR21rMoH7a3pyFwE/edit#heading=h.xgjl2srtytjt
I believe ICU4X already has that code.
There is no plan to work on Lao and Khmer at this point.
What ICU4C and ICU4J have is similar to what got put into https://github.com/unicode-org/icu4x/tree/main/experimental/segmenter_lstm
I am considering making the dictionary-based segmenter part of the UAX29 segmenter, similar to ICU4C/ICU4J. We can use it for the Lao and Khmer line segmenter too.
We have added 4 language models using LSTM and dictionary. Closing this.
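For readers unfamiliar with the idea, here is a minimal sketch of dictionary-based segmentation using greedy longest match. It is only an illustration: the function name `segment`, the `max_word_chars` parameter, and the use of a plain `HashSet` as the dictionary are all assumptions made for this example, and it is not meant to reflect how ICU4C, ICU4J, or ICU4X actually implement their dictionary segmenters.

```rust
use std::collections::HashSet;

/// Hypothetical helper: greedy longest-match segmentation.
/// Returns the byte offset at the end of each segment.
/// Characters not covered by any dictionary word become one-character segments.
fn segment(text: &str, dict: &HashSet<&str>, max_word_chars: usize) -> Vec<usize> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();

    let mut breaks = Vec::new();
    let mut pos = 0; // index into `bounds`, not a byte offset
    while pos + 1 < bounds.len() {
        let max_end = (pos + max_word_chars).min(bounds.len() - 1);
        // Default to a single character if no dictionary word matches.
        let mut end = pos + 1;
        // Try the longest candidate first.
        for cand in (pos + 1..=max_end).rev() {
            if dict.contains(&text[bounds[pos]..bounds[cand]]) {
                end = cand;
                break;
            }
        }
        breaks.push(bounds[end]);
        pos = end;
    }
    breaks
}
```

With a dictionary containing "ภาษา" and "ไทย", calling `segment("ภาษาไทย", &dict, 8)` yields the byte offsets 12 and 21, since each of these Thai characters takes 3 bytes in UTF-8. Greedy matching is the simplest choice here; a production segmenter would typically weigh alternative segmentations rather than always taking the longest word.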
Per the SA (Complex-Context Dependent) class in UAX14, finding line-breaking opportunities in Southeast Asian languages requires morphological analysis.
Currently, https://github.com/unicode-org/lstm_word_segmentation contains Thai and Burmese (Myanmar) models. We have a prototype, experimental/segmenter_lstm/, for Thai. For Burmese, we probably need to import the model from lstm_word_segmentation and integrate it into segmenter_lstm. For Lao and Khmer, we have two choices:
cc @makotokato @dminor