Support line breaking for South East Asian languages (Burmese, Lao, Khmer)

aethanyc commented 3 years ago

Per the SA (Complex-Context Dependent) class in UAX14, finding line breaking opportunities in South East Asian languages requires morphological analysis.
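To make the scope concrete, here is a minimal sketch of how text could be split into runs that need this kind of analysis. It is an illustration only: `COMPLEX_BLOCKS`, `script_of`, and `complex_runs` are hypothetical names, and Unicode block ranges are used as a rough stand-in for the SA class, which UAX14 actually assigns per code point (digits and some punctuation inside these blocks are not SA).

```python
# Unicode block ranges for the scripts discussed in this issue.
# Assumption: block membership only approximates the UAX14 SA class.
COMPLEX_BLOCKS = {
    "Thai": (0x0E00, 0x0E7F),
    "Lao": (0x0E80, 0x0EFF),
    "Myanmar": (0x1000, 0x109F),
    "Khmer": (0x1780, 0x17FF),
}


def script_of(ch):
    """Return the block name for ch, or None if it is outside all four blocks."""
    cp = ord(ch)
    for name, (lo, hi) in COMPLEX_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def complex_runs(text):
    """Return (start, end, script) for each maximal same-script run."""
    runs = []
    start = None
    current = None
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script != current:
            # Close the run that just ended, if it was a complex-context run.
            if current is not None:
                runs.append((start, i, current))
            start, current = i, script
    if current is not None:
        runs.append((start, len(text), current))
    return runs
```

Each run could then be handed to whichever per-language model or dictionary is available, while the rest of the text goes through the ordinary UAX14 rules.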

Currently, https://github.com/unicode-org/lstm_word_segmentation contains Thai and Burmese (Myanmar) models. We have a prototype in experimental/segmenter_lstm/ for Thai.

For Burmese, we probably need to import the model from lstm_word_segmentation and integrate it into segmenter_lstm.

For Lao and Khmer, we have two choices:

create LSTM models for them (preferable if it is technically possible)
add dictionary support like ICU. The dictionary data is in https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr/dictionaries
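For reference, the dictionary option can be sketched as a greedy longest-match segmenter. This is a simplified illustration, not the algorithm ICU uses; `dictionary_breaks` and the two-word toy dictionary are hypothetical, and a real implementation would load the dictionary data linked above.

```python
def dictionary_breaks(text, dictionary):
    """Greedy longest-match segmentation; returns the end offset of each segment."""
    max_len = max((len(word) for word in dictionary), default=1)
    breaks = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no dictionary word starts at this position.
        step = 1
        for length in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + length] in dictionary:
                step = length
                break
        pos += step
        breaks.append(pos)
    return breaks


# Toy dictionary (assumption): "language" and "Lao" in Lao script.
word_language = "ພາສາ"
word_lao = "ລາວ"
toy_words = {word_language, word_lao}

print(dictionary_breaks(word_language + word_lao, toy_words))  # [4, 7]
```

Greedy matching can pick the wrong split when one word is a prefix of another, which is one reason an LSTM model is the preferred option if it is feasible.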

cc @makotokato @dminor

zbraniecki commented 3 years ago

@FrankYFTang can you share what you're doing on that for ICU4C?

FrankYFTang commented 3 years ago

> @FrankYFTang can you share what you're doing on that for ICU4C?

I have already landed what I intended to do for ICU 70 in the ICU tree: https://docs.google.com/document/d/1EVK2CwOmUamJwMOMbbdTz7tuaV0IR21rMoH7a3pyFwE/edit#heading=h.xgjl2srtytjt

I believe ICU4X already has that code.

I have no plans to work on Lao and Khmer at this point.

FrankYFTang commented 3 years ago

What ICU4C and ICU4J have is similar to what got put into https://github.com/unicode-org/icu4x/tree/main/experimental/segmenter_lstm

makotokato commented 3 years ago

I am considering making a dictionary-based segmenter part of the UAX29 segmenter, similar to ICU4C/ICU4J. We can use it for the Lao and Khmer line segmenter too.

makotokato commented 2 years ago

We have added LSTM and dictionary models for 4 languages. Closing this.

unicode-org / icu4x

Support line breaking for South East Asian languages (Burmese, Lao, Khmer) #813