unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Investigate BudouX and consider using it for CJK+ segmentation #1803

Open sffc opened 2 years ago

sffc commented 2 years ago

BudouX is a new project out of Google for CJ segmentation with a focus on data size reduction. We should investigate it as an option for ICU4X.

https://github.com/google/budoux

The docs say that it may also be scalable to other languages. I think we should continue with the LSTM approach for Thai/Lao/Khmer/Burmese, but it would be worth investigating what BudouX could bring to the table in that case.

CC @hiroyuki-komatsu @aethanyc @makotokato

makotokato commented 2 years ago

Reference: https://unicode-org.atlassian.net/browse/ICU-21699

echeran commented 2 years ago

an existing Rust port: https://github.com/sg0hsmt/budoux-rs

makotokato commented 2 years ago

BudouX's segmentation rules seem to differ from those of other segmenters. When a noun is followed by a particle or a similar function word, it returns them combined as a single segment.

Using a dictionary (which cannot cover every word, and whose data is too big) or morphological analysis:

今日 / は / いい / 天気 / です。

Using BudouX (the data is small, even as JSON):

今日は / いい / 天気です。

In this example, 今日 (today) and は (the particle that marks the topic or subject) are two words under a strict word rule, but BudouX treats them as one. What counts as a "word segment" is ambiguous in Japanese, though, so both results are acceptable as Japanese.
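
To make the difference in granularity concrete, here is a minimal sketch that turns the morpheme-level output above into the phrase-level output by attaching particle-like tokens to the chunk before them. This is only an illustration: the `ATTACHING` set and the `merge_to_phrases` helper are hypothetical names made up for this example, and BudouX itself uses a learned model rather than a hand-written rule like this.

```python
# Hypothetical, hand-picked tokens for this one sentence only.
# This set is an assumption for illustration, not data from BudouX or ICU4X.
ATTACHING = {"は", "が", "を", "に", "です。"}


def merge_to_phrases(morphemes):
    """Attach each particle-like token to the chunk that precedes it."""
    phrases = []
    for token in morphemes:
        if phrases and token in ATTACHING:
            # Glue the token onto the previous chunk instead of starting a new one.
            phrases[-1] += token
        else:
            phrases.append(token)
    return phrases


# Morpheme-level segmentation, as in the dictionary example above.
morphemes = ["今日", "は", "いい", "天気", "です。"]
phrases = merge_to_phrases(morphemes)
print(" / ".join(phrases))  # 今日は / いい / 天気です。
```

Both outputs break at positions a Japanese reader would accept; they differ only in whether a particle starts its own segment.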

Also, BudouX has zh-Hans data, but we need zh-Hant as well.

I guess it is worth adding this behind feature=budoux, since the CJ dictionary is too big?

makotokato commented 2 years ago

Of course, on 128 bytes of UTF-8 text, it is slower than the dictionary segmenter in ICU4C and ICU4X (480,071 ns/iter vs. 946 ns/iter).
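
For scale, the two figures can be put side by side with some quick arithmetic. This sketch assumes the first number is the BudouX port and the second is the dictionary segmenter, which is how the sentence above reads; the variable names are made up for the example.

```python
# Figures quoted in the comment above, in ns/iter.
# Assumption: the first is the BudouX port, the second the dictionary segmenter.
budoux_ns = 480_071
dictionary_ns = 946
input_bytes = 128  # size of the UTF-8 test text

slowdown = budoux_ns / dictionary_ns
print(f"slowdown: about {slowdown:.0f}x")
print(f"BudouX port: {budoux_ns / input_bytes:.0f} ns/byte")
print(f"dictionary:  {dictionary_ns / input_bytes:.1f} ns/byte")
```

That works out to a gap of roughly 500x on this input.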

makotokato commented 2 years ago

https://github.com/makotokato/budoux-rs

sffc commented 2 years ago

Thanks! It looks like there may be low-hanging fruit to increase the performance: https://github.com/makotokato/budoux-rs/issues/1

makotokato commented 1 month ago

Also, https://github.com/mozilla/standards-positions/issues/877