Open sffc opened 2 years ago
an existing Rust port: https://github.com/sg0hsmt/budoux-rs
BudouX's segmentation rules seem to differ from other segmenters: when a noun is followed by a particle, postposition, etc., it returns them as a single combined segment.

Strict word segmentation: 今日 / は / いい / 天気 / です。
BudouX: 今日は / いい / 天気です。

In this example, 今日 (today) and は (the particle marking the topic/subject) are two words under a strict word rule, but BudouX treats them as one segment. Since "word segmentation" is ambiguous in Japanese, both results are acceptable.
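For context, BudouX works by scoring each candidate break point with a lightweight feature-weight model and splitting where the score is positive, rather than consulting a dictionary. The following is a minimal sketch of that boundary-scoring idea only — the feature (a single bigram straddling the boundary), the weights, and the base score here are all made up for illustration and are not the real BudouX model, which uses many more feature types and trained weights.

```rust
use std::collections::HashMap;

/// Toy boundary-scoring segmenter in the spirit of BudouX: sum the
/// weights of features active at each candidate boundary and break
/// where the total score exceeds zero.
fn segment(text: &str, weights: &HashMap<String, i32>, base: i32) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = vec![String::new()];
    for i in 0..chars.len() {
        if i > 0 {
            // Single toy feature: the character bigram straddling the
            // candidate boundary (real BudouX uses many feature types).
            let feature = format!("UW:{}{}", chars[i - 1], chars[i]);
            let score = base + weights.get(&feature).copied().unwrap_or(0);
            if score > 0 {
                chunks.push(String::new());
            }
        }
        chunks.last_mut().unwrap().push(chars[i]);
    }
    chunks
}

fn main() {
    // Hypothetical weights: reward a break after the particle は
    // and between いい and 天気; every other boundary scores -1.
    let mut weights = HashMap::new();
    weights.insert("UW:はい".to_string(), 10);
    weights.insert("UW:い天".to_string(), 10);
    let chunks = segment("今日はいい天気です。", &weights, -1);
    println!("{:?}", chunks); // ["今日は", "いい", "天気です。"]
}
```

Because the model is just a table of feature weights, the data payload can be far smaller than a segmentation dictionary, which is the size advantage discussed in this issue.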
Also, BudouX has zh-Hans data, but we need zh-Hant too.

I guess it would be worth adding this behind a feature = "budoux" flag, since the CJ dictionary is too big?

Of course, on 128-byte UTF-8 text it is slower than the dictionary-based approach in ICU4C and ICU4X (480,071 ns/iter vs. 946 ns/iter).
Thanks! It looks like there may be low-hanging fruit to increase the performance: https://github.com/makotokato/budoux-rs/issues/1
BudouX is a new project out of Google for CJ segmentation with a focus on data size reduction. We should investigate it as an option for ICU4X.
https://github.com/google/budoux
The docs say that it may also be scalable to other languages. I think we should continue with the LSTM approach for Thai/Lao/Khmer/Burmese, but it would be worth investigating what BudouX could bring to the table for those languages.
CC @hiroyuki-komatsu @aethanyc @makotokato