unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

char16trie converter #2519

Open makotokato opened 1 year ago

makotokato commented 1 year ago

The segmenter currently uses char16trie for dictionary-based segmentation. Some East Asian dictionaries can be removed or replaced by LSTM, but Chinese and Japanese still use the dictionary.

The current data is generated by ICU4C's tools, and the binary output of those tools is then converted to a TOML file. So I think it would be better to add a tool that generates the char16trie directly from ICU4C's dictionary text file.
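To make the proposal concrete, here is a minimal sketch of the first stage such a tool would need: reading a dictionary text file into sorted UTF-16 key/value entries that a char16trie builder could consume. Everything here is an assumption for illustration, not existing ICU4X or ICU4C code: the function name `parse_dictionary` is hypothetical, the input is assumed to have one `word` or `word<TAB>value` entry per line with `#` starting a comment, and a missing value is assumed to default to 0. The trie serialization itself is not shown.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: parse dictionary text into entries keyed by UTF-16
/// code units. A BTreeMap keeps the keys in code-unit order, which is the
/// order a trie builder would typically want to consume them in.
///
/// Assumed line format: `word` or `word<TAB>value`; `#` starts a comment.
fn parse_dictionary(text: &str) -> Result<BTreeMap<Vec<u16>, i32>, String> {
    let mut entries = BTreeMap::new();
    for (idx, raw) in text.lines().enumerate() {
        // Drop a leading byte-order mark, if any.
        let line = raw.trim_start_matches('\u{feff}');
        // Drop everything from the first `#` to the end of the line.
        let line = match line.find('#') {
            Some(pos) => &line[..pos],
            None => line,
        };
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split_whitespace();
        let word = match fields.next() {
            Some(w) => w,
            None => continue,
        };
        // Assumption: entries without an explicit value get 0.
        let value = match fields.next() {
            Some(v) => v
                .parse::<i32>()
                .map_err(|e| format!("line {}: bad value {:?}: {}", idx + 1, v, e))?,
            None => 0,
        };
        // char16trie keys are sequences of 16-bit units, so store UTF-16.
        let key: Vec<u16> = word.encode_utf16().collect();
        entries.insert(key, value);
    }
    Ok(entries)
}
```

Calling `parse_dictionary` on the contents of a dictionary file would give the sorted entries to feed into whichever builder is chosen, whether a Rust port or the ICU4C builder.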

sffc commented 1 year ago

Consider doing what we did for the CodePointTrieBuilder: rather than writing the builder code ourselves, we compile the ICU4C builder code into a WASM file and ship that in our repo.

sffc commented 1 year ago

@makotokato Does this block any other issues? Can you set an assignee (or "help wanted") and a milestone (or "backlog")?

makotokato commented 1 year ago

Not a blocker.