unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Remove hardcoded segmenter data from datagen #3003

Closed. robertbastian closed this issue 1 year ago.

robertbastian commented 1 year ago

There are many data files located here:

https://github.com/unicode-org/icu4x/tree/main/provider/datagen/data/segmenter

Is this the best place for the source of truth, or can we source them from elsewhere?

sffc commented 1 year ago

@aethanyc --- could you take a look at these files and list where you think appropriate data sources would be?

aethanyc commented 1 year ago

The LSTM models are from https://github.com/unicode-org/lstm_word_segmentation/tree/develop/Models. We can download them if they are packaged in the lstm_word_segmentation repository. Note: they are currently only available in the develop branch, not in the main branch.


The dictionary TOML files are converted via the following commands (see the comment at the beginning of each TOML file). For example, for the CJ dictionary:

```
# This data is created by the following using ICU4C tools
# LD_LIBRARY_PATH=lib bin/gendict --uchars data/brkitr/dictionaries/cjdic.txt tmp.bin
# dd if=tmp.bin of=cjdict.dict bs=1 skip=64
```

Maybe the conversion and packing can be part of the ICU4C release process so that we can download it somewhere?
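
For context, the dd step above just copies gendict's output minus its first 64 bytes (presumably the ICU data header, per skip=64). A minimal Rust sketch of the same operation, with illustrative file names only:

```rust
// Rough equivalent of `dd if=tmp.bin of=cjdict.dict bs=1 skip=64`:
// copy the input file minus its first 64 bytes.
use std::fs;
use std::io;

fn strip_leading_bytes(input: &str, output: &str, skip: usize) -> io::Result<()> {
    let bytes = fs::read(input)?;
    fs::write(output, bytes.get(skip..).unwrap_or(&[]))
}

fn main() -> io::Result<()> {
    strip_leading_bytes("tmp.bin", "cjdict.dict", 64)
}
```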


The UAX #14 rules are implemented by us in line.toml, and the UAX #29 rules in grapheme.toml, sentence.toml, and word.toml. These TOML files are written by hand, not derived from other files; they are the source of truth and should live in ICU4X.
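
To illustrate what these rule files feed into: datagen compiles them into the break data behind the segmenter APIs. A minimal sketch of exercising that data, assuming icu_segmenter 1.x with the compiled_data feature (not code from this repo):

```rust
// Minimal sketch: segment a string into grapheme clusters using the
// rule-based segmenter, whose break rules come from grapheme.toml.
// Assumes icu_segmenter 1.x with the "compiled_data" feature enabled.
use icu_segmenter::GraphemeClusterSegmenter;

fn main() {
    let segmenter = GraphemeClusterSegmenter::new();
    // Break indices in UTF-8 bytes; "a" + U+0301 forms one grapheme cluster.
    let breakpoints: Vec<usize> = segmenter.segment_str("a\u{0301}bc").collect();
    println!("{breakpoints:?}"); // e.g. [0, 3, 4, 5]
}
```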

cc @makotokato to double check my knowledge.

makotokato commented 1 year ago

Correct. See https://github.com/unicode-org/icu4x/issues/2519 for char16trie data generation.

aethanyc commented 1 year ago

@robertbastian @sffc Per my comment above, it would be nice to have the LSTM and dictionary data downloaded/generated from somewhere, so this issue seems like a P3 or P4 to me. Does it have to be P1 to block the release?

sffc commented 1 year ago

I guess my main concern is that this involves adding additional sources to datagen, so it would be best to have those in place when people start using datagen for the segmenter.

Manishearth commented 1 year ago

cc @eggrobin this is the issue, which has multiple parts:

robertbastian commented 1 year ago

sffc commented 1 year ago

Discussions:

robertbastian commented 1 year ago

With #3396 and #3399, dictionary and LSTM sources are now controlled by the client. However, we will keep the hardcoded fallback data around until 2.0.

What's left are the handwritten rule tables.

sffc commented 1 year ago

Discussion: we could upstream the data files into CLDR but we need to make sure they are easily maintainable.

sffc commented 1 year ago

@robertbastian to drive the relationship with CLDR to get these files upstreamed.

robertbastian commented 1 year ago

So there are segmentation files in CLDR already, which are generated from https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/resources/org/unicode/tools/SegmenterDefault.txt (this is a lot closer to the spec and might become part of the spec in the future). Given that the data is already there, I think it's unlikely that our (more processed) TOML versions will get accepted.

I've added parsing for these CLDR files in #3440, but I will need @makotokato's help to generate our representation from those.
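
The CLDR/UAX rule notation marks forbidden breaks with × and allowed breaks with ÷. Purely as a hypothetical illustration of turning one such rule into an in-memory form (this is not the parser added in #3440), a sketch could look like:

```rust
// Hypothetical sketch only: split a UAX-style segmentation rule such as
// "$CR × $LF" into its left context, break status, and right context.
#[derive(Debug)]
struct SegmentRule<'a> {
    before: &'a str,
    /// true for '÷' (break allowed), false for '×' (break forbidden)
    break_allowed: bool,
    after: &'a str,
}

fn parse_rule(rule: &str) -> Option<SegmentRule<'_>> {
    let (sep, break_allowed) = if rule.contains('÷') {
        ('÷', true)
    } else if rule.contains('×') {
        ('×', false)
    } else {
        return None;
    };
    let (before, after) = rule.split_once(sep)?;
    Some(SegmentRule {
        before: before.trim(),
        break_allowed,
        after: after.trim(),
    })
}

fn main() {
    // GB3 from UAX #29: do not break between CR and LF.
    println!("{:?}", parse_rule("$CR × $LF"));
}
```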

robertbastian commented 1 year ago

Closing this for now, as the LSTM and dictionary data is done and the rules are a bigger Unicode-wide undertaking for Q3.

sffc commented 1 year ago

Follow-up: https://github.com/unicode-org/icu4x/issues/3457