mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Investigate switching to ICU segmenter #860

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

We currently use the Moses tokenizer for alignments because it seems to be a standard in the MT world and it is what OpusTrainer supports for detokenization (we will likely feed tokenized text to it to support CJK).

We should investigate switching to the ICU segmenter because we use it in Firefox during inference.

Greg:

Two things could be helpful here. The first is the ICU segmenter, which is equivalent to the powerful Intl.Segmenter in JavaScript. I believe there are ICU bindings available in Python.

Tutorial: https://www.bedrick.org/notes/python-icu-bindings/

The other is the unicodedata package from the Python standard library, which gives you more information about codepoints.

https://docs.python.org/3/library/unicodedata.html
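As a rough illustration of what unicodedata exposes per codepoint, here is a minimal sketch. The helper names `describe` and `is_punctuation` are made up for this example and are not part of the pipeline.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (codepoint, general category, name) for a single character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )


def is_punctuation(ch: str) -> bool:
    # Every punctuation general category starts with "P" (Po, Ps, Pe, Pd, ...).
    return unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    # Mixed Latin and CJK sample, including an ideographic full stop.
    for ch in "a。中!":
        print(describe(ch), is_punctuation(ch))
```

The general category is probably the most useful field for tokenization-related checks, since it distinguishes letters, marks, punctuation, and separators regardless of script.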

You can also write regexes using character properties.

It's also important to run normalization on the text (so that the same character is always represented by the same codepoints), which I don't think the opus tools have applied consistently.
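To make the normalization point concrete, here is a small sketch using the standard library. The function name `normalize_text` is hypothetical, and which normalization form the pipeline should use (NFC is only the default here) is an open assumption, not a decision.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Normalize text to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    composed = "\u00e9"     # "é" as a single precomposed codepoint
    decomposed = "e\u0301"  # "e" followed by a combining acute accent

    # The two strings render identically but compare unequal as raw codepoints.
    print(composed == decomposed)
    # After normalization they are the same string.
    print(normalize_text(composed) == normalize_text(decomposed))

    # NFKC additionally folds compatibility characters, e.g. a fullwidth "A".
    # This is lossy, so it may not be appropriate for every dataset.
    print(normalize_text("\uff21", "NFKC"))
```

Without this step, the same word can end up as different token sequences depending on how the source encoded its accents, which would also affect alignments.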

I think we should lean in to our OpusCleaner fork for fixes here, and once we are done see if they want to accept any of our fixes.