We currently use the Moses tokenizer for alignments because it appears to be the standard in the MT world and it is what OpusTrainer supports for detokenization (we will likely feed tokenized text to it to support CJK).
We should investigate switching to the ICU segmenter because we use it in Firefox during inference.
Greg:
Two things could be helpful here. One is the ICU segmenter, which is equivalent to the powerful Intl.Segmenter in JavaScript; I believe there are ICU bindings available in Python.
You can also write regexes using character properties.
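As a rough illustration of splitting on character properties: Python's built-in `re` module does not support `\p{...}` property classes (the third-party `regex` package does), so this sketch swaps in the standard library's `unicodedata.category` to get the same kind of classification. The function names `char_class` and `tokenize` are hypothetical, and the rule set (letters, marks, and numbers form words; every other non-space character is its own token) is an assumption for illustration, not what Moses or ICU do.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Map a character to a coarse class via its Unicode general category."""
    major = unicodedata.category(ch)[0]
    if major in ("L", "M", "N"):  # letters, combining marks, numbers
        return "word"
    if major == "Z" or ch.isspace():  # separators and whitespace controls
        return "space"
    return "other"  # punctuation, symbols, and everything else


def tokenize(text: str) -> list[str]:
    """Split text into word runs and single non-word, non-space characters."""
    tokens: list[str] = []
    current = ""
    for ch in text:
        cls = char_class(ch)
        if cls == "word":
            current += ch
            continue
        # A non-word character closes any word in progress.
        if current:
            tokens.append(current)
            current = ""
        if cls == "other":
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


print(tokenize("Hello, wörld!"))
```

Because this only looks at general categories, a run of CJK characters with no spaces would come out as a single token, which is one reason a dictionary-backed segmenter such as ICU is still worth investigating.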
It's also important to run Unicode normalization on the text so that the code points are all represented consistently, which I don't think the opus tools have applied consistently.
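A minimal sketch of what normalization buys us, using the standard library's `unicodedata.normalize`. The choice of NFC as the default form and the helper name `normalize_text` are assumptions for illustration; which form the pipeline should use is still an open question.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form (NFC assumed)."""
    return unicodedata.normalize(form, text)


# The same visible word, encoded two different ways.
composed = "caf\u00e9"      # 'é' as a single precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                                   # raw strings differ
print(normalize_text(composed) == normalize_text(decomposed))   # equal after NFC
```

Without this step, the two spellings would be treated as different strings by the tokenizer and the vocabulary, even though they render identically.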
I think we should lean into our OpusCleaner fork for fixes here and, once we are done, see if they want to accept any of our fixes.