mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Do not use aggressive dash splitting in tokenization #718

Closed eu9ene closed 4 months ago

eu9ene commented 4 months ago

I did not realize that the Moses tokenizer can modify text. With aggressive dash splitting enabled, dashes are represented as @-@ in the tokenized text. The tests didn't catch this because they tokenize text differently: they use sacremoses, a Python Moses tokenizer, without aggressive dash splitting, which is a separate problem. The C++-based opus-fast-mosestokenizer that we use in prod didn't install on macOS for me, and I wanted this quick test to run without Docker.
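To make the effect concrete, here is a minimal sketch of how an aggressive dash rule changes the characters of the text rather than only splitting it. The `tokenize` function below is a toy stand-in written for illustration, not the real Moses tokenizer or opus-fast-mosestokenizer, and `preserves_text` is a hypothetical check that a test could use to detect a tokenizer that modifies text.

```python
import re

# Toy approximation of the aggressive hyphen rule: a hyphen between two
# word characters becomes the separate token "@-@". This regex is an
# assumption for illustration, not the tokenizer's actual rule set.
_DASH = re.compile(r"(\w)-(?=\w)")


def tokenize(text, aggressive_dash_splits=False):
    """Toy whitespace tokenizer; NOT the real Moses tokenizer."""
    if aggressive_dash_splits:
        text = _DASH.sub(r"\1 @-@ ", text)
    return text.split()


def preserves_text(text, tokens):
    """Return True if the tokens hold exactly the non-space characters of text."""
    return "".join(tokens) == "".join(text.split())


sentence = "Use a semi-colon here"
plain = tokenize(sentence)
aggressive = tokenize(sentence, aggressive_dash_splits=True)

print(plain)       # ['Use', 'a', 'semi-colon', 'here']
print(aggressive)  # ['Use', 'a', 'semi', '@-@', 'colon', 'here']
print(preserves_text(sentence, plain), preserves_text(sentence, aggressive))
```

A check like `preserves_text` fails for the aggressive variant because `@-@` adds characters that are not in the source sentence, which is the property any code that maps tokens back to the original words depends on.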

The implication is that some of the remapped alignments might have been incorrect, but I assume most sentences don't include dashes, so it's not critical. It only happens for words where the dash is part of the word, for example: semi-colon. I tested it with the bug present, and it leads to merging all words after the dash into "one word" in the alignments.
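One way such a merge can arise is sketched below. The `token_to_word_map` function is hypothetical and is not the pipeline's remapping code; it assumes the tokenizer only splits words and never rewrites their characters. Under that assumption, a token such as `@-@` never matches the source text, the cursor stays on the dashed word, and every later token is assigned to it.

```python
def token_to_word_map(words, tokens):
    """Map each token index to the index of the whitespace-split word it came from.

    Hypothetical remapping for illustration only. It assumes each token is a
    literal piece of the current word.
    """
    mapping = []
    word_idx = 0
    remaining = words[0]
    for token in tokens:
        mapping.append(word_idx)
        if remaining.startswith(token):
            remaining = remaining[len(token):]
            # Move to the next word once the current one is fully consumed.
            if not remaining and word_idx + 1 < len(words):
                word_idx += 1
                remaining = words[word_idx]
        # A token that does not match (such as "@-@") leaves the cursor stuck
        # on the current word, so all later tokens map to that word as well.
    return mapping


words = "Use a semi-colon here now".split()
good = ["Use", "a", "semi", "-", "colon", "here", "now"]
bad = ["Use", "a", "semi", "@-@", "colon", "here", "now"]

print(token_to_word_map(words, good))  # [0, 1, 2, 2, 2, 3, 4]
print(token_to_word_map(words, bad))   # [0, 1, 2, 2, 2, 2, 2]
```

In the second mapping, "here" and "now" are attributed to word 2 (semi-colon), so any alignment links for those tokens would collapse onto the dashed word.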

I think the implications for teacher training are minor: there's some probability of inserting inline noise at the wrong position in sentences with dashed words. For the student, it's more important to land the fix because we use alignments not only for data augmentation but also pass them to Marian as guided alignments. In this case, we should restart all tasks starting from the alignments-student and shortlist stages.