I did not realize that the Moses tokenizer can modify text. With aggressive dash splitting enabled, dashes are represented as @-@ in the tokenized text. The tests didn't catch this because there I tokenize the text differently, with the Python Moses tokenizer sacremoses and without aggressive dash splitting, which is a separate problem. The C++-based opus-fast-mosestokenizer that we use in prod didn't install on macOS for me, and I wanted this quick test to run without Docker.
The implication is that some of the remapped alignments may have been incorrect, but I assume most sentences don't include dashes, so it's not critical. It only happens for words where the dash is part of the word, for example: semi-colon. I tested it with the bug, and it leads to merging all the words after the dash into "one word" in their alignments.
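To illustrate the failure mode, here is a minimal, stdlib-only sketch of how a remapping step could end up merging everything after a dashed word. It is not the code from this PR: the function name `map_tokens_to_words` and the `dash_marker` parameter are hypothetical, and the sketch assumes the remapping rebuilds each whitespace-separated word by concatenating Moses tokens. It also ignores Moses HTML escaping (e.g. `&amp;`), which would need the same kind of handling.

```python
def map_tokens_to_words(words, tokens, dash_marker="@-@"):
    """Map each Moses token to the index of the whitespace word it came from.

    Hypothetical helper: tokens are concatenated until they spell out the
    current word. Pass dash_marker=None to reproduce the buggy behaviour,
    where "@-@" is compared against the original text as-is.
    """
    mapping = []
    word_idx = 0
    consumed = ""
    for token in tokens:
        # Undo the text modification made by aggressive dash splitting.
        surface = "-" if token == dash_marker else token
        consumed += surface
        mapping.append(word_idx)
        if consumed == words[word_idx]:
            # The current word is fully covered, move on to the next one.
            word_idx += 1
            consumed = ""
    return mapping


words = "a semi-colon here".split()
tokens = ["a", "semi", "@-@", "colon", "here"]

# With the marker translated back to "-", "here" keeps its own index.
fixed = map_tokens_to_words(words, tokens)  # [0, 1, 1, 1, 2]

# Without it, "semi@-@colon" never matches "semi-colon", so every
# following token stays attached to the dashed word.
buggy = map_tokens_to_words(words, tokens, dash_marker=None)  # [0, 1, 1, 1, 1]
```

In the buggy run the word index never advances past the dashed word, which matches the "one word" merging described above.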
I think the implications for teacher training are minor: there's some probability of inserting inline noise at the wrong position in sentences with dashed words.
As for the student, I think it's more important to land the fix there, because we use alignments not only for data augmentation but also pass them to Marian as guided-alignments. In this case, we should restart all the tasks starting from the alignments-student and shortlist stages.