Open acornestean opened 3 years ago
This is a very interesting finding. I updated the model to the latest version, which was trained on a bigger dataset, and it became even funnier: "mother" is now translated as "Мать и сестра" (mother and sister)! I see similarly wrong behaviour for many other single-word examples.
I assume it's because the model was trained only on long sentences and has never seen single-word ones (we have special cleaning rules for this). That might make sense for web page translation, but it doesn't for a Google Translate kind of user experience. What is weird is that we don't see this problem for other languages. We'll do another pass to improve the quality of the models; maybe it will be fixed then.
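To illustrate the hypothesis, here is a minimal sketch of a length-based cleaning rule. The function name `keep_pair` and the `min_words` threshold are hypothetical and are not taken from the actual pipeline; the point is only that any rule of this shape removes single-word pairs from the training data.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def keep_pair(src: str, trg: str, min_words: int = 3) -> bool:
    """Hypothetical rule: keep a pair only if both sides have at least min_words tokens."""
    return len(src.split()) >= min_words and len(trg.split()) >= min_words


# Illustrative sentence pairs, not real training data.
pairs: List[Pair] = [
    ("mother", "мать"),
    ("My mother lives in Moscow", "Моя мать живёт в Москве"),
]

# The single-word pair is dropped, so the model never sees it during training.
kept = [p for p in pairs if keep_pair(*p)]
```

With a rule like this, `kept` contains only the long sentence, which would be consistent with the model behaving oddly on one-word input.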
Clean corpora should be allowed to bypass the bicleaner rules. For better or worse, this means maintaining a manual mapping of corpus to cleanliness. It's happening for ru because, as I understand it, @eu9ene's current pipeline puts everything through bicleaner, whereas the consortium-provided models used the janky manual pipeline in which only some corpora are cleaned with bicleaner.
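A rough sketch of what such a mapping could look like. The corpus names, the `CORPUS_IS_CLEAN` table, and the `noisy_filter` callable (a stand-in for whatever bicleaner-based step the pipeline runs) are all assumptions for illustration, not the real configuration or the bicleaner API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]

# Hypothetical manual mapping of corpus name to cleanliness.
CORPUS_IS_CLEAN: Dict[str, bool] = {
    "curated-dictionary": True,   # trusted; should bypass the noisy-corpus filter
    "web-crawl": False,           # noisy; should be filtered
}


def clean_corpus(
    name: str,
    pairs: Iterable[Pair],
    noisy_filter: Callable[[Pair], bool],
) -> List[Pair]:
    """Apply noisy_filter only to corpora not marked clean; unknown corpora count as noisy."""
    if CORPUS_IS_CLEAN.get(name, False):
        return list(pairs)
    return [p for p in pairs if noisy_filter(p)]


def drop_single_words(pair: Pair) -> bool:
    """Stand-in filter for this example: reject pairs whose source is a single word."""
    return len(pair[0].split()) > 1


sample = [("mother", "мать"), ("My mother is here", "Моя мать здесь")]
from_clean = clean_corpus("curated-dictionary", sample, drop_single_words)
from_noisy = clean_corpus("web-crawl", sample, drop_single_words)
```

Here `from_clean` keeps the single-word pair while `from_noisy` loses it. Defaulting unknown corpora to "noisy" is a conservative design choice; the cost is that the mapping has to be maintained by hand.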
[Affected versions]: Firefox Nightly (94.0a1/20210908213905)
[Affected Platforms]: Windows 10 x64
[Prerequisites]: Access https://mozilla.github.io/translate/.
[Steps to reproduce]:
NOTES:
[Expected]: Translating one word should result in only one appearance of the translated word and not duplicates.
[Actual]: Translating one word causes the translated word to appear twice.