mozilla / translate

Translations website utilizing Bergamot proceedings
https://mozilla.github.io/translate
Mozilla Public License 2.0

Some single words are being duplicated in the translation in EN-RU combination #2

Open acornestean opened 3 years ago

acornestean commented 3 years ago

[Affected versions]: Firefox Nightly (94.0a1/20210908213905)

[Affected Platforms]: Windows 10 x64

[Prerequisites]: Access https://mozilla.github.io/translate/.

[Steps to reproduce]:

  1. On the translation website, set EN-RU as a language combination
  2. In the “From” field, type a single word, for example “mother”, “love”, or “people”. (These were all I could find that behave this way. Other words, such as “pear”, “pea”, and “carrot”, result in only one appearance of the translated word.)
  3. Notice that the translated word appears twice in the translation field.

NOTES:

  1. Adding a second word in the “From” field after the original one causes the duplicate to be replaced by the translation of the second word, e.g. “mother” is translated to “мать мать”, but “mother apple” is translated to “мать яблоко”.
  2. Capitalizing the word results in only one appearance of the translated word, e.g. “love” is translated to “Любить любовь”, but “Love” is translated to “Любовь”.

[Expected]: Translating one word should result in only one appearance of the translated word and not duplicates.

[Actual]: Translating one word causes the translated word to appear twice.
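For anyone who wants to check outputs automatically, here is a minimal sketch of a duplicate detector. The function name `has_repeated_word` is made up for illustration and is not part of the translate website or Bergamot. It only catches exact adjacent repeats such as “мать мать”; it would not flag “Любить любовь”, where the two words differ.

```python
def has_repeated_word(translation: str) -> bool:
    """Return True if any two adjacent words are identical, ignoring case.

    Hypothetical helper for illustration only; it knows nothing about
    the translation engine and just inspects the output string.
    """
    words = translation.casefold().split()
    return any(a == b for a, b in zip(words, words[1:]))


# Outputs quoted in this report
print(has_repeated_word("мать мать"))      # True
print(has_repeated_word("мать яблоко"))    # False
print(has_repeated_word("Любить любовь"))  # False: different word forms
```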

eu9ene commented 3 years ago

This is a very interesting finding. I updated the model to the latest version, trained on a bigger dataset, and it became even funnier: mother is now translated as "Мать и сестра" (mother and sister)! I see similar wrong behaviour for many other single-word examples.

I assume it's because the model was trained only on long sentences and has never seen single-word ones (we have special cleaning rules for this). That might make sense for web page translation, but it doesn't for a Google Translate-style user experience. What is weird is that we don't see this problem for other languages. We'll have another pass to improve the quality of the models; maybe it will be fixed then.
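To illustrate the hypothesis above, here is a minimal sketch of how a minimum-length cleaning rule would keep single-word pairs out of the training data. The actual cleaning rules are not shown in this thread, so the function `keep_pair`, the `min_words` threshold of 3, and the sample pairs are all assumptions.

```python
def keep_pair(source: str, target: str, min_words: int = 3) -> bool:
    """Keep a sentence pair only if both sides have at least min_words words.

    Hypothetical rule: the real pipeline's thresholds and checks may differ.
    """
    return len(source.split()) >= min_words and len(target.split()) >= min_words


# Made-up sample pairs, not taken from any real corpus
pairs = [
    ("mother", "мать"),
    ("I love my mother very much.", "Я очень люблю свою маму."),
]

kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))  # 1: the single-word pair is dropped
```

With a rule like this, the model would never see an example where a one-word source maps to a one-word target, which is consistent with the behaviour described in this issue.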

kpu commented 2 years ago

Clean corpora should be allowed to bypass the bicleaner rules. For better or worse, this means a manual mapping of corpus to cleanliness. As I understand it, this is happening for ru because @eu9ene's current pipeline puts everything through bicleaner, whereas the consortium-provided models used the janky manual pipeline, in which only some corpora are cleaned with bicleaner.
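A manual mapping of corpus to cleanliness could look roughly like the sketch below. This is not the pipeline's actual code: the corpus names, the `CLEAN_CORPORA` set, the `filter_corpus` function, and the 0.5 threshold are all hypothetical, and each pair is assumed to already carry a precomputed score.

```python
# Hypothetical, manually maintained set of corpora trusted as clean
CLEAN_CORPORA = {"example_clean_corpus"}


def filter_corpus(name, scored_pairs, threshold=0.5):
    """Return (source, target) pairs, filtering by score unless the corpus is trusted.

    scored_pairs is an iterable of (source, target, score) tuples. The score
    is assumed to be precomputed; the threshold is an illustrative value.
    """
    if name in CLEAN_CORPORA:
        # Trusted corpus: bypass score-based filtering entirely
        return [(src, tgt) for src, tgt, _ in scored_pairs]
    return [(src, tgt) for src, tgt, score in scored_pairs if score >= threshold]


# Made-up data: a short pair with a low score and a longer pair with a high score
sample = [("mother", "мать", 0.2), ("I love my mother.", "Я люблю свою маму.", 0.9)]

print(len(filter_corpus("example_clean_corpus", sample)))  # 2
print(len(filter_corpus("example_noisy_corpus", sample)))  # 1
```

The trade-off is the one noted above: someone has to decide by hand which corpora go into the trusted set.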