mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Improve translation of URLs #736

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

Sometimes URLs appear in the visible text rather than being hidden behind an HTML element. In this case the URL should be copied to the output as is.

There are two ways to fix this:

  1. Possibly the easier way: identify URLs with a regex on the translation engine side and copy them through without passing them to the model
  2. Add data augmentation that inserts URLs into some training examples, then retrain the models
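
To make option 1 concrete, here is a minimal Python sketch of the idea, not the engine's actual code. The pattern `URL_RE`, the helper names, and the `translate` callback are all hypothetical, and the regex only covers `http(s)://` URLs, which is exactly the kind of simplification that makes this approach fragile.

```python
import re
from typing import Callable, List, Tuple

# Assumption: a deliberately simple pattern that only matches scheme-prefixed URLs.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
# Punctuation that usually belongs to the sentence rather than to the URL.
TRAILING_PUNCT = ".,;:!?)]}"


def split_url_segments(text: str) -> List[Tuple[str, bool]]:
    """Split text into (segment, is_url) pairs, preserving the original order."""
    segments: List[Tuple[str, bool]] = []
    pos = 0
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCT)
        end = match.start() + len(url)
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((url, True))
        pos = end
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


def translate_preserving_urls(text: str, translate: Callable[[str], str]) -> str:
    """Translate only the non-URL segments and copy URLs verbatim.

    `translate` stands in for the real model call (hypothetical).
    """
    return "".join(
        segment if is_url else translate(segment)
        for segment, is_url in split_url_segments(text)
    )
```

Note that translating the surrounding segments separately would deprive the model of sentence context, so a real implementation would more likely substitute a placeholder token and restore the URL afterwards.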

marco-c commented 1 month ago

Could also be a data cleaning problem, like num_mismatch.
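
As an illustration of what such a cleaning rule could look like, the Python sketch below drops sentence pairs whose URLs differ between source and target. It is an assumption made by analogy with a numeral-mismatch filter, not the pipeline's existing code; `URL_RE`, `extract_urls`, and `urls_match` are hypothetical names.

```python
import re
from typing import List

# Assumption: the same simplified http(s) pattern; real data may need a broader one.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> List[str]:
    """Return the URLs in text, sorted, with trailing sentence punctuation removed."""
    return sorted(url.rstrip(".,;:!?)]}") for url in URL_RE.findall(text))


def urls_match(src: str, trg: str) -> bool:
    """Keep a pair only if both sides contain exactly the same URLs."""
    return extract_urls(src) == extract_urls(trg)
```

A filter like this would remove training pairs in which the reference translation dropped or altered a URL, so the model sees fewer examples that teach it to rewrite URLs.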

gregtatum commented 1 month ago

I would push back against implementing option 1, which would run on the Gecko side for every translation. That regex seems risky and error-prone to write. I would at least start with data augmentation. There is already an issue on file: https://github.com/hplt-project/OpusTrainer/issues/43
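
A rough Python sketch of what URL augmentation could look like follows. It is not OpusTrainer's API: `SAMPLE_URLS`, `augment_with_url`, and the `prob` parameter are hypothetical, and appending the URL at the end of both sides is a simplification that avoids needing word alignments.

```python
import random
from typing import Tuple

# Hypothetical pool of URLs to inject; a real modifier would likely generate varied ones.
SAMPLE_URLS = [
    "https://example.com",
    "https://example.org/docs/page.html",
    "http://example.net/search?q=test&lang=en",
]


def augment_with_url(
    src: str, trg: str, rng: random.Random, prob: float = 0.1
) -> Tuple[str, str]:
    """With probability `prob`, append the same URL to both sides of a sentence pair."""
    if rng.random() >= prob:
        return src, trg
    url = rng.choice(SAMPLE_URLS)
    # The identical string on both sides is the signal that URLs should be copied verbatim.
    return f"{src} {url}", f"{trg} {url}"
```

Applied to a fraction of the training corpus, for example `augment_with_url(src, trg, random.Random(42))`, this would expose the model to pairs in which the URL is identical in source and target.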

eu9ene commented 1 month ago

See also https://github.com/hplt-project/OpusTrainer/issues/43