mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Improve translation of URLs #736

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

Sometimes URLs are written out in the visible text rather than hidden behind an HTML element. In this case, the URL should be copied as-is.

There are two ways to fix this:

  1. Possibly the easier way: identify URLs with a regex on the translation engine side and copy them through without passing them to the model
  2. Add data augmentation to insert URLs in some training examples and retrain the models
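
To make option 1 concrete, here is a minimal sketch in Python. It is an illustration only: the real engine-side code would not be Python, the `URL_RE` pattern is a deliberately simple assumption that only recognizes `http(s)://` links, and `translate` is a hypothetical callable standing in for the model.

```python
import re
from typing import Callable

# Assumption: a deliberately simple pattern, a scheme followed by
# non-space characters. Real-world URL detection is harder than this.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Punctuation that usually belongs to the sentence, not to the URL.
TRAILING_PUNCT = ".,;:!?)"


def translate_keeping_urls(text: str, translate: Callable[[str], str]) -> str:
    """Translate the text around URLs and copy the URLs through unchanged."""
    pieces = []
    pos = 0
    for match in URL_RE.finditer(text):
        url = match.group().rstrip(TRAILING_PUNCT)
        if match.start() > pos:
            # Text before the URL goes through the (hypothetical) model.
            pieces.append(translate(text[pos:match.start()]))
        pieces.append(url)  # The URL itself is copied verbatim.
        pos = match.start() + len(url)
    if pos < len(text):
        pieces.append(translate(text[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    # str.upper stands in for a real translation call.
    print(translate_keeping_urls("See https://example.com/a?b=1, then go.", str.upper))
```

Note that this splits the sentence around each URL, so the model sees each fragment without its full context. Swapping URLs for placeholder tokens would avoid that, but only if the model reliably copies the placeholders.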

marco-c commented 3 months ago

Could also be a data cleaning problem, like num_mismatch.
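
A rough sketch of what such a cleaning step might look like, assuming the idea is to drop training pairs whose URLs differ between source and target. This is not the implementation of `num_mismatch`. The function names and the simple `URL_RE` pattern are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumption: same simplistic http(s) pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> List[str]:
    """Return the URLs in text, without trailing sentence punctuation."""
    return [m.group().rstrip(".,;:!?)") for m in URL_RE.finditer(text)]


def urls_match(src: str, trg: str) -> bool:
    """True when both sides contain exactly the same multiset of URLs."""
    return Counter(extract_urls(src)) == Counter(extract_urls(trg))


def filter_url_mismatch(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the sentence pairs whose URLs agree on both sides."""
    return [(src, trg) for src, trg in pairs if urls_match(src, trg)]
```

Pairs with no URLs on either side pass through untouched, so the filter only removes examples that would teach the model to alter or drop a URL.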

gregtatum commented 3 months ago

I would push back against implementing option 1, which would happen on the Gecko side for every translation. That regex seems risky and error-prone to write. I would at least start with data augmentation. There is already an issue on file: https://github.com/hplt-project/OpusTrainer/issues/43
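
As a rough illustration of the data augmentation route, the sketch below adds the same synthetic URL to both sides of a training pair with some probability. It does not use OpusTrainer's API. The function names, the `prob` parameter, and the host and path lists are all hypothetical.

```python
import random
from typing import Tuple

# Assumption: reserved example domains and made-up paths, for illustration.
HOSTS = ["example.com", "example.org", "example.net"]
PATHS = ["", "/docs", "/a/b?id=42", "/index.html#top"]


def random_url(rng: random.Random) -> str:
    """Build a synthetic URL from the small pools above."""
    scheme = rng.choice(["http", "https"])
    return f"{scheme}://{rng.choice(HOSTS)}{rng.choice(PATHS)}"


def augment_pair(
    src: str, trg: str, rng: random.Random, prob: float = 0.1
) -> Tuple[str, str]:
    """With probability prob, add one identical URL to both sides."""
    if rng.random() >= prob:
        return src, trg
    url = random_url(rng)
    # Only prepend or append: inserting mid-sentence would need word alignment.
    if rng.random() < 0.5:
        return f"{url} {src}", f"{url} {trg}"
    return f"{src} {url}", f"{trg} {url}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment_pair("Hello world", "Hallo Welt", rng, prob=1.0))
```

Placing the URL only at the start or end keeps the two sides trivially aligned, at the cost of never showing the model a URL in the middle of a sentence.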

eu9ene commented 3 months ago

See also https://github.com/hplt-project/OpusTrainer/issues/43

jeremiahlee commented 1 month ago

I'm seeing this happen often with Swedish-to-English translations.