mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Do not use WMTNews as training! #911

Open ZJaume opened 2 days ago

ZJaume commented 2 days ago

The WMTNews corpus at OPUS is just a compilation of the WMT test sets, so it must not be included in the training data:

https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-tr-spring-2024.yml#L79-L79

https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/configs/autogenerated/en-ro-spring-2024.yml#L123

eu9ene commented 2 days ago

Great catch! We remove it in the find-corpus script:

https://github.com/mozilla/firefox-translations-training/blob/1f7ab70cd4dbb64e16bb6b38840490c2f2259cb0/utils/find_corpus.py#L99

but it seems it got lost with all the refactorings and migration to the config generator... cc @gregtatum
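A minimal sketch of what such a guard could look like in the config generator, so that test-set compilations never reach the training list. This is not the code from find_corpus.py: the function names, the `EXCLUDED_CORPORA` set, and the `importer_Name/version` dataset key format (e.g. `opus_WMT-News/v2019`) are all assumptions made for illustration.

```python
import re

# Hypothetical blocklist of corpora that are compilations of evaluation sets.
# Names are stored normalized: lowercase, alphanumeric characters only.
EXCLUDED_CORPORA = {"wmtnews"}


def normalize_corpus_name(dataset: str) -> str:
    """Reduce a dataset key such as 'opus_WMT-News/v2019' to 'wmtnews'.

    Assumes keys look like '<importer>_<name>/<version>'; keys without an
    importer prefix or a version are handled as well.
    """
    name = dataset.split("_", 1)[-1]  # drop the importer prefix, if any
    name = name.split("/", 1)[0]      # drop the version suffix, if any
    return re.sub(r"[^a-z0-9]", "", name.lower())


def filter_training_datasets(datasets):
    """Split dataset keys into (kept, dropped) according to the blocklist."""
    kept, dropped = [], []
    for dataset in datasets:
        if normalize_corpus_name(dataset) in EXCLUDED_CORPORA:
            dropped.append(dataset)
        else:
            kept.append(dataset)
    return kept, dropped


if __name__ == "__main__":
    # Illustrative dataset keys, not copied from a real config.
    kept, dropped = filter_training_datasets(
        ["opus_WMT-News/v2019", "opus_OpenSubtitles/v2018"]
    )
    print("kept:", kept)
    print("dropped:", dropped)
```

Normalizing the name before the lookup means the check does not depend on whether the corpus is spelled WMTNews or WMT-News. Running the same check in CI over the generated configs would also catch the entry if it is added back by hand.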

The outcome is quite sad, as all our WMT-based evaluation benchmarks for this cohort of languages are not correct. Flores should be fine.

eu9ene commented 2 days ago

@marco-c FYI too

eu9ene commented 1 day ago

OK, maybe not all the results are incorrect, only the ones from before 2019.