mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Make mono shuffling deterministic to fix caching issues #677

Closed by eu9ene 3 months ago

eu9ene commented 3 months ago

We already use the same code in merge-corpus.sh, so it should work here too:

https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/pipeline/translate/merge-corpus.sh#L57

I also checked the other places in the pipeline that use randomization, and seeds are already set there.
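For readers unfamiliar with the idea, here is a minimal Python sketch of a seeded shuffle. It is not the pipeline's code (the linked script is a shell script); the function name `shuffle_lines` and the seed value are hypothetical and chosen only for illustration. The point is that a fixed seed makes the output order a pure function of the input, so repeated runs produce identical files.

```python
import random


def shuffle_lines(lines, seed=42):
    """Return a shuffled copy of `lines`.

    The same seed and the same input always give the same order.
    `seed=42` is an arbitrary illustrative value, not taken from the pipeline.
    """
    rng = random.Random(seed)  # private generator, leaves global RNG state alone
    shuffled = list(lines)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled


mono = [f"sentence {i}" for i in range(10)]
first = shuffle_lines(mono)
second = shuffle_lines(mono)

# Two runs over the same input agree line for line.
print(first == second)
```

Without the fixed seed, each run could produce a different ordering, and any downstream step keyed on the shuffled output would see different inputs from run to run.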

fixes #669

bhearsum commented 3 months ago

I should also note that, because this file is in merge-mono's cache resources, this patch will bust caches. This is a feature going forward, but may not be what we want for the current training runs. @gabrielBusta is looking into whether previous_group_ids can help avoid rerunning the mono chain.
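As a rough illustration of why editing a cache resource busts caches, the sketch below derives a cache key from the contents of a task's resource files. This is an assumption about the general technique, not a copy of how this repository's task graph computes digests; `cache_digest`, the file name, and the file contents are all hypothetical placeholders.

```python
import hashlib


def cache_digest(resources, parameters):
    """Hash resource file contents and task parameters into one cache key.

    `resources` maps a file name to its bytes and `parameters` maps a name
    to a value. Both are hypothetical stand-ins for whatever a real task
    definition lists as its cache inputs.
    """
    h = hashlib.sha256()
    for name in sorted(resources):  # sorted so dict order cannot change the key
        h.update(name.encode())
        h.update(b"\0")
        h.update(resources[name])
        h.update(b"\0")
    for key in sorted(parameters):
        h.update(f"{key}={parameters[key]}".encode())
        h.update(b"\0")
    return h.hexdigest()


# Placeholder contents, not the real script.
before = {"merge-script.sh": b"shuffle without a seed\n"}
after = {"merge-script.sh": b"shuffle with a fixed seed\n"}
params = {"locale": "xx"}

old_key = cache_digest(before, params)
new_key = cache_digest(after, params)

# Any edit to a listed resource produces a new key, so no cached task matches.
print(old_key != new_key)
```

Under this scheme, the first run after the patch cannot reuse the old task, and every task downstream of it is recomputed as well.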

eu9ene commented 3 months ago

@bhearsum thanks for the detailed analysis! Yes, your understanding is correct. The goal is to make this step deterministic while still reusing the same tasks across runs, which will save resources when training at scale. It's indeed a good idea to run one model first and then let the others use the cached mono branch.

We'll need to run many more training pairs in this training package, so we should implement a proper fix regardless of our approach to the 5 failed teachers. My current idea is to just reuse the backward model for them and rerun everything else with the busted caches, which should guarantee the correct behaviour. Otherwise, it would be hard to sync the two different paths of merge-mono+split+collect. You can see the configs for the affected languages that follow this idea here. The downside is that it will rerun the decoding with the backward model. I'm open to exploring other options too.