eu9ene closed this 4 months ago
I should also note that because this file is in merge-mono's cache resources, this patch will bust caches. That is a feature going forward, but it may not be what we want for the current training runs. @gabrielBusta is looking into whether or not `previous_group_ids` can help avoid rerunning the mono chain.
@bhearsum thanks for the detailed analysis! Yes, your understanding is correct. The goal is to make this step deterministic while still reusing the same tasks across runs, which will save some resources when training at scale. It's indeed a good idea to run one model first and then have the others use the cached mono branch.
We'll need to run many more training pairs in this training package, so we should implement a proper fix regardless of our approach to the 5 failed teachers. My current idea is to reuse the backward model for them and rerun everything else with the busted caches, which should guarantee correct behaviour; otherwise it would be hard to keep the two different paths of merge-mono+split+collect in sync. You can see the configs for the affected languages that follow this idea here. The downside is that it will rerun the decoding with the backward model. I'm open to exploring other options too.
We use the same code here, so it should work:
https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/pipeline/translate/merge-corpus.sh#L57
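For reference, the general technique for making `shuf` deterministic is to feed it a repeatable byte stream via `--random-source`. This is a hedged sketch (the helper name and seed value are illustrative, not taken from the pipeline; the recipe for a seeded byte stream comes from the GNU coreutils manual):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative helper: derive a repeatable pseudo-random byte stream
# from a seed string (GNU coreutils manual recipe).
get_seeded_random() {
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

printf 'line-a\nline-b\nline-c\nline-d\n' > corpus.txt

# Two shuffles with the same seed produce byte-identical output,
# so downstream tasks see the same merged corpus across runs.
shuf --random-source=<(get_seeded_random 42) corpus.txt > run1.txt
shuf --random-source=<(get_seeded_random 42) corpus.txt > run2.txt
cmp run1.txt run2.txt && echo "deterministic"
```

With a different seed the order changes, but any single seed is stable across re-runs, which is what cached tasks need.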
I also checked randomization elsewhere in the pipeline, and we have seeds set there.
fixes #669