mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Make merging back-translations more reliable #675

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

It seems we're having issues with the merging code for the second time: https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/pipeline/translate/collect.sh#L41

The sort on that line behaves differently on different machines.

I attached to the collect task and saw that the chunks are not listed in numeric order (file.1 comes after file.19):

root@6d9ffc892259:~/fetches# ls
file.10.out.zst  file.12.out.zst  file.14.out.zst  file.16.out.zst  file.18.out.zst  file.1.out.zst   file.2.out.zst  file.4.out.zst  file.6.out.zst  file.8.out.zst  mono.en.zst
file.11.out.zst  file.13.out.zst  file.15.out.zst  file.17.out.zst  file.19.out.zst  file.20.out.zst  file.3.out.zst  file.5.out.zst  file.7.out.zst  file.9.out.zst
root@6d9ffc892259:~/fetches# find . -name '*.out.zst' | sort -t '.' -k2,2n
./file.10.out.zst
./file.11.out.zst
./file.12.out.zst
./file.13.out.zst
./file.14.out.zst
./file.15.out.zst
./file.16.out.zst
./file.17.out.zst
./file.18.out.zst
./file.19.out.zst
./file.1.out.zst
./file.20.out.zst
./file.2.out.zst
./file.3.out.zst
./file.4.out.zst
./file.5.out.zst
./file.6.out.zst
./file.7.out.zst
./file.8.out.zst
./file.9.out.zst
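
In the session above the paths start with `./`, so splitting on `.` makes the second field `/file` rather than the chunk number, which may explain the ordering shown there. Whether the pipeline itself calls the command with such paths is not established here. As a minimal sketch of an ordering that does not depend on field positions or locale, the chunk number can be parsed from the file name. The helper name `sort_chunks` is hypothetical; the `file.N.out.zst` pattern is taken from the listing above.

```python
import re
from pathlib import Path

# Naming pattern taken from the listing above: file.<N>.out.zst
CHUNK_RE = re.compile(r"^file\.(\d+)\.out\.zst$")


def sort_chunks(paths):
    """Return chunk paths ordered by their numeric index (hypothetical helper)."""

    def index(path):
        # Match on the base name only, so "./", relative and absolute
        # paths are all handled the same way.
        match = CHUNK_RE.match(Path(path).name)
        if match is None:
            raise ValueError(f"unexpected chunk name: {path}")
        return int(match.group(1))

    return sorted(paths, key=index)
```

Failing loudly on an unexpected name is a deliberate choice in this sketch: a silently skipped or misplaced chunk is exactly what produces a misaligned corpus.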

To make this reliable, each translate task should write its input to the artifacts along with its output. The collect task could then output the full corpus rather than only the translated part, which would also make the pipeline easier to debug.
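
The proposal above could be sketched roughly as follows, assuming each translate task hands back its chunk number together with both its input and its output lines. The function name `merge_chunks` and the tuple layout are assumptions for illustration, not the pipeline's actual interface, and decompressing the `.zst` artifacts is left out.

```python
def merge_chunks(chunks):
    """Merge (index, source_lines, translated_lines) tuples into a full corpus.

    Hypothetical helper: each chunk carries its own input, so the source and
    translated sides are built from the same tuples and cannot drift apart.
    """
    merged_src, merged_trg = [], []
    # Order by the numeric chunk index, not by file name.
    for index, src, trg in sorted(chunks, key=lambda chunk: chunk[0]):
        # A per-chunk line count check points at the failing translate task.
        if len(src) != len(trg):
            raise ValueError(
                f"chunk {index}: {len(src)} input lines but {len(trg)} output lines"
            )
        merged_src.extend(src)
        merged_trg.extend(trg)
    return merged_src, merged_trg
```

For example, `merge_chunks([(2, ["c"], ["C"]), (1, ["a", "b"], ["A", "B"])])` yields both sides in chunk order, while a chunk with differing line counts raises an error naming that chunk.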

eu9ene commented 3 months ago

It turned out the issue was not in the sorting (see https://github.com/mozilla/firefox-translations-training/issues/669#issuecomment-2164118871), but it would still be great to reduce the risk here by running the merge-mono step only once and then copying the input to artifacts in the other steps. Then we would never end up with a misaligned corpus.