Open eu9ene opened 4 months ago
It turned out the issue was not in sorting (see https://github.com/mozilla/firefox-translations-training/issues/669#issuecomment-2164118871), but it would still be great to reduce the risk here by running the `merge-mono` step only once and then copying the input to artifacts in the other steps. Then we would never end up with a misaligned corpus.
It seems we're having issues with the merging code for the second time: https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/pipeline/translate/collect.sh#L41
This code behaves differently on different machines.
I attached to the collect task and see this behaviour:
We should make each `translate` task write not only its output but also its input to artifacts, to make this all reliable. Then the `collect` task would also output the full corpus and not only the translated part, which makes it easier to debug the pipeline.
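
A rough sketch of what this could look like, not the actual pipeline code: the function names (`write_chunk`, `collect`) and the `file.<N>.src` / `file.<N>.trg` naming are made up for illustration. Each translate step stores its input next to its output, and collect pairs them up per chunk, ordering chunks by their numeric id and failing loudly if the line counts differ.

```python
from pathlib import Path


def _write_lines(path: Path, lines: list[str]) -> None:
    # newline="\n" keeps line endings identical on every machine.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def _read_lines(path: Path) -> list[str]:
    # Split on "\n" only, so unusual Unicode separators inside a
    # sentence cannot shift the alignment.
    with open(path, encoding="utf-8", newline="\n") as f:
        return [line.rstrip("\n") for line in f]


def write_chunk(artifacts_dir: Path, chunk_id: int,
                inputs: list[str], outputs: list[str]) -> None:
    """Hypothetical translate step: publish the input together with the output."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f"chunk {chunk_id}: {len(inputs)} inputs vs {len(outputs)} outputs"
        )
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    _write_lines(artifacts_dir / f"file.{chunk_id}.src", inputs)
    _write_lines(artifacts_dir / f"file.{chunk_id}.trg", outputs)


def collect(artifacts_dir: Path) -> list[tuple[str, str]]:
    """Hypothetical collect step: rebuild the full parallel corpus from artifacts."""
    # Order by the integer chunk id instead of relying on how the
    # shell or the locale happens to sort file names.
    src_files = sorted(
        artifacts_dir.glob("file.*.src"),
        key=lambda p: int(p.name.split(".")[1]),
    )
    pairs: list[tuple[str, str]] = []
    for src_path in src_files:
        trg_path = src_path.with_suffix(".trg")
        src = _read_lines(src_path)
        trg = _read_lines(trg_path)
        if len(src) != len(trg):
            raise ValueError(f"misaligned chunk: {src_path.name} vs {trg_path.name}")
        pairs.extend(zip(src, trg))
    return pairs
```

Since every chunk carries its own input, collect never has to re-derive the source side from a separately merged file, and a misaligned chunk is reported by name instead of silently shifting the rest of the corpus.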