eu9ene closed this issue 1 week ago.
I reran uk-en from the same decision task as the proper training run, and it follows the path of the wrong one: https://firefox-ci-tc.services.mozilla.com/tasks/V4SL4XEWS6SM1eM8kPiqjw/runs/0.
I found differences in the train logs:
### Number of empty alignments for /home/ubuntu/tasks/task_id/fetches/mono.aln.zst: 162528702
vs
### Number of empty alignments for /home/ubuntu/tasks/task_id/fetches/mono.aln.zst: 2555
So for the incorrect training, almost all the alignments for the back-translations are empty, which means OpusTrainer will skip those examples.
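For reference, that empty-alignment count can be checked directly: each alignment line holds `src-trg` token index pairs for one sentence pair, and an empty line means the aligner produced nothing for it. This is a hypothetical sketch (the real check lives in the training pipeline), demonstrated on inline data; on the real file it would be `zstdcat mono.aln.zst | grep -c '^$'`:

```shell
# Each alignment line holds "src-trg" token index pairs for one sentence
# pair; an empty line means no alignment was produced, so OpusTrainer
# will skip that example. Count the empty lines:
printf '0-0 1-1 2-3\n\n0-1\n\n' | grep -c '^$'   # prints 2
```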
@gabrielBusta so, you changed the b-cpu-xlargedisk machine for alignments, right? Were there any changes in the machine's environment (OS, Docker image, etc.)?
I see that eflomal has some C++ code that might behave differently in a different system environment: https://github.com/robertostling/eflomal/tree/master/src
That is so strange. I did change b-cpu-xlargedisk. I added more RAM (and the CPUs required to support it), then terminated the workers. I don't see how that would affect the number of empty alignments 👀
Maybe it's related to #658? It seems suspicious that the regression started after that landed.
No, I reran the uk-en pair from the same train action as the first run, without the sacrebleu fix, and I see the same incorrect behavior because it now uses the newly produced alignments. And I think this happened for all languages that were launched after the fix.
Update: the issue is in the back-translated data; it is not a parallel corpus.
Ok, the next hypothesis is that the new machine has affected the sorting behaviour in the translate/collect.sh script:
I attached to the collect-mono-trg task and tried the code there. It sorts incorrectly, which results in incorrect merging of the translated texts. We had this issue before and fixed it with this new code.
root@6d9ffc892259:~/fetches# find . -name '*.out.zst' | sort -t '.' -k2,2n
./file.10.out.zst
./file.11.out.zst
./file.12.out.zst
./file.13.out.zst
./file.14.out.zst
./file.15.out.zst
./file.16.out.zst
./file.17.out.zst
./file.18.out.zst
./file.19.out.zst
./file.1.out.zst
./file.20.out.zst
./file.2.out.zst
./file.3.out.zst
./file.4.out.zst
./file.5.out.zst
./file.6.out.zst
./file.7.out.zst
./file.8.out.zst
./file.9.out.zst
For the record, nothing changed about the OS - we use the exact same image as we did with fewer CPUs. The CPU type also didn't change. The only thing we did was add more CPUs and RAM. (If there's any code that parallelizes across cores I guess I could imagine this theoretically being able to influence behaviour.)
Ok, I actually tested it incorrectly. This gives me the correct sorting, so it's probably something else:
export chunks_dir=fetches
find "${chunks_dir}" -name '*.out.zst' | sort -t '.' -k2,2n
fetches/file.1.out.zst
fetches/file.2.out.zst
fetches/file.3.out.zst
fetches/file.4.out.zst
fetches/file.5.out.zst
fetches/file.6.out.zst
fetches/file.7.out.zst
fetches/file.8.out.zst
fetches/file.9.out.zst
fetches/file.10.out.zst
fetches/file.11.out.zst
fetches/file.12.out.zst
fetches/file.13.out.zst
fetches/file.14.out.zst
fetches/file.15.out.zst
fetches/file.16.out.zst
fetches/file.17.out.zst
fetches/file.18.out.zst
fetches/file.19.out.zst
fetches/file.20.out.zst
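A plausible explanation for the two different results (my assumption, not verified against the task): `find .` prefixes every path with `./`, so splitting on `.` puts `/file` in field 2 instead of the chunk number, and the numeric sort key degenerates to 0 for every line:

```shell
# With the "./" prefix from `find .`, field 2 (delimiter '.') is "/file",
# which sort -k2,2n treats as the same non-numeric key for every line:
echo './file.10.out.zst' | cut -d '.' -f 2        # prints "/file"
# Without the "./" prefix, field 2 is the chunk number, as intended:
echo 'fetches/file.10.out.zst' | cut -d '.' -f 2  # prints "10"
# So the numeric sort only works on the unprefixed paths:
printf '%s\n' fetches/file.10.out.zst fetches/file.2.out.zst | sort -t '.' -k2,2n
```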
But the fact is that the sentences in collect-mono-trg and merge-mono-trg are not properly aligned, so it's basically noise. This explains the behavior of the aligner and the teacher training.
Ok, I think I know what's going on. Let's look at the task graph of the failed task group: train-teacher and alignments-backtranslated depend on one merge-mono-trg task, but if we go from collect-mono-trg -> translate -> split-mono-trg, it depends on another merge-mono-trg task. This happens because our target mono dataset is the same for all xx-en runs and is shared across all task groups. The issue is that if we run several trainings simultaneously, we get multiple different mono en tasks, and I guess the latest overwrites the cache. Check out this funny graph.
It sounds like we should probably add the src+trg information to the cache resources in this case, e.g. a from-parameters section in https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/taskcluster/kinds/merge-mono/kind.yml#L25 similar to https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/taskcluster/kinds/translate-corpus/kind.yml#L38-L40.
Or we could make sure the src+trg is included in the label.
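A sketch of what that could look like in the merge-mono kind (illustrative only: the key names `src_locale`/`trg_locale` and the exact schema are my assumptions; the real schema should be copied from the linked translate-corpus kind):

```yaml
# Hypothetical sketch: add the language pair to the cache digest so
# concurrent xx-en runs stop sharing (and clobbering) one merge-mono entry.
cache:
  from-parameters:
    src_locale: training_config.experiment.src
    trg_locale: training_config.experiment.trg
```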
It looks like we previously had src+trg in the label, but it was removed in https://github.com/mozilla/firefox-translations-training/commit/be065b5465199984da48562ba17ebd585667b0eb#diff-ed01a886331a09bef4a6ef2bef2ca8ed8b721579d5cd7e33316b68a2f2b7a0e8, fwiw.
The idea was to reuse the cache for mono tasks, and it's working, but the issue here is that merge-mono is not deterministic because it includes data shuffling: the same inputs give you different results. Otherwise, this would work. Even if we add src and trg as cache keys, we could still hit this issue by running two uk-en train runs at the same time. Another workaround would be to use a fixed random seed to make the shuffling deterministic.
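As a sketch of that workaround (assuming GNU coreutils; this is the seeded-source recipe from the coreutils manual, not the pipeline's actual implementation): feeding `shuf` a reproducible byte stream as `--random-source` makes the shuffle a pure function of the seed and the input, so re-running the task yields byte-identical, cache-stable output.

```shell
# Deterministic shuffle: derive shuf's randomness from a fixed passphrase
# so the same input + seed always produces the same order.
get_seeded_random() {
  # AES-CTR keystream seeded by the passphrase: a reproducible byte stream.
  openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
}
seq 5 | shuf --random-source=<(get_seeded_random 'my-seed')
```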
I made shuffling deterministic, so I'm closing this.
I see a super weird major regression in teacher training quality after the fix of the Sacrebleu importer in the release branch.
The configs for uk-en were identical; moreover, sacrebleu is not used in the dev/test datasets, so there's no reason for it to regress. I see the same for the other restarted models: cs-en, el-en, ro-en, sk-en. COMET evals are ~0.5 (meaning garbage) for all of them.
- Proper training uk-en train action
- Regressed train action
- Proper training run
- Regressed run
@bhearsum @gabrielBusta I wonder if this could somehow be related to infrastructure. Did we make any changes at the infra level recently that might affect the release branch?