eu9ene closed this issue 1 week ago.
I reran uk-en from the same decision task as the proper training run, and it follows the path of the wrong one: https://firefox-ci-tc.services.mozilla.com/tasks/V4SL4XEWS6SM1eM8kPiqjw/runs/0.
I found differences in the train logs:
### Number of empty alignments for /home/ubuntu/tasks/task_id/fetches/mono.aln.zst: 162528702
vs
### Number of empty alignments for /home/ubuntu/tasks/task_id/fetches/mono.aln.zst: 2555
So for the incorrect training, almost all the alignments for the back-translations are empty, which means OpusTrainer will skip those examples.
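For reference, that empty-alignment count can be checked directly: each alignment line holds `src-trg` token index pairs for one sentence pair, and an empty line means the aligner produced nothing for it. This is a hypothetical sketch (the real check lives in the training pipeline), demonstrated on inline data; on the real file it would be `zstdcat mono.aln.zst | grep -c '^$'`:

```shell
# Each alignment line holds "src-trg" token index pairs for one sentence
# pair; an empty line means no alignment was produced, so OpusTrainer
# will skip that example. Count the empty lines:
printf '0-0 1-1 2-3\n\n0-1\n\n' | grep -c '^$'   # prints 2
```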
@gabrielBusta so, you changed the b-cpu-xlargedisk machine for alignments, right? Were there any changes in the machine's environment (OS, Docker image, etc.)?
I see that eflomal has some C++ code that might behave differently in a different system environment: https://github.com/robertostling/eflomal/tree/master/src
That is so strange. I did change b-cpu-xlargedisk. I added more RAM (and the CPUs required to support it), then terminated the workers. I don't see how that would affect the number of empty alignments 👀
Maybe it's related to #658? It seems suspicious that the regression started after that landed.
No, I reran the uk-en pair from the same train action as the first run, without the sacrebleu fix, and I see the same incorrect behavior because it now uses the newly produced alignments. And I think this happened for all languages that were launched after the fix.
Update: the issue is in the back-translated data; it is not a parallel corpus.
Ok, the next hypothesis is that the new machine has affected the sorting behaviour in the translate/collect.sh script:
I attached to the collect-mono-trg task and tried the code there. It sorts incorrectly, which results in incorrect merging of the translated texts. We had this issue before and fixed it with this new code.
root@6d9ffc892259:~/fetches# find . -name '*.out.zst' | sort -t '.' -k2,2n
./file.10.out.zst
./file.11.out.zst
./file.12.out.zst
./file.13.out.zst
./file.14.out.zst
./file.15.out.zst
./file.16.out.zst
./file.17.out.zst
./file.18.out.zst
./file.19.out.zst
./file.1.out.zst
./file.20.out.zst
./file.2.out.zst
./file.3.out.zst
./file.4.out.zst
./file.5.out.zst
./file.6.out.zst
./file.7.out.zst
./file.8.out.zst
./file.9.out.zst
For the record, nothing changed about the OS - we use the exact same image as we did with fewer CPUs. The CPU type also didn't change. The only thing we did was add more CPUs and RAM. (If there's any code that parallelizes across cores I guess I could imagine this theoretically being able to influence behaviour.)
Ok, I actually tested it incorrectly. This gives me the correct sorting, so it's probably something else:
export chunks_dir=fetches
find "${chunks_dir}" -name '*.out.zst' | sort -t '.' -k2,2n
fetches/file.1.out.zst
fetches/file.2.out.zst
fetches/file.3.out.zst
fetches/file.4.out.zst
fetches/file.5.out.zst
fetches/file.6.out.zst
fetches/file.7.out.zst
fetches/file.8.out.zst
fetches/file.9.out.zst
fetches/file.10.out.zst
fetches/file.11.out.zst
fetches/file.12.out.zst
fetches/file.13.out.zst
fetches/file.14.out.zst
fetches/file.15.out.zst
fetches/file.16.out.zst
fetches/file.17.out.zst
fetches/file.18.out.zst
fetches/file.19.out.zst
fetches/file.20.out.zst
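A plausible explanation for the two different results (my assumption, not verified against the task): `find .` prefixes every path with `./`, so splitting on `.` puts `/file` in field 2 instead of the chunk number, and the numeric sort key degenerates to 0 for every line:

```shell
# With the "./" prefix from `find .`, field 2 (delimiter '.') is "/file",
# which sort -k2,2n treats as the same non-numeric key for every line:
echo './file.10.out.zst' | cut -d '.' -f 2        # prints "/file"
# Without the "./" prefix, field 2 is the chunk number, as intended:
echo 'fetches/file.10.out.zst' | cut -d '.' -f 2  # prints "10"
# So the numeric sort only works on the unprefixed paths:
printf '%s\n' fetches/file.10.out.zst fetches/file.2.out.zst | sort -t '.' -k2,2n
```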
But the fact is that the sentences in collect-mono-trg and merge-mono-trg are not properly aligned, so it's basically noise. This explains the behavior of the aligner and the teacher training.
Ok, I think I know what's going on. Let's look at the task graph of the failed task group: train-teacher and alignments-backtranslated depend on one merge-mono-trg task, but if we go from collect-mono-trg -> translate -> split-mono-trg, it depends on another merge-mono-trg task. This happens because our target mono dataset is the same for all xx-en runs and is shared across all task groups. The issue is that if we run several trainings simultaneously, we get multiple different mono en tasks, and I guess the latest overwrites the cache. Check out this funny graph.
It sounds like we should probably add the src+trg information to the cache resources in this case, e.g. a from-parameters section in https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/taskcluster/kinds/merge-mono/kind.yml#L25 similar to https://github.com/mozilla/firefox-translations-training/blob/fd2f7da7a47eaeb9dde92a47f250afb000edb465/taskcluster/kinds/translate-corpus/kind.yml#L38-L40.
Or we could make sure the src+trg is included in the label.
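A sketch of what that could look like in the merge-mono kind (illustrative only: the key names `src_locale`/`trg_locale` and the exact schema are my assumptions; the real schema should be copied from the linked translate-corpus kind):

```yaml
# Hypothetical sketch: add the language pair to the cache digest so
# concurrent xx-en runs stop sharing (and clobbering) one merge-mono entry.
cache:
  from-parameters:
    src_locale: training_config.experiment.src
    trg_locale: training_config.experiment.trg
```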
It looks like we previously had src+trg in the label, but it was removed in https://github.com/mozilla/firefox-translations-training/commit/be065b5465199984da48562ba17ebd585667b0eb#diff-ed01a886331a09bef4a6ef2bef2ca8ed8b721579d5cd7e33316b68a2f2b7a0e8, fwiw.
The idea was to reuse the cache for mono tasks, and it's working, but the issue here is that merge-mono is not deterministic because it includes data shuffling: the same inputs give you different results. Otherwise, this would work. Even if we add src and trg as cache keys, we could still hit this issue by running two uk-en train runs at the same time. Another workaround would be to use a fixed random seed to make the shuffling deterministic.
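As a sketch of that workaround (assuming GNU coreutils; this is the seeded-source recipe from the coreutils manual, not the pipeline's actual implementation): feeding `shuf` a reproducible byte stream as `--random-source` makes the shuffle a pure function of the seed and the input, so re-running the task yields byte-identical, cache-stable output.

```shell
# Deterministic shuffle: derive shuf's randomness from a fixed passphrase
# so the same input + seed always produces the same order.
get_seeded_random() {
  # AES-CTR keystream seeded by the passphrase: a reproducible byte stream.
  openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
}
seq 5 | shuf --random-source=<(get_seeded_random 'my-seed')
```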
I made shuffling deterministic, so I'm closing this.
I see a super weird major regression in teacher training quality after the fix of the Sacrebleu importer in the release branch.
The configs for uk-en were identical; moreover, sacrebleu is not used in the dev/test datasets, so there's no reason for it to regress. I see the same for the other restarted models: cs-en, el-en, ro-en, sk-en. COMET evals are ~0.5 (meaning garbage) for all of them.
- Proper training uk-en train action
- Regressed train action
- Proper training run
- Regressed run
@bhearsum @gabrielBusta I wonder if this could somehow be related to infrastructure. Did we make any changes at the infra level recently that might affect the release branch?