mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Error on preemption restart: Not Found model.npz.optimizer.npz #667

Closed: eu9ene closed this issue 3 months ago

eu9ene commented 3 months ago

https://firefox-ci-tc.services.mozilla.com/tasks/EdZxplswTGWoT5WE6er2CA/runs/10/logs/public/logs/live.log

[task 2024-06-07T00:30:17.388Z] + export PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
[task 2024-06-07T00:30:17.388Z] + PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
[task 2024-06-07T00:30:17.388Z] + export MARIAN=/home/ubuntu/tasks/task_171771988499916/fetches
[task 2024-06-07T00:30:17.388Z] + MARIAN=/home/ubuntu/tasks/task_171771988499916/fetches
[task 2024-06-07T00:30:17.388Z] + ./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py teacher train lt en /home/ubuntu/tasks/task_171771988499916/fetches/corpus,/home/ubuntu/tasks/task_171771988499916/fetches/mono /home/ubuntu/tasks/task_171771988499916/fetches/devset /home/ubuntu/tasks/task_171771988499916/artifacts chrf /home/ubuntu/tasks/task_171771988499916/fetches/corpus.aln.zst,/home/ubuntu/tasks/task_171771988499916/fetches/mono.aln.zst 1 two-stage None None --early-stopping 20
[task 2024-06-07T00:30:17.479Z] /usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (4.0.0) doesn't match a supported version!
[task 2024-06-07T00:30:17.479Z]   warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
[task 2024-06-07T00:30:17.500Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-06-07T00:30:17.573Z] INFO:root:Run 9 is missing some necessary artifacts...
[task 2024-06-07T00:30:17.644Z] INFO:root:Run 8 is missing some necessary artifacts...
[task 2024-06-07T00:30:17.878Z] INFO:root:Run 7 is missing some necessary artifacts...
[task 2024-06-07T00:30:17.951Z] INFO:root:Run 6 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.020Z] INFO:root:Run 5 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.087Z] INFO:root:Run 4 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.156Z] INFO:root:Run 3 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.225Z] INFO:root:Run 2 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.297Z] INFO:root:Run 1 is missing some necessary artifacts...
[task 2024-06-07T00:30:18.366Z] INFO:root:Run 0 appears to have the artifacts we need! Downloading them...
[task 2024-06-07T00:30:18.366Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-06-07T00:30:18.576Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-06-07T00:30:18.776Z] INFO:root:Fetching public/build/devset.out...
[task 2024-06-07T00:30:19.074Z] INFO:root:Fetching public/build/model.npz...
[task 2024-06-07T00:30:25.964Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-06-07T00:30:28.756Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-06-07T00:30:28.907Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-06-07T00:30:36.905Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-06-07T00:30:37.026Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-06-07T00:30:44.075Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-06-07T00:30:44.224Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-06-07T00:30:44.373Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-06-07T00:30:44.510Z] ERROR:root:Caught exception, exiting with distinct code...
[task 2024-06-07T00:30:44.510Z] Traceback (most recent call last):
[task 2024-06-07T00:30:44.510Z]   File "/home/ubuntu/tasks/task_171771988499916/./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py", line 103, in main
[task 2024-06-07T00:30:44.510Z]     r.raise_for_status()
[task 2024-06-07T00:30:44.510Z]   File "/usr/lib/python3/dist-packages/requests/models.py", line 943, in raise_for_status
[task 2024-06-07T00:30:44.510Z]     raise HTTPError(http_error_msg, response=self)
[task 2024-06-07T00:30:44.510Z] requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://firefoxci.taskcluster-artifacts.net/EdZxplswTGWoT5WE6er2CA/0/public/build/model.npz.optimizer.npz
[fetches 2024-06-07T00:30:44.528Z] removing /home/ubuntu/tasks/task_171771988499916/fetches
[fetches 2024-06-07T00:30:49.007Z] finished
[taskcluster 2024-06-07T00:30:49.017Z]    Exit Code: 17
[taskcluster 2024-06-07T00:30:49.017Z]    User Time: 2m23.91347s
[taskcluster 2024-06-07T00:30:49.017Z]  Kernel Time: 1m14.969402s
[taskcluster 2024-06-07T00:30:49.017Z]    Wall Time: 6m3.173982489s
[taskcluster 2024-06-07T00:30:49.017Z]       Result: FAILED
[taskcluster 2024-06-07T00:30:49.017Z] === Task Finished ===

bhearsum commented 3 months ago

It almost looks as if run #0 didn't manage to upload all of its artifacts. Although model.npz.optimizer.npz is listed among that run's artifacts, fetching it most certainly returns a 404.

We should probably avoid failing on such a 404, or on any other problem retrieving artifacts from a previous run. We can always try again from even earlier runs, and even if all of those are missing artifacts (or we fail to fetch them for some reason), we can always start from scratch.
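
A minimal sketch of that fallback, not the actual change in train_taskcluster.py: the function name `resume_from_earlier_run` and the injected `fetch_run_artifacts` callable are hypothetical, and the sketch assumes that any exception raised while fetching (for example the `requests.exceptions.HTTPError` in the log above) means the run cannot be used.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def resume_from_earlier_run(
    run_id: int,
    fetch_run_artifacts: Callable[[int], None],
) -> Optional[int]:
    """Try runs run_id-1 down to 0 and return the first one that can be fetched.

    `fetch_run_artifacts` is a hypothetical callable that downloads every
    artifact of the given run and raises if any download fails.
    Returns None when no earlier run is usable, meaning "start from scratch".
    """
    for prev_run in range(run_id - 1, -1, -1):
        try:
            fetch_run_artifacts(prev_run)
        except Exception as exc:  # e.g. an HTTPError for a 404 on one artifact
            logger.warning(
                "Run %d could not be used (%s); trying an earlier run",
                prev_run,
                exc,
            )
            continue
        return prev_run

    logger.info("No earlier run is usable; starting training from scratch")
    return None
```

With this shape, a 404 on a single artifact only disqualifies that run rather than failing the task. A real implementation would presumably also need to discard any files already downloaded from a rejected run before moving on, so that artifacts from different runs are not mixed.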

At the moment, any 404 causes the task to fail with no way to even unhork the graph :(

bhearsum commented 3 months ago

https://github.com/mozilla/firefox-translations-training/pull/671 should fix this, and applies cleanly to release.

bhearsum commented 3 months ago

Note that no fix will rescue the existing task. I think you'll need to start a new train action with previous_group_ids and start-stage to get this moving again.