mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Stalled metrics are not reported for backward model #846

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

The bleu/chrf/ce-mean-words_stalled metrics are supposed to be on the online training dashboards but they are missing.

https://wandb.ai/moz-translations/zh-en/runs/backwards_E8S5g?nw=nwuserepavlov

I do see the messages in the log: https://firefox-ci-tc.services.mozilla.com/tasks/Q5CG-D5kQaqZQGYukCsrcg/runs/1/logs/live/public/logs/live.log

[task 2024-09-13T16:39:03.768Z] [2024-09-13 16:39:03] [valid] Ep. 1 : Up. 360000 : chrf : 31.403 : stalled 2 times (last best: 31.8198)
[task 2024-09-13T16:39:05.636Z] [2024-09-13 16:39:05] [valid] Ep. 1 : Up. 360000 : ce-mean-words : 2.46139 : stalled 2 times (last best: 2.45698)
[task 2024-09-13T16:39:19.532Z] [2024-09-13 16:39:19] [valid] Ep. 1 : Up. 360000 : bleu-detok : 35.3306 : stalled 2 times (last best: 35.4744)
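
For reference, a minimal sketch of how the stalled count could be pulled out of lines like the ones above. This is not the parser used by the pipeline; the regex and the function name `parse_stalled` are hypothetical and the format is inferred only from these three log lines.

```python
import re

# Hypothetical pattern, inferred from the three [valid] lines quoted above.
VALID_STALLED_RE = re.compile(
    r"\[valid\] Ep\. (?P<epoch>\d+) : Up\. (?P<update>\d+) : "
    r"(?P<metric>[\w-]+) : (?P<value>-?[\d.]+) : "
    r"stalled (?P<stalled>\d+) times? \(last best: (?P<best>-?[\d.]+)\)"
)


def parse_stalled(line):
    """Return the stalled info of a validation log line, or None if it does not match."""
    match = VALID_STALLED_RE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "update": int(match["update"]),
        "metric": match["metric"],
        "value": float(match["value"]),
        "stalled": int(match["stalled"]),
        "last_best": float(match["best"]),
    }


line = (
    "[task 2024-09-13T16:39:03.768Z] [2024-09-13 16:39:03] [valid] Ep. 1 : "
    "Up. 360000 : chrf : 31.403 : stalled 2 times (last best: 31.8198)"
)
print(parse_stalled(line))
```

With the first log line above, this gives metric `chrf`, update 360000 and a stalled count of 2, which is the value expected as chrf_stalled on the dashboard.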

vrigal commented 1 month ago

I was not able to reproduce locally (I always have the stalled metrics published, from group, file and stream as in the CI).

But this looks like a different kind of issue: most of the data has been lost for chrf, bleu_detok and ce_mean_words.

I can only see the first 2 steps on the dashboard (step 10,000 and step 20,000), which do not contain stalled values yet (this is normal: chrf_stalled only starts being logged from step 80,000). There should be values up to step 390,000 for those metrics (one every 10,000 steps). You can compare with the experiment I published from the logs: https://wandb.ai/teklia/zh-en/runs/backwards_E8S5g.
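
To make the size of the gap concrete, here is a small sketch that lists the validation steps missing from the dashboard. It assumes one validation every 10,000 updates up to step 390,000, as described above; the helper name `missing_steps` is hypothetical.

```python
def missing_steps(seen, first=10_000, last=390_000, every=10_000):
    """Return the expected validation steps that are absent from `seen`."""
    seen = set(seen)
    return [step for step in range(first, last + 1, every) if step not in seen]


# Only steps 10,000 and 20,000 are visible on the dashboard.
missing = missing_steps([10_000, 20_000])
print(len(missing), missing[0], missing[-1])  # 37 30000 390000
```

So 37 of the 39 expected validation points are absent, including every step at which a stalled value could have appeared.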

I would suspect an error while publishing results to Weights & Biases, but I was never able to reproduce such a case.

eu9ene commented 1 month ago

As discussed, the issue is that metrics stopped being logged after the task restarted.