Open eu9ene opened 1 month ago
I was not able to reproduce this locally (I always get the stalled
metrics published, from group, file and stream, as in the CI).
But this looks like a different kind of issue: most of the data has been lost for chrf, bleu_detok and ce_mean_words.
I can only see the first 2 steps on the dashboard (step 10.000 and step 20.000), which do not contain stalled values yet (this is normal, they only start being logged from step 80.000 for chrf_stalled). There should be values up to step 390.000 for those (one every 10.000 steps). You can compare with the experiment I published from logs: https://wandb.ai/teklia/zh-en/runs/backwards_E8S5g.
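To make the gap concrete, here is a minimal sketch that lists which validation steps are missing, assuming one value is expected every 10.000 steps from step 10.000 up to step 390.000 as described above. The `missing_steps` helper and the `dashboard_steps` list are hypothetical and only for illustration; they are not part of the tracking code.

```python
def missing_steps(seen, first=10_000, last=390_000, interval=10_000):
    """Return the expected validation steps absent from `seen`, in ascending order.

    The defaults mirror the numbers quoted in this thread and are assumptions,
    not values read from the training configuration.
    """
    seen = set(seen)
    expected = range(first, last + 1, interval)
    return [step for step in expected if step not in seen]


# Steps currently visible on the dashboard for chrf, per the comment above.
dashboard_steps = [10_000, 20_000]

gaps = missing_steps(dashboard_steps)
print(f"{len(gaps)} missing steps, from {gaps[0]} to {gaps[-1]}")
```

Running the same check against the steps exported from each of the two runs would show exactly where the published metrics stop.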
I would suspect an error when publishing results to Weights & Biases, but I was never able to reproduce such a case.
As discussed, the issue is that metrics stopped being logged after the task restart.
The bleu/chrf/ce-mean-words_stalled metrics are supposed to be on the online training dashboards but they are missing.
https://wandb.ai/moz-translations/zh-en/runs/backwards_E8S5g?nw=nwuserepavlov
I do see the messages in the log: https://firefox-ci-tc.services.mozilla.com/tasks/Q5CG-D5kQaqZQGYukCsrcg/runs/1/logs/live/public/logs/live.log