Override W&B data on a resumed training

mozilla / translations

The code, training pipeline, and models that power Firefox Translations

https://mozilla.github.io/translations/

Mozilla Public License 2.0

154 stars 33 forks source link

Override W&B data on a resumed training #595

Closed vrigal closed 5 months ago

vrigal commented 5 months ago

Closes #594

vrigal commented 5 months ago

I applied the changes (to resume an existing run on W&B) and tested here: https://wandb.ai/teklia/test/runs/tv2vtdha

we can just pass the artifacts dir to the script parse_tc_logs --from-stream -v --wandb-artifacts ${model_dir}.

Its probably worth opening an issue for this. W&B seems to sync (override) logs & artifacts automatically. (https://wandb.ai/teklia/test/groups/test/files/output.log).

One complication here is that we continue training from the last checkpoint so there will be small overlap with previously written data

W&B should handle this automatically, as it ignores data with a step inferior the last data written. I think it should simply trigger some warnings from the W&B client.

eu9ene commented 5 months ago

@vrigal one thing I don't understand is who sets "RUN_ID" env?

bhearsum commented 5 months ago

@vrigal one thing I don't understand is who sets "RUN_ID" env?

This is set automatically by generic-worker before the task starts: https://github.com/taskcluster/taskcluster/blob/a85c8b9f7be096f6b9a4bad38612374b9a702372/workers/generic-worker/multiuser_posix.go#L146-L148

vrigal commented 5 months ago

why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by a cli that indicates what to do if the run already exists.

This is an old issue. The simpler way to handle this would be to use <group>-<model> as a unique ID in W&B. It should be possible to keep resume="allow" then. It would be compatible with the override option and would probably work in most case (W&B drops overlapped data, as you mentioned), but requires some important changes in the code.

eu9ene commented 5 months ago

why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by a cli that indicates what to do if the run already exists.

This is an old issue. The simpler way to handle this would be to use <group>-<model> as a unique ID in W&B. It should be possible to keep resume="allow" then. It would be compatible with the override option and would probably work in most case (W&B drops overlapped data, as you mentioned), but requires some important changes in the code.

We can rethink all that in #408, but I would use UID in model names as a last resort because they would clutter the dashboards.

vrigal commented 5 months ago

@eu9ene to be clear, run ID (used to identify a run) is different that run name (used to display graphs). For now we do not use an ID, it is automatically set by W&B (e.g. brmhnekj) which guarantees unicity. But if there is a unique way to identify a run (e.g. <group>-<model_name>) it could help with deletion(override) or resuming a run. It can be a separate issue than #408. I'll write an issue for this.