Closed by vrigal 5 months ago
I applied the changes (to resume an existing run on W&B) and tested here: https://wandb.ai/teklia/test/runs/tv2vtdha
We can just pass the artifacts dir to the script: parse_tc_logs --from-stream -v --wandb-artifacts ${model_dir}
It's probably worth opening an issue for this. W&B seems to sync (override) logs & artifacts automatically (https://wandb.ai/teklia/test/groups/test/files/output.log).
One complication here is that we continue training from the last checkpoint, so there will be a small overlap with previously written data.
W&B should handle this automatically, as it ignores data with a step lower than the last step written. I think it should simply trigger some warnings from the W&B client.
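For illustration, the overlap handling described above can be approximated with a small filter. This is only a sketch of the behaviour as described in this thread, not W&B's actual implementation; drop_overlap and the record layout are hypothetical.

```python
from typing import Dict, Iterable, List


def drop_overlap(records: Iterable[Dict], last_step: int) -> List[Dict]:
    """Keep only records whose step is not lower than the last step already written."""
    return [r for r in records if r["step"] >= last_step]


# Training resumes from the checkpoint at step 300, but step 400 was already logged,
# so the record for step 300 is part of the overlap.
replayed = [
    {"step": 300, "loss": 1.2},
    {"step": 400, "loss": 1.1},
    {"step": 500, "loss": 1.0},
]
kept = drop_overlap(replayed, last_step=400)
```

Only the records at or after the last written step survive, which matches the "small overlap is dropped" expectation in the comment above.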
@vrigal one thing I don't understand: who sets the "RUN_ID" env variable?
This is set automatically by generic-worker before the task starts: https://github.com/taskcluster/taskcluster/blob/a85c8b9f7be096f6b9a4bad38612374b9a702372/workers/generic-worker/multiuser_posix.go#L146-L148
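As a rough sketch of how a script could use that variable to detect a restarted task: the helper names below are hypothetical, and defaulting to 0 when RUN_ID is unset is an assumption made for local runs outside Taskcluster.

```python
import os


def taskcluster_run_id(env=None) -> int:
    """Return the run number from RUN_ID, defaulting to 0 when it is unset (assumption)."""
    env = os.environ if env is None else env
    return int(env.get("RUN_ID", "0"))


def is_rerun(env=None) -> bool:
    """A run number above 0 means the same task has been restarted at least once."""
    return taskcluster_run_id(env) > 0
```

Passing a plain dict instead of os.environ, e.g. is_rerun({"RUN_ID": "1"}), makes the check easy to exercise without a worker.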
Why don't we just use resume="allow", so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster, except a restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments, to prevent republishing the same ones. In that case, the publisher should accept an argument, set from the CLI, that indicates what to do if the run already exists.
This is an old issue. The simpler way to handle this would be to use <group>-<model> as a unique ID in W&B.
It should be possible to keep resume="allow" then. It would be compatible with the override option and would probably work in most cases (W&B drops overlapping data, as you mentioned), but it requires some significant changes in the code.
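A minimal sketch of that idea, assuming the W&B Python client's wandb.init with its id, name, group, project and resume arguments. The helpers make_run_id, init_kwargs and start_run are hypothetical, and the character sanitising is a conservative assumption, not something taken from this repository.

```python
import re


def make_run_id(group: str, model: str) -> str:
    """Build a deterministic run ID of the form <group>-<model>."""
    raw = f"{group}-{model}"
    # Conservative sanitising (an assumption): keep letters, digits, '_', '.' and '-'.
    return re.sub(r"[^A-Za-z0-9_.-]", "_", raw)


def init_kwargs(project: str, group: str, model: str) -> dict:
    """Arguments for wandb.init: a stable id for resuming, a readable name for display."""
    return {
        "project": project,
        "group": group,
        "name": model,
        "id": make_run_id(group, model),
        "resume": "allow",  # continue the run if this ID already exists, else create it
    }


def start_run(project: str, group: str, model: str):
    # Imported lazily so the helpers above can be used without the client installed.
    import wandb

    return wandb.init(**init_kwargs(project, group, model))
```

Because the ID is derived only from the group and the model name, a restarted task would compute the same ID and pick up the existing run, while the displayed name stays free of generated identifiers.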
We can rethink all that in #408, but I would use UIDs in model names only as a last resort, because they would clutter the dashboards.
@eu9ene to be clear, the run ID (used to identify a run) is different from the run name (used to display graphs). For now we do not set an ID; it is generated automatically by W&B (e.g. brmhnekj), which guarantees uniqueness. But if there is a unique way to identify a run (e.g. <group>-<model_name>), it could help with deleting (overriding) or resuming a run. It can be a separate issue from #408. I'll write an issue for this.
Closes #594