Closed · ofivite closed this 3 weeks ago
@ofivite can you elaborate on why this is necessary? I'm not quite sure I understand how it has a race condition since only rank 0 should be writing this file
sure, it's not really the ranks that write into the same file, but different slurm jobs that happen to start running at the same time on the cluster, e.g. in our case via an array of tasks from `sbatch --array=0-31`
For a more general solution, should we instead append some random 6-digit SHA? We could then replace the slurm ID and node rank, and future-proof it
In fact, the same situation apparently also happens in streaming when creating a shared memory here. I was just thinking of opening a similar fix PR there too :)
> For a more general solution, should we instead append some random 6-digit SHA? We could then replace the slurm ID and node rank, and future-proof it
yep, think it's better to generalise it that way!
Although I'm not quite sure, will each slurm task / rank have its own unique SHA in that case?
Hm I guess this doesn't work since you need each node to have the same path...
I guess your original solution is probably best 🤔. @dakinggg any other ideas?
@ofivite Would attaching the run name as the unique id resolve your issue? It's technically not foolproof, because nothing forces two runs to use different run names, but probably still an improvement? The issue with the slurm id is that it would only work for slurm (although maybe this is also still an improvement)
@dakinggg that would resolve the issue indeed. :)
I would still argue for including some other form of identifier in the name so that other people don't randomly run into this issue in the future. There's still a potential race condition on systems that only initialize random seeds with system time, but if we ignore that, the name could simply be broadcast from rank 0 to the connected processes using `torch.distributed`.

Note that if the broadcasting version is implemented, the current code would then use the same path across all nodes (`dist.get_node_rank()` is not included in the path anymore), which does not agree with the previous version.
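For illustration, a minimal sketch of that broadcast idea using plain `torch.distributed`, assuming the process group is already initialized; the function name is made up, not an existing API:

```python
import uuid

import torch.distributed as dist


def broadcast_random_suffix() -> str:
    """Draw a short random id on rank 0 and share it with all other ranks."""
    # Only rank 0 generates the suffix; the other ranks receive it via the
    # broadcast, so every process ends up with the same value.
    suffix = [uuid.uuid4().hex[:6]] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(suffix, src=0)
    return suffix[0]
```

Since every rank, including those on other nodes, receives the same suffix, the resulting path would indeed be shared across nodes; a per-node component like `dist.get_node_rank()` would still have to be appended if per-node files are wanted.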
Ok I think the right solution is to implement a utility in composer that gets a signal file name + a random id that gets broadcast from rank 0. That way it will be unique per usage. Would be happy to accept a contribution for this!
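A rough sketch of what such a utility might look like, using composer's dist helpers; the function name and the exact layout of the file name are assumptions, not what the linked PR actually implements:

```python
import uuid

from composer.utils import dist


def get_unique_signal_file_name(base_name: str) -> str:
    """Sketch: a signal file name that is unique per usage and agreed on by all ranks.

    Rank 0 draws a short random id and broadcasts it so every rank builds the
    same path; the node rank is kept in the name so each node still gets its
    own file, as in the original code.
    """
    suffix = [uuid.uuid4().hex[:6]] if dist.get_global_rank() == 0 else [None]
    dist.broadcast_object_list(suffix, src=0)
    return f'{base_name}.node_{dist.get_node_rank()}.{suffix[0]}'
```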
FYI, started working on this here: https://github.com/mosaicml/composer/pull/3396. I'm gonna go ahead and close this PR, thanks for raising the issue!
In the case of using slurm with the `--array` option, it sometimes happens that multiple tasks start running ~simultaneously. Since the filename of the lock doesn't take this parallelism into account, the tasks start creating/writing/deleting the same file and create a race condition. Appending `SLURM_JOB_ID` to the filename seems to fix it.
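For illustration, a minimal sketch of that workaround, assuming the lock file name is built in Python (the base file name here is made up):

```python
import os

# Each task in an sbatch --array submission runs as its own job with its own
# SLURM_JOB_ID, so appending it keeps simultaneously started jobs from
# creating/writing/deleting the same lock file.
slurm_job_id = os.environ.get('SLURM_JOB_ID', '0')
lock_file_path = f'.local_rank0_completed_data_prep_{slurm_job_id}'
```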