mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

fix signal_file_path to avoid race condition #1253

Closed ofivite closed 3 weeks ago

ofivite commented 1 month ago

When using Slurm with the --array option, multiple tasks sometimes start running almost simultaneously. Since the lock filename doesn't account for this parallelism, the tasks end up creating/writing/deleting the same file, causing a race condition. Appending SLURM_JOB_ID to the filename seems to fix it.
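A minimal sketch of the proposed change (the exact base filename used in llm-foundry may differ):

```python
import os

from composer.utils import dist

# Sketch only: append SLURM_JOB_ID (unique per array task) so that jobs
# launched simultaneously on a shared filesystem do not create, write, and
# delete the same signal file.
slurm_job_id = os.environ.get('SLURM_JOB_ID', 'local')
signal_file_path = f'.node_{dist.get_node_rank()}_local_rank0_completed_{slurm_job_id}'
```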

ofivite commented 1 month ago

> @ofivite can you elaborate on why this is necessary? I'm not quite sure I understand how it has a race condition since only rank 0 should be writing this file

Sure, it's not really the ranks that write to the same file, but different Slurm jobs that happen to start running at the same time on the cluster, e.g. in our case via an array of tasks from sbatch --array=0-31.

mvpatel2000 commented 1 month ago

For a more general solution, should we instead append some random 6-digit SHA? We could then replace the Slurm ID and node rank, and future-proof it.
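A sketch of that idea, reading "6-digit SHA" loosely as a short random hex string (names here are hypothetical):

```python
import secrets

# Sketch: a short random suffix instead of SLURM_JOB_ID / node rank.
# Caveat raised below: each process that runs this independently would get a
# different value, so the suffix has to be generated once and shared
# (e.g. broadcast from rank 0) to serve as a common path.
unique_suffix = secrets.token_hex(3)  # 6 hex digits, e.g. 'a1b2c3'
signal_file_path = f'.local_rank0_completed_{unique_suffix}'
```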

ofivite commented 1 month ago

In fact, the same situation apparently also happens in streaming when creating shared memory here. I was just thinking of opening a similar fix PR there too :)

ofivite commented 1 month ago

> For a more general solution, should we instead append some random 6-digit SHA? We could then replace the Slurm ID and node rank, and future-proof it.

yep, I think it's better to generalise it that way!

ofivite commented 1 month ago

Although I'm not quite sure: will each Slurm task / rank get its own unique SHA in that case?

mvpatel2000 commented 1 month ago

Hm I guess this doesn't work since you need each node to have the same path...

I guess your original solution is probably best 🤔. @dakinggg any other ideas?

dakinggg commented 1 month ago

@ofivite Would attaching the run name as the unique id resolve your issue? It's technically not foolproof, because nothing forces two runs to use different run names, but probably still an improvement? The issue with the Slurm ID is that it would only work for Slurm (although maybe this is also still an improvement).

janEbert commented 1 month ago

@dakinggg that would resolve the issue indeed. :)

I would still argue for including some other form of identifier in the name so that, in the future, other people do not run into this issue. There's still a potential race condition on systems that only initialize random seeds with system time, but if we ignore that, the name could simply be broadcast from rank 0 to the connected processes using torch.distributed.
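A minimal sketch of that broadcast, assuming the default process group is already initialized (names are hypothetical):

```python
import secrets

import torch.distributed as torch_dist

# Rank 0 generates the unique suffix; broadcast_object_list then makes the
# same value available on every rank across all nodes, so they agree on one
# signal file path.
suffix = [secrets.token_hex(3) if torch_dist.get_rank() == 0 else None]
torch_dist.broadcast_object_list(suffix, src=0)
signal_file_path = f'.local_rank0_completed_{suffix[0]}'
```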

janEbert commented 1 month ago

Note that if the broadcasting version is implemented, the current code would then use the same path across all nodes (dist.get_node_rank() would no longer be included in the path), which does not match the previous behavior.

dakinggg commented 1 month ago

Ok I think the right solution is to implement a utility in composer that gets a signal file name + a random id that gets broadcast from rank 0. That way it will be unique per usage. Would be happy to accept a contribution for this!
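One possible shape for such a utility (hypothetical name and signature, assuming Composer's dist wrappers mirror torch.distributed here; the actual implementation in the PR linked below may differ):

```python
import secrets

from composer.utils import dist


def get_unique_signal_file_name(base: str) -> str:
    """Return ``base`` plus a random id generated on global rank 0 and
    broadcast to all ranks, so every rank sees the same per-usage name."""
    ids = [secrets.token_hex(3) if dist.get_global_rank() == 0 else None]
    dist.broadcast_object_list(ids, src=0)
    return f'{base}_{ids[0]}'
```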

dakinggg commented 3 weeks ago

FYI, started working on this here: https://github.com/mosaicml/composer/pull/3396. I'm gonna go ahead and close this PR, thanks for raising the issue!