mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Busy wait utils in dist #3396

Closed dakinggg closed 3 months ago

dakinggg commented 3 months ago

What does this PR do?

This PR adds utils for busy waiting on a node using a signal file lock. The main difference from the current approach is that the signal file has a randomly generated identifier appended to it, which better supports use cases where multiple runs share a file system and could otherwise interfere with each other. Future PRs will replace the existing signal file uses in Composer (and, after a release, in Foundry) with these utils; this PR contains only the util implementation to keep PRs small.
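For readers unfamiliar with the pattern, here is a minimal single-machine sketch of busy waiting on a signal file whose name carries a random identifier. It is not the code added in this PR: the function names (`make_signal_path`, `write_signal_file`, `busy_wait_for_signal_file`), the `.signal_file_<id>` naming scheme, and the polling and timeout parameters are all assumptions made for illustration, and a thread stands in for the process that does the work.

```python
import os
import tempfile
import threading
import time
import uuid


def make_signal_path(directory: str, run_id: str) -> str:
    """Build a signal file path unique to one run (hypothetical naming scheme)."""
    return os.path.join(directory, f'.signal_file_{run_id}')


def write_signal_file(path: str) -> None:
    """Create the signal file to tell waiting processes that the work is done."""
    with open(path, 'w') as f:
        f.write('done')


def busy_wait_for_signal_file(path: str, poll_interval: float = 0.05, timeout: float = 10.0) -> None:
    """Poll until the signal file exists; raise TimeoutError if it never appears."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f'Timed out waiting for {path}')
        time.sleep(poll_interval)


def demo() -> bool:
    """Simulate one writer and one waiter that share a directory with another run."""
    with tempfile.TemporaryDirectory() as shared_dir:
        # Each run draws its own identifier, so signal files never collide by name.
        path = make_signal_path(shared_dir, uuid.uuid4().hex[:8])

        # A signal file left behind by a different run on the same file system.
        other_path = make_signal_path(shared_dir, uuid.uuid4().hex[:8])
        write_signal_file(other_path)

        # The "worker" writes this run's signal file after a short delay.
        writer = threading.Timer(0.2, write_signal_file, args=(path,))
        writer.start()

        # The waiter ignores the other run's file and only returns once its own exists.
        busy_wait_for_signal_file(path)
        writer.join()
        return os.path.exists(path) and path != other_path


if __name__ == '__main__':
    print(demo())
```

With a fixed file name, the file left by the other run would have released the waiter immediately. In a real multi-process job the identifier would also have to be shared among the cooperating ranks before anyone waits on it; that step is omitted from this sketch.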

What issue(s) does this change relate to?

Related to https://github.com/mosaicml/llm-foundry/pull/1253

Before submitting