What does this PR do?
This PR adds utilities for busy-waiting on a node using a signal file as a lock. The main difference from the current approach is that a randomly generated identifier is appended to the signal file name, which better supports multiple runs on a shared file system that could otherwise interfere with each other. Future PRs will replace the existing signal file usage in Composer (and, after a release, in Foundry) with these utilities; this PR contains only the utility implementation to keep PRs small.
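To illustrate the idea, here is a minimal single-machine sketch of a signal file with a random identifier. It is not the code added in this PR: the names `make_signal_file_path`, `write_signal_file`, and `busy_wait_for_file` are hypothetical, and the sketch assumes the generated path is handed directly to every waiter. In a real multi-process run, the identifier would have to be shared with the other ranks on the node by some other means.

```python
import os
import time
import uuid


def make_signal_file_path(directory: str, prefix: str = ".signal_file") -> str:
    """Hypothetical helper: build a signal file path with a random suffix.

    The random suffix keeps concurrent runs that share a file system
    from reading each other's signal files.
    """
    run_id = uuid.uuid4().hex[:8]
    return os.path.join(directory, f"{prefix}_{run_id}")


def write_signal_file(path: str) -> None:
    """Hypothetical helper: create the signal file to release waiters."""
    with open(path, "w") as f:
        f.write("done")


def busy_wait_for_file(path: str, poll_interval: float = 0.01, timeout: float = 60.0) -> None:
    """Hypothetical helper: poll until the signal file exists or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"Signal file {path} did not appear within {timeout}s")
        time.sleep(poll_interval)
```

In this sketch, one process would do its work (for example, downloading a shared artifact) and then call `write_signal_file`, while the other processes call `busy_wait_for_file` with the same path. Removing the signal file once all waiters have passed is left out for brevity.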
What issue(s) does this change relate to?
Related to https://github.com/mosaicml/llm-foundry/pull/1253
Before submitting
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)