elistevens opened this issue 2 months ago (status: Open)
@elistevens Are you running multiple Tune jobs concurrently, and are they launched at the same time? Could you try specifying a unique RunConfig(name) for each new job?
"Are you running multiple Tune jobs concurrently"

To avoid potential miscommunication, I'll be specific: I'm only invoking the program once at a time, and the program only calls ray.tune.Tuner once. The thing being tuned is a GPU-based job (training MNIST, requesting 4 cores and 1 GPU) that only runs one at a time on my local system (8 cores, 1 GPU).

I get the impression that there are some lightweight tuner processes that get kicked off, which do some setup work but are mostly idle while waiting for their turn to start a training run. I think these run in parallel when max_concurrent_trials isn't 1, but I haven't really confirmed that.

I don't change the RunConfig(name) per separate invocation of the program. Do I need to, if the sequential invocations aren't running at the same time?
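For reference, a minimal sketch of the settings under discussion; the trainable, experiment name, sample count, and concurrency value are illustrative placeholders, not taken from the actual script:

from ray import train, tune

def train_mnist(config):
    # hypothetical trainable standing in for the real MNIST training function
    ...

# request 4 CPUs and 1 GPU per trial, as described above
trainable = tune.with_resources(train_mnist, {"cpu": 4, "gpu": 1})

tuner = tune.Tuner(
    trainable,
    tune_config=tune.TuneConfig(num_samples=20, max_concurrent_trials=2),
    run_config=train.RunConfig(name="mnist_tune"),  # same name reused across invocations
)
results = tuner.fit()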
I have also run into this. See also https://github.com/ray-project/ray/issues/41362.

We are using Ray Tune on a cluster with max_concurrent_trials=100 and run into this occasionally. We are working around it by disabling checkpointing, as we don't need that feature.
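The comment above doesn't say exactly how checkpointing was disabled; as a rough approximation, one documented knob is the interval for the driver's periodic experiment-state snapshots, which are what go through _atomic_save (treating a very large interval as "disabled" is an assumption):

import os

# Assumption: a very large interval effectively stops the periodic driver-side
# searcher/experiment-state saves; set this before constructing the Tuner.
os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "1000000"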
But the root of the issue is that _atomic_save is not thread-safe or multiprocess-safe, yet it is being used in what appears to be a multithreaded context (checkpointing after multiple trials complete).
import os
import cloudpickle

# excerpt from Ray Tune's _atomic_save; state, checkpoint_dir, file_name, and
# tmp_file_name are its arguments
tmp_search_ckpt_path = os.path.join(checkpoint_dir, tmp_file_name)
with open(tmp_search_ckpt_path, "wb") as f:
    cloudpickle.dump(state, f)
# another thread from the same driver program jumps in and calls `os.replace` first
os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))  # now this line fails
The obvious candidate solutions are a lockfile around the save, or randomized per-call temp filenames.
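As an illustration only (not Ray's actual implementation, and the helper name is made up), a minimal sketch of the randomized-filename variant:

import os
import uuid
import cloudpickle

def _atomic_save_unique_tmp(state, checkpoint_dir, file_name):
    # Each caller writes to its own temp file, so concurrent saves from the same
    # driver can no longer rename each other's temp file out from under os.replace().
    tmp_path = os.path.join(checkpoint_dir, f".tmp_{uuid.uuid4().hex}")
    with open(tmp_path, "wb") as f:
        cloudpickle.dump(state, f)
    # os.replace() is atomic on the same filesystem, so readers always see a complete file.
    os.replace(tmp_path, os.path.join(checkpoint_dir, file_name))

A lockfile (e.g. via the filelock package) around the write-and-rename would also serialize concurrent savers, at the cost of an extra dependency and some blocking.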
What happened + What you expected to happen
There seems to be some sort of race condition when using Ray Tune. I get the following two errors intermittently when running a Ray Tune search on my local system with tune_config.max_concurrent_trials > 1. The problem does not occur when setting max_concurrent_trials = 1. Here are the two stack traces produced:
I tried overriding the _atomic_save function to use a more specific file name than .tmp_generator, but that did not meaningfully change the behavior. My next guess would be that a directory is being removed, but I have no evidence of that. The cluster logs that the errors mention don't contain additional relevant information, nor did RAY_record_ref_creation_sites=1 offer additional info.

Versions / Dependencies
Python 3.10.14; Ray 2.10 and 2.37
Reproduction script
I don't have a minimal repro script.
It's possible that part of the issue is that my Ray trials are either under-specified, or cheap enough that Ray tries to do "too much" on the local node at once. I would see multiple Tune trials running at the same time, even though only one training job fits on the GPU.
Issue Severity
Medium: It is a significant difficulty but I can work around it.