Open · YanivO1123 opened this issue 1 year ago
Have you found a solution or workaround? I am struggling with exactly the same thing!
Yes, through some experimentation we found the following:
Those are some pretty severe limitations. I don't think that will work in my case, as I am launching 10 to 20 jobs that each initialize a local Ray cluster on the same node. At first it seemed to work, but then something went wrong and I managed to slow down the whole cluster, even nodes that were not being used. I suspect this has something to do with the many threads that Ray spawns, because I am already hitting the thread limit. I will investigate a bit more to see if I can find another workaround.
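One way to test the thread-limit hypothesis before digging into Ray itself is to compare the per-user process/thread limit against the number of threads the jobs actually spawn. A minimal sketch, assuming a Linux node (the `ps`-based count is approximate, not something from this thread):

```python
import getpass
import resource
import subprocess

# "Resource temporarily unavailable" (EAGAIN) when spawning threads usually
# means the per-user process/thread limit (RLIMIT_NPROC) has been reached.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")

# Count the threads currently owned by this user. `ps -eLf` prints one line
# per thread (LWP); the first column is the owner. Linux-only, and ps may
# truncate long usernames, so treat the count as approximate.
user = getpass.getuser()
out = subprocess.run(["ps", "-eLf"], capture_output=True, text=True, check=True)
n = sum(line.startswith(user) for line in out.stdout.splitlines()[1:])
print(f"Threads owned by {user}: ~{n}")
```

If the count sits near the soft limit while the jobs run, the failures below are consistent with thread exhaustion rather than a Ray bug per se.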
What happened + What you expected to happen
I believe some of what I'm reporting is known, but I couldn't find a single source that summarizes my use case and the observed issues. I'm using Ray for thread management (for reinforcement learning) with custom code (specifically this repository: https://github.com/YeWR/EfficientZero). I'm deploying independent Slurm jobs, each of which attempts to start its own Ray cluster to manage that job's workers (self-play, replay buffer, storage, training workers, etc.), with

```python
ray.init(num_gpus=args.num_gpus, num_cpus=args.num_cpus, address='local')
```

or without the `address='local'` argument. When Slurm assigns several of these jobs to the same node, the jobs become unstable. The most common error is:

```
logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
```

In some of these cases, this (or a similar) error crashes at least some of the workers, but not the Ray cluster itself. Ray then keeps writing to its logs indefinitely, over-filling /tmp and destabilizing nodes on the cluster.

Related questions -
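As an aside, here is a minimal sketch of how each job's cluster could be started in stricter isolation. `include_dashboard` and `_temp_dir` are existing `ray.init()` parameters, and `SLURM_JOB_ID` is set by Slurm inside every job, but the specific values and whether this avoids the instability reported here are untested assumptions:

```python
import os
import ray

# Give each Slurm job its own Ray session directory so co-located clusters
# don't share /tmp/ray state or logs. The path is an assumption; point it
# at your cluster's scratch space.
job_id = os.environ.get("SLURM_JOB_ID", "local")
ray.init(
    num_gpus=1,
    num_cpus=10,
    address="local",                 # always start a fresh cluster, never attach
    include_dashboard=False,         # fewer helper processes/threads per job
    _temp_dir=f"/tmp/ray_{job_id}",  # per-job session dir, keeps logs apart
)
```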
Versions / Dependencies
ray, version 2.4.0
Reproduction script
Submitting multiple jobs to the same node using Slurm, each running this command:

```
python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force
```
with this repository: https://github.com/YeWR/EfficientZero
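To make the reproduction concrete, a hypothetical submission loop; the node name, resource requests, and job count are assumptions, not from the original report:

```python
import subprocess

# Submit several copies of the training job pinned to one node so that
# Slurm co-locates the independent Ray clusters.
NODE = "node001"  # assumed node name; set for your cluster
TRAIN_CMD = (
    "python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train "
    "--amp_type torch_amp --num_gpus 1 --num_cpus 10 "
    "--cpu_actor 1 --gpu_actor 1 --force"
)

for i in range(4):
    subprocess.run(
        [
            "sbatch",
            f"--job-name=ez_{i}",
            f"--nodelist={NODE}",  # pin all jobs to the same node
            "--gres=gpu:1",
            "--cpus-per-task=10",
            f"--wrap={TRAIN_CMD}",
        ],
        check=True,
    )
```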
Issue Severity

Medium: It is a significant difficulty, but I can work around it.