
Running multiple instances (clusters) of ray on the same node with slurm is unstable #36554

Open YanivO1123 opened 1 year ago

YanivO1123 commented 1 year ago

What happened + What you expected to happen

I believe some of what I'm reporting is known, but I couldn't find a single source that summarizes my use case and the issues I'm observing. I'm using Ray for thread management (for reinforcement learning) with custom code (specifically this repository: https://github.com/YeWR/EfficientZero). I deploy independent SLURM jobs, each of which starts its own Ray cluster (with `ray.init(num_gpus=args.num_gpus, num_cpus=args.num_cpus, address='local')`, or without the `address='local'` argument) to manage that job's workers (self-play, replay buffer, storage, training workers, etc.).

When SLURM assigns several of these jobs to the same node, the jobs become unstable. The most common error is `logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable`. In some cases this (or a similar) error crashes at least some of the workers, but not the Ray cluster itself. Ray then keeps writing to its logs indefinitely, over-filling `/tmp` and destabilizing nodes on the cluster.
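
Concretely, each SLURM job does roughly the following (a simplified sketch; the argument names come from my launch command in the reproduction section below):

```python
import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--num_gpus", type=int, default=1)
parser.add_argument("--num_cpus", type=int, default=10)
args, _ = parser.parse_known_args()

# Each SLURM job starts its own, independent local Ray cluster on whatever
# node SLURM happens to place it on.
ray.init(num_gpus=args.num_gpus, num_cpus=args.num_cpus, address="local")

# The job then launches its own workers (self-play, replay buffer, storage,
# training, ...) as Ray actors/tasks inside that private cluster.
```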

Related questions -

  1. Is this use case (deploying multiple, independent Ray clusters for worker management on the same node) known not to work with SLURM? Are there workarounds? I don't believe consolidating all of my seeds/agents into one large, centralized Ray cluster is a solution in my setting, because: 1) it would prevent SLURM from scheduling the different jobs independently, which is part of the policy of the cluster I have access to; and 2) other users running Ray on the same cluster could still hit similar problems (or cause similar problems for my jobs).
  2. Is it possible to configure Ray with a maximum log size, to prevent the logs from growing without bound in a case like this (something crashes, error logs keep being written, `/tmp` fills up, and the machine Ray runs on becomes unstable)? See the sketch after this list for the kind of knob I have in mind.
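
For (2), this is roughly the kind of control I'm hoping for. I haven't verified that Ray's log-rotation environment variables cover all of the files involved here, so please treat this as a sketch rather than a confirmed fix:

```python
import os

import ray

# Sketch: as far as I can tell, Ray's log rotation is controlled by these
# environment variables, which are read when the cluster processes start,
# so they have to be set before ray.init().
os.environ["RAY_ROTATION_MAX_BYTES"] = str(100 * 1024 * 1024)  # 100 MiB per log file
os.environ["RAY_ROTATION_BACKUP_COUNT"] = "1"  # keep a single rotated backup

ray.init(num_gpus=1, num_cpus=10, address="local")
```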

Versions / Dependencies

ray, version 2.4.0

Reproduction script

Submit multiple jobs to the same node using SLURM, each running `python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force` with this repository: https://github.com/YeWR/EfficientZero

Issue Severity

Medium: It is a significant difficulty but I can work around it.

BrunoBelucci commented 4 weeks ago

Have you found a solution or workaround? I am struggling with exactly the same thing!

YanivO1123 commented 4 weeks ago

> Have you found a solution or workaround? I am struggling with exactly the same thing!

Yes, through some experimentation we found the following:

  1. It seems that at most two jobs can use the same node without interfering: every time a third job was deployed on the same node, at least one job crashed. So I changed my deployment setup to never submit more than two jobs to the same node.
  2. Even with two jobs, initializing both at roughly the same time crashes them fairly reliably, so I also changed my setup to deploy the second job to a node only after the first has initialized successfully and started running (see the rough sketch after this list).
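
We stagger the jobs at the SLURM submission level, but for completeness, here is a hypothetical, untested sketch of how one could do something similar from inside the job itself by retrying `ray.init` with a backoff (not what we actually run):

```python
import random
import time

import ray


def init_ray_with_retry(num_gpus, num_cpus, max_attempts=5):
    """Start a local Ray cluster, backing off and retrying if startup fails.

    Hypothetical sketch -- we actually stagger job submission at the SLURM level.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return ray.init(num_gpus=num_gpus, num_cpus=num_cpus, address="local")
        except Exception as exc:  # startup failures surface as generic exceptions
            wait = random.uniform(30, 120) * attempt
            print(f"ray.init attempt {attempt} failed ({exc!r}); retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError(f"Could not start a Ray cluster after {max_attempts} attempts")
```
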
BrunoBelucci commented 4 weeks ago

Those are some pretty severe limitations. I don't think that will work in my case, as I launch 10 to 20 jobs that each initialize a Ray cluster locally on the same node. At first it seemed to work, but then something went wrong and I managed to slow down the whole cluster, even nodes that were not being used. I suspect this has something to do with the many threads Ray spawns, because I am already hitting the thread limit. I will investigate a bit more to see whether I can find another workaround.
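
For reference, this is the kind of quick check I'm planning to run to see whether we are hitting the per-user process/thread limit (Linux only, rough sketch; `ps` may truncate long usernames):

```python
import getpass
import resource
import subprocess

user = getpass.getuser()

# Per-user limit on processes/threads (what `ulimit -u` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

# `ps -eLo user=` prints one row per thread on the machine; count ours.
out = subprocess.run(
    ["ps", "-eLo", "user="], capture_output=True, text=True, check=True
).stdout
n_threads = sum(1 for line in out.splitlines() if line.strip() == user)

print(f"user={user}: {n_threads} threads, RLIMIT_NPROC soft={soft} hard={hard}")
```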