
Running multiple instances (clusters) of ray on the same node with slurm is unstable #36554

Open YanivO1123 opened 1 year ago

YanivO1123 commented 1 year ago

What happened + What you expected to happen

I believe some of what I'm reporting is known, but I couldn't find a single source that summarizes my use case and the issues I'm observing. I'm using Ray for thread management (for reinforcement learning) with custom code (specifically this repository: https://github.com/YeWR/EfficientZero). I deploy independent SLURM jobs, each of which starts its own Ray cluster (with `ray.init(num_gpus=args.num_gpus, num_cpus=args.num_cpus, address='local')`, or without the `address='local'` argument) to manage that job's workers (self-play, replay buffer, storage, training workers, etc.).

When SLURM assigns several of these jobs to the same node, the jobs become unstable. The most common error is `logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable`. In some cases this (or a similar) error crashes at least some of the workers, but not the Ray cluster itself. Ray then keeps writing to its logs indefinitely, over-filling `/tmp` and destabilizing nodes on the cluster.
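
Concretely, each SLURM job does roughly the following (a simplified sketch; the argument names come from my launch command in the reproduction section below):

```python
import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--num_gpus", type=int, default=1)
parser.add_argument("--num_cpus", type=int, default=10)
args, _ = parser.parse_known_args()

# Each SLURM job starts its own, independent local Ray cluster on whatever
# node SLURM happens to place it on.
ray.init(num_gpus=args.num_gpus, num_cpus=args.num_cpus, address="local")

# The job then launches its own workers (self-play, replay buffer, storage,
# training, ...) as Ray actors/tasks inside that private cluster.
```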

Related questions -

  1. Is this use case (deploying multiple, independent Ray clusters for worker management on the same node) known not to work with SLURM? Are there workarounds? I don't believe consolidating all of my seeds/agents into one large, centralized Ray cluster is a solution in my setting, because: 1) it would prevent SLURM from scheduling the different jobs independently, which is part of the policy of the cluster I have access to; and 2) other users running Ray on the same cluster could still hit similar problems (or cause similar problems for my jobs).
  2. Is it possible to configure Ray with a maximum log size, to prevent the logs from growing without bound in a case like this (something crashes, error logs keep being written, `/tmp` fills up, and the machine Ray runs on becomes unstable)? See the sketch after this list for the kind of knob I have in mind.
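
For (2), this is roughly the kind of control I'm hoping for. I haven't verified that Ray's log-rotation environment variables cover all of the files involved here, so please treat this as a sketch rather than a confirmed fix:

```python
import os

import ray

# Sketch: as far as I can tell, Ray's log rotation is controlled by these
# environment variables, which are read when the cluster processes start,
# so they have to be set before ray.init().
os.environ["RAY_ROTATION_MAX_BYTES"] = str(100 * 1024 * 1024)  # 100 MiB per log file
os.environ["RAY_ROTATION_BACKUP_COUNT"] = "1"  # keep a single rotated backup

ray.init(num_gpus=1, num_cpus=10, address="local")
```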

Versions / Dependencies

ray, version 2.4.0

Reproduction script

Submit multiple jobs to the same node using SLURM, each running `python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force` with this repository: https://github.com/YeWR/EfficientZero

Issue Severity

Medium: It is a significant difficulty but I can work around it.

BrunoBelucci commented 4 weeks ago

Have you found a solution or workaround? I am struggling with exactly the same thing!

YanivO1123 commented 4 weeks ago

> Have you found a solution or workaround? I am struggling with exactly the same thing!

Yes, through some experimentation we found the following:

  1. It seems that at most two jobs can use the same node without interfering: every time a third job was deployed on the same node, at least one job crashed. So I changed my deployment setup to never submit more than two jobs to the same node.
  2. Even with two jobs, initializing both at roughly the same time crashes them fairly reliably, so I also changed my setup to deploy the second job to a node only after the first has initialized successfully and started running (see the rough sketch after this list).
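
We stagger the jobs at the SLURM submission level, but for completeness, here is a hypothetical, untested sketch of how one could do something similar from inside the job itself by retrying `ray.init` with a backoff (not what we actually run):

```python
import random
import time

import ray


def init_ray_with_retry(num_gpus, num_cpus, max_attempts=5):
    """Start a local Ray cluster, backing off and retrying if startup fails.

    Hypothetical sketch -- we actually stagger job submission at the SLURM level.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return ray.init(num_gpus=num_gpus, num_cpus=num_cpus, address="local")
        except Exception as exc:  # startup failures surface as generic exceptions
            wait = random.uniform(30, 120) * attempt
            print(f"ray.init attempt {attempt} failed ({exc!r}); retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError(f"Could not start a Ray cluster after {max_attempts} attempts")
```
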
BrunoBelucci commented 4 weeks ago

Those are some pretty severe limitations. I don't think that will work in my case, as I launch 10 to 20 jobs that each initialize a Ray cluster locally on the same node. At first it seemed to work, but then something went wrong and I managed to slow down the whole cluster, even nodes that were not being used. I suspect this has something to do with the many threads Ray spawns, because I am already hitting the thread limit. I will investigate a bit more to see whether I can find another workaround.
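
For reference, this is the kind of quick check I'm planning to run to see whether we are hitting the per-user process/thread limit (Linux only, rough sketch; `ps` may truncate long usernames):

```python
import getpass
import resource
import subprocess

user = getpass.getuser()

# Per-user limit on processes/threads (what `ulimit -u` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

# `ps -eLo user=` prints one row per thread on the machine; count ours.
out = subprocess.run(
    ["ps", "-eLo", "user="], capture_output=True, text=True, check=True
).stdout
n_threads = sum(1 for line in out.splitlines() if line.strip() == user)

print(f"user={user}: {n_threads} threads, RLIMIT_NPROC soft={soft} hard={hard}")
```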