smorad opened this issue 2 years ago (status: Open)
Hey @tupui, assigning this to you as I heard from Dharhas that you have been looking into it.
Any update on this ticket? I am also seeing similar issues.
@peytondmurray per our discussion can you please follow up on next steps for this ticket?
What happened + What you expected to happen
Related issue: https://github.com/ray-project/ray/issues/13607
Ray will bypass CPU limits set by SLURM and access all available CPUs on the node. This is a significant blocker for running Ray on SLURM systems, because Ray will pick cores that are already in use by other jobs. This means significantly slower jobs both for the Ray user and for whoever was allocated the CPUs Ray is using.
Doing something like

ray.init(num_cpus=SLURM_CPU_LIMIT)

will not fix this. For example, if another job on the same node is using CPU IDs 0:SLURM_CPU_LIMIT, Ray will grab those same CPU IDs 0:SLURM_CPU_LIMIT instead of CPU IDs SLURM_CPU_LIMIT:2 * SLURM_CPU_LIMIT.
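A quick way to see the mismatch, as a sketch: compare the CPU IDs SLURM actually grants the job with what a Ray worker reports. The use of SLURM_CPUS_PER_TASK and the fallback value of 4 below are placeholders, not values from my setup.

```python
import os
import ray

# CPU IDs this batch job is allowed to use (the set SLURM granted).
allowed = sorted(os.sched_getaffinity(0))
print("SLURM allows CPU IDs:", allowed)

# Cap Ray's logical CPU count to the allocation; note this only limits the
# scheduler's bookkeeping, it does not pin worker processes to those IDs.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "4"))  # placeholder default
ray.init(num_cpus=num_cpus)

@ray.remote
def worker_affinity():
    # Report the CPU IDs this Ray worker process may be scheduled on.
    return sorted(os.sched_getaffinity(0))

print("Ray worker sees CPU IDs:", ray.get(worker_affinity.remote()))
```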
Versions / Dependencies
2.0.0
Reproduction script
SLURM directives
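Something along these lines reproduces the setup; the #SBATCH directive values below are placeholders rather than the exact ones from my jobs (a Python script with a shebang can be submitted directly with sbatch, since #SBATCH lines are comments).

```python
#!/usr/bin/env python3
#SBATCH --job-name=ray-repro
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
import os
import time
import ray

# Size Ray's logical CPU pool from the SLURM allocation.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "4"))
ray.init(num_cpus=num_cpus)

@ray.remote
def busy(seconds: float) -> int:
    # CPU-bound loop; wall time per batch grows when another job's processes
    # are running on the same physical cores.
    end = time.time() + seconds
    count = 0
    while time.time() < end:
        count += 1
    return count

start = time.time()
ray.get([busy.remote(10.0) for _ in range(num_cpus)])
print(f"Iteration took {time.time() - start:.1f}s for {num_cpus} tasks")
```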
You can clearly see from the time per training iteration which of my jobs were scheduled on the same nodes; those jobs ran roughly 4x slower.
Issue Severity
Medium: It is a significant difficulty but I can work around it.