ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[slurm] CPU resource allocation not respected on SLURM clusters #28586

Open smorad opened 2 years ago

smorad commented 2 years ago

What happened + What you expected to happen

Related issue: https://github.com/ray-project/ray/issues/13607

Ray bypasses the CPU limits set by SLURM and accesses all CPUs on the node. This is a significant blocker for running Ray on SLURM systems, because Ray will pick cores that are in use by other jobs. The result is significantly slower jobs for both the Ray user and whoever was allocated the CPUs Ray is using.

Doing something like ray.init(num_cpus=SLURM_CPU_LIMIT) will not fix this, because it only caps how many CPUs Ray uses, not which ones. For example, if another job on the same node is using CPU IDs 0:SLURM_CPU_LIMIT, we will grab those same CPU IDs 0:SLURM_CPU_LIMIT instead of CPU IDs SLURM_CPU_LIMIT:2 * SLURM_CPU_LIMIT.
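To check which CPU IDs a process is actually permitted to run on, the following minimal sketch may help. It assumes a Linux node, since os.sched_getaffinity is Linux-only, and it assumes the cluster's SLURM configuration restricts CPU affinity for the job, which depends on the site setup. The helper names allowed_cpu_ids and format_report are hypothetical and are not part of Ray or SLURM.

```python
import os


def allowed_cpu_ids():
    """Return the sorted CPU IDs this process may run on.

    Falls back to every ID in range(os.cpu_count()) on platforms
    that lack os.sched_getaffinity (it is Linux-only).
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def format_report(allowed, total):
    """Summarize how many of the node's CPUs this process may use."""
    return f"allowed {len(allowed)}/{total} CPUs, IDs: {allowed}"


if __name__ == "__main__":
    # os.cpu_count() reports every CPU on the node, not only the allocated ones.
    print(format_report(allowed_cpu_ids(), os.cpu_count()))
```

Run inside the SLURM job, this shows whether the allowed IDs are a subset of the node's CPUs. If the allowed set is as large as the whole node, the scheduler is not restricting affinity for the job, and a CPU count alone cannot tell Ray which cores to avoid.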

Versions / Dependencies

2.0.0

Reproduction script

SLURM directives

#!/bin/bash
#! Name of the job:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=1

Ray Tune status output

== Status ==
Current time: 2022-09-17 19:21:12 (running for 00:00:11.75)
Memory usage on this node: 53.7/1007.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 5.0/128 CPUs, 0.14/1 GPUs, 0.0/801.35 GiB heap, 0.0/186.26 GiB objects (0.0/1.0 accelerator_type:A100)
Result logdir: /tmp/ray_results/PPO
Number of trials: 9/9 (8 PENDING, 1 RUNNING)
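The mismatch between the requested allocation (--cpus-per-task=1) and the 128 CPUs reported above can be detected programmatically. The sketch below reads SLURM_CPUS_PER_TASK, which SLURM sets when --cpus-per-task is given, and compares it with a detected CPU count. The function names slurm_cpus_per_task and oversubscribed are hypothetical, chosen for illustration only.

```python
import os


def slurm_cpus_per_task(environ=None):
    """Return SLURM_CPUS_PER_TASK as a positive int, or None if unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get("SLURM_CPUS_PER_TASK", "")
    return int(raw) if raw.isdigit() and int(raw) > 0 else None


def oversubscribed(detected, environ=None):
    """True if the detected CPU count exceeds the SLURM per-task limit."""
    limit = slurm_cpus_per_task(environ)
    return limit is not None and detected > limit


if __name__ == "__main__":
    detected = os.cpu_count() or 1
    print("limit:", slurm_cpus_per_task(), "detected:", detected)
    print("oversubscribed:", oversubscribed(detected))
```

The parsed limit could be passed as ray.init(num_cpus=...), but as noted in the description this caps only the number of CPUs and does not control which CPU IDs are used.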

The time per training iteration clearly shows which of my jobs are scheduled on the same node: those jobs run about 4x slower.

[Image: W&B chart, 17_09_2022, 19_45_58]

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rickyyx commented 1 year ago

Hey @tupui, assigning this to you as I heard from Dharhas that you have been looking into it.

aruhela commented 1 year ago

Any update on this ticket? I am also seeing similar issues.

anyscalesam commented 1 year ago

@peytondmurray, per our discussion, can you please follow up on next steps for this ticket?