ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.84k stars 5.57k forks

[Core] Incorrect detection of cpus #34846

Open akshara08 opened 1 year ago

akshara08 commented 1 year ago

What happened + What you expected to happen

If I understand correctly, Ray detects the number of CPUs using multiprocessing.cpu_count(). However, according to the multiprocessing docs, this is not the number of usable CPUs; we should instead use len(os.sched_getaffinity(0)) to get the number of CPUs the current process can actually use.

I spent all afternoon trying to understand why Ray was detecting the wrong number of CPUs and noticed that multiprocessing.cpu_count() returns the total number of CPUs on a given node. This is problematic if the node is shared, and it results in an uninformative error. len(os.sched_getaffinity(0)) gives the right number of CPUs. I can make a PR to fix this if it seems okay.

Currently, the workaround is to set the number of CPUs explicitly when initializing Ray: ray.init(num_cpus=len(os.sched_getaffinity(0)))

Versions / Dependencies

Slurm on HPC

Reproduction script

I don't think it can be reproduced unless you have a system with shared resources and some are restricted
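
On a Linux machine, the mismatch can be approximated without a shared cluster by narrowing the process's own affinity mask. The following is a minimal sketch, not part of the original report: the helper name compare_cpu_counts is hypothetical, and it assumes a platform where os.sched_setaffinity is available.

```python
import multiprocessing
import os


def compare_cpu_counts():
    """Return (total CPUs on the node, CPUs this process may run on)."""
    total = multiprocessing.cpu_count()
    # os.sched_getaffinity is only available on some Unix platforms (e.g. Linux).
    usable = len(os.sched_getaffinity(0))
    return total, usable


if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)
    try:
        # Pin this process to a single CPU that it is already allowed to use,
        # imitating a scheduler that restricts the job to part of the node.
        os.sched_setaffinity(0, {min(original)})
        total, usable = compare_cpu_counts()
        print(f"multiprocessing.cpu_count():  {total}")
        print(f"len(os.sched_getaffinity(0)): {usable}")
    finally:
        # Restore the original affinity mask.
        os.sched_setaffinity(0, original)
else:
    print("os.sched_setaffinity is not available on this platform")
```

On a multi-core Linux machine the two printed values should differ, since multiprocessing.cpu_count() ignores the affinity mask.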

Issue Severity

Medium: It is a significant difficulty but I can work around it.

grimreaper commented 1 year ago

This is easy to reproduce, but the change causes problems if it is implemented without a fallback:

[572 20:37:30.964 ubuntu@devbox ~]∴cat testpy.py
import os
import multiprocessing

osg = os.sched_getaffinity(0)
print(osg)
print(len(osg))

mpc = multiprocessing.cpu_count()
print(mpc)

[570 20:37:28.751 ubuntu@devbox ~]∴taskset -c 1 python testpy.py
{1}
1
4
[571 20:37:29.014 ubuntu@devbox ~]∴python testpy.py             
{0, 1, 2, 3}
4
4

For example, macOS does not support os.sched_getaffinity:

∴python
>>> import os
>>> os.sched_getaffinity(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'os' has no attribute 'sched_getaffinity'

I suspect we'll want to keep multiprocessing.cpu_count() as a fallback.
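
A minimal sketch of what such a fallback could look like. The helper name usable_cpu_count is hypothetical and this is not Ray's actual implementation; it only illustrates preferring the affinity mask where the platform exposes it.

```python
import multiprocessing
import os


def usable_cpu_count():
    """Hypothetical helper: CPUs available to this process, with a fallback."""
    # Prefer the affinity mask where the platform provides it (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # macOS and Windows lack os.sched_getaffinity, so fall back to the
    # total CPU count reported for the machine.
    return multiprocessing.cpu_count()


print(usable_cpu_count())
```

With a helper like this, the workaround from earlier in the thread would become ray.init(num_cpus=usable_cpu_count()) and would no longer raise AttributeError on macOS.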

akshara08 commented 1 year ago

Yeah, makes sense. It looks like sched_getaffinity is only available on some Unix systems. I'm assuming the cases where multiprocessing.cpu_count() differs from len(os.sched_getaffinity(0)) occur on systems that allow shared resources, and hopefully those systems expose the interface that sched_getaffinity relies on. macOS and Windows do not expose it, which causes the error; in those cases I think it's fair to fall back to multiprocessing.cpu_count().

I can make a PR in a couple of days.

akshara08 commented 1 year ago

I'm sorry for the delay on this. Life happened. I should free up in the next couple of weeks. It's still on my radar.

anyscalesam commented 6 months ago

@akshara08 did you get a chance to work on that PR? Can you link it here?

Superskyyy commented 6 months ago

I can take over if @akshara08 is away. @anyscalesam please assign me.

akshara08 commented 6 months ago

Hi. Yes, sorry, I'm caught up with something. Feel free to take it from me.