ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.45k stars 5.67k forks

[Core] alloc whole node to run some job #33041

Open wangyugui-e16 opened 1 year ago

wangyugui-e16 commented 1 year ago

Description

Is it possible to allocate a whole node to run a job? Inside the job, it would detect the number of CPUs and then start that many threads (as reported by nproc).

Use case

The servers in a cluster have different CPU counts: server1: 72, server2: 48, server3: 40, server4: 40.

It would be better to allocate a whole node with all of its CPUs than to request a fixed number of CPUs.

wangyugui-e16 commented 1 year ago

Can we use 'CPU:-1' to allocate all CPUs of a node, and 'GPU:-1' to allocate all GPUs of a node?

rickyyx commented 1 year ago

Hey @wangyugui-e16 - what kind of APIs are you referring to here?

Did you mean when you add a node to a cluster? Or when you dispatch tasks/actors to an already existing Ray cluster?

wangyugui-e16 commented 1 year ago

I mean dispatching tasks/actors to an already existing Ray cluster.

Inside the task, we can detect the number of CPUs with the 'nproc' command and then use all of them.
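For illustration, here is a minimal Python sketch of what "detect the CPU count and use all of them" could look like inside a task. It is only a sketch: the helper names `detect_cpus` and `run_with_all_cpus` are hypothetical, and it reports what the operating system exposes to the process (similar to nproc), not what Ray has reserved for the task.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def detect_cpus() -> int:
    """Return the number of CPUs usable by this process, similar to `nproc`."""
    try:
        # Linux only: respects CPU affinity, like `nproc` does.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total CPU count.
        return os.cpu_count() or 1


def run_with_all_cpus(work, items):
    """Run `work` over `items` with one thread per detected CPU."""
    n = detect_cpus()
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves the order of `items` in the results.
        return n, list(pool.map(work, items))


if __name__ == "__main__":
    n, results = run_with_all_cpus(lambda x: x * x, range(8))
    print(f"used {n} threads:", results)
```

Note that Ray's scheduler does not know about these threads, so the task still has to request the matching number of CPUs at submission time to avoid oversubscribing the node.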

rickyyx commented 1 year ago

So you want to dispatch a task to a node and allow the task to use as many CPUs as are available?

I guess it's hard to do that inside a task, but you could do something like this at submission:

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote
def f():
    pass

sched_strategy = NodeAffinitySchedulingStrategy(
    node_id=<node id>,
    soft=False,
)

# f will then be scheduled on the node specified above, with the number of CPUs you request.
f.options(num_cpus=<cpus-on-a-node>, scheduling_strategy=sched_strategy).remote()
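
One way to fill in the two placeholders above is to read them from `ray.nodes()`, which lists each node's ID and total resources. The following is a rough sketch under assumptions, not an official recipe: the helper names `whole_node_requests` and `submit_whole_node_tasks` are made up for this example, and it assumes a cluster is already running and reachable with `ray.init(address="auto")`.

```python
def whole_node_requests(nodes):
    """Map node ID -> total CPU count for every live node that has CPUs.

    `nodes` is a list of dicts shaped like the output of ray.nodes().
    """
    requests = {}
    for node in nodes:
        cpus = node.get("Resources", {}).get("CPU", 0)
        if node.get("Alive") and cpus > 0:
            requests[node["NodeID"]] = cpus
    return requests


def submit_whole_node_tasks():
    """Submit one task per node, each requesting all of that node's CPUs."""
    # Imported here so the helper above can be used without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init(address="auto")  # assumption: connect to an existing cluster

    @ray.remote
    def report():
        import os
        import socket
        # Inside the task: report the host name and the CPU count seen by the OS.
        return socket.gethostname(), os.cpu_count()

    refs = []
    for node_id, cpus in whole_node_requests(ray.nodes()).items():
        strategy = NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
        refs.append(
            report.options(num_cpus=cpus, scheduling_strategy=strategy).remote()
        )
    return ray.get(refs)


if __name__ == "__main__":
    for host, cpus in submit_whole_node_tasks():
        print(host, cpus)
```

Because each task requests every CPU on its node, it will stay pending until no other task or actor is holding CPUs there.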

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.