ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.06k stars 5.6k forks

[New scheduler] Don't assume 1-CPU tasks are feasible #12870

Open wuisawesome opened 3 years ago

wuisawesome commented 3 years ago

What is the problem?

The scheduler currently assumes in a few places that a 1-CPU task is always feasible (for example, in our heartbeat code).

This assumption can break down on a highly elastic cluster whose head node is started with num_cpus=0: until a worker node joins, no node in the cluster has a CPU, so a 1-CPU task is not actually feasible.
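To make the failure mode concrete, here is a minimal sketch of a feasibility check in plain Python. It is not Ray's scheduler code: `is_feasible`, the resource dictionaries, and the node sizes are all hypothetical, and it assumes a task shape is feasible when at least one node's total resources cover it.

```python
from typing import Dict, List

# Hypothetical resource map: resource name -> total amount on a node.
Resources = Dict[str, float]


def is_feasible(shape: Resources, node_totals: List[Resources]) -> bool:
    """Return True if at least one node's total resources cover the shape."""
    return any(
        all(total.get(name, 0.0) >= amount for name, amount in shape.items())
        for total in node_totals
    )


one_cpu_task = {"CPU": 1.0}

# Head node started with num_cpus=0; no worker has joined yet.
head_only = [{"CPU": 0.0, "memory": 8.0}]
print(is_feasible(one_cpu_task, head_only))  # False

# Assumed scale-up: a worker with 4 CPUs joins the cluster.
with_worker = head_only + [{"CPU": 4.0, "memory": 16.0}]
print(is_feasible(one_cpu_task, with_worker))  # True
```

Under this assumed model, any code path that treats the 1-CPU shape as feasible without running such a check would report the wrong answer for the head-only cluster.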

rkooo567 commented 3 years ago

Is this still the case after the introduction of infeasible tasks?

wuisawesome commented 3 years ago

I think so, because we special-case the 1-CPU shape in the heartbeat: https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L529