[RFC][core] Option to avoid scheduling tasks to nodes with disk full

stephanie-wang commented 1 year ago

Description

For load-balancing purposes, it is often desirable to schedule a task onto a node with lower disk utilization. A user might also require a certain amount of disk space to run a task; ideally, if the task fails on one node for lack of space, it is automatically retried on another node that does have enough disk space.

We can make two possible enhancements:

[ ] Consider disk utilization in the scheduling policy. Alternatively, we could consider only the disk space used by Ray's spilled objects.
[ ] Support arbitrary user-defined scheduling constraints, like "only schedule this task on a node with X disk space".
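
To make the two options concrete, here is a minimal sketch of what a disk-aware placement decision could look like. It is not Ray's scheduler and uses no Ray API: `NodeDisk`, `local_disk_report`, and `pick_node` are hypothetical names introduced only for illustration, and the only real call is the standard library's `shutil.disk_usage`. The sketch assumes each node can report its free and total bytes for the relevant filesystem.

```python
import shutil
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NodeDisk:
    """Hypothetical per-node disk report (not a Ray data structure)."""
    node_id: str
    free_bytes: int
    total_bytes: int


def local_disk_report(node_id: str, path: str = "/") -> NodeDisk:
    """Build a report for the filesystem containing `path` on this machine."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return NodeDisk(node_id, usage.free, usage.total)


def pick_node(nodes: Sequence[NodeDisk], required_bytes: int = 0) -> Optional[NodeDisk]:
    """Return the eligible node with the largest free-disk fraction.

    A node is eligible if it has at least `required_bytes` free (the
    user-defined constraint). Returns None if no node qualifies.
    """
    eligible = [n for n in nodes if n.free_bytes >= required_bytes]
    if not eligible:
        return None
    # max(..., 1) guards against a zero-sized filesystem report.
    return max(eligible, key=lambda n: n.free_bytes / max(n.total_bytes, 1))
```

With `required_bytes=0` this behaves like the first option (prefer nodes with lower disk utilization); a non-zero value models the second option (a hard constraint), where a `None` result would mean the task stays pending rather than landing on a full disk.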

Use case

No response

stephanie-wang commented 1 year ago

cc @jjyao

pedropgusmao commented 1 year ago

I'm very interested in this. Object spilling is becoming a problem when it fills the disks. A wrapper that reports whether a node is currently spilling would be great.
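
As a rough stopgap, something like the following could approximate such a wrapper from outside Ray. It is only a sketch under assumptions: `dir_size_bytes` and `spill_status` are hypothetical helpers, the caller must already know the spill directory (the path is a parameter, not looked up from Ray), and it can only tell that spilled files exist on disk, not that spilling is in progress at that moment.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # File was removed between listing and stat; skip it.
                pass
    return total


def spill_status(spill_dir: str) -> dict:
    """Report whether spilled data is present in `spill_dir` and how much.

    `spill_dir` is assumed to be the directory configured for object
    spilling on this node; a missing directory counts as nothing spilled.
    """
    used = dir_size_bytes(spill_dir) if os.path.isdir(spill_dir) else 0
    return {"spilling": used > 0, "spilled_bytes": used}
```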

ray-project / ray

[RFC][core] Option to avoid scheduling tasks to nodes with disk full #30843
