ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RFC][core] Option to avoid scheduling tasks to nodes with disk full #30843

Open stephanie-wang opened 1 year ago

stephanie-wang commented 1 year ago

Description

For load-balancing purposes, it is often desirable to schedule a task onto a node with less disk space in use. A user might also require a certain amount of disk space to run a task; ideally, if the task fails on one node because the disk is full, it would be retried automatically on another node that does have enough disk space.
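To make the desired behavior concrete, here is a minimal sketch of disk-aware node selection in plain Python. It is not Ray's scheduler and does not use any Ray API; `NodeDisk`, `pick_node`, and the `exclude` parameter are hypothetical names introduced only for illustration. It assumes each node can report its free disk space (for example via `shutil.disk_usage`).

```python
import shutil
from dataclasses import dataclass
from typing import Optional

GiB = 1024 ** 3


@dataclass
class NodeDisk:
    node_id: str      # hypothetical node identifier
    free_bytes: int   # free disk space reported by that node


def local_free_bytes(path: str = "/") -> int:
    """Free bytes on the filesystem containing `path` (what a node could report)."""
    return shutil.disk_usage(path).free


def pick_node(
    nodes: list[NodeDisk],
    required_bytes: int = 0,
    exclude: frozenset[str] = frozenset(),
) -> Optional[NodeDisk]:
    """Return the node with the most free disk that satisfies the requirement.

    Nodes listed in `exclude` (e.g. ones where the task already failed)
    are skipped. Returns None if no node qualifies.
    """
    candidates = [
        n for n in nodes
        if n.node_id not in exclude and n.free_bytes >= required_bytes
    ]
    return max(candidates, key=lambda n: n.free_bytes, default=None)


nodes = [
    NodeDisk("node-a", 5 * GiB),
    NodeDisk("node-b", 50 * GiB),
    NodeDisk("node-c", 20 * GiB),
]

# First attempt: prefer the node with the most free disk.
first = pick_node(nodes, required_bytes=10 * GiB)

# Retry after a failure on the first node: exclude it and pick again.
retry = pick_node(nodes, required_bytes=10 * GiB, exclude=frozenset({first.node_id}))
```

In this sketch, the first attempt lands on `node-b` and the retry on `node-c`; `node-a` is never chosen because it cannot satisfy the 10 GiB requirement.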

We can make two possible enhancements:

Use case

No response

stephanie-wang commented 1 year ago

cc @jjyao

pedropgusmao commented 1 year ago

I'm very interested in this. Object spilling is becoming a problem for us because it fills the disks. A wrapper that reports whether a node is currently spilling would be great.
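As a rough illustration of the kind of wrapper meant here, the following sketch inspects a spill directory on the local node and reports how much data it holds and how full the underlying disk is. It uses only the Python standard library, not a Ray API; `SpillStatus`, `spill_status`, and the `min_free_fraction` threshold are hypothetical names, and the spill directory path is an assumption that the caller must supply.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class SpillStatus:
    spilled_bytes: int         # total size of files under the spill directory
    disk_free_fraction: float  # free / total for the filesystem holding it
    has_spilled_data: bool     # True if any spilled bytes are on disk
    disk_nearly_full: bool     # True if free fraction is below the threshold


def spill_status(spill_dir: str, min_free_fraction: float = 0.1) -> SpillStatus:
    """Summarize spill usage for an existing directory `spill_dir`."""
    spilled = 0
    for root, _dirs, files in os.walk(spill_dir):
        for name in files:
            try:
                spilled += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been deleted while we were walking.
                pass

    usage = shutil.disk_usage(spill_dir)
    free_fraction = usage.free / usage.total if usage.total else 0.0

    return SpillStatus(
        spilled_bytes=spilled,
        disk_free_fraction=free_fraction,
        has_spilled_data=spilled > 0,
        disk_nearly_full=free_fraction < min_free_fraction,
    )
```

A monitoring loop could call `spill_status(...)` periodically on each node and raise an alert, or stop submitting disk-heavy tasks, once `disk_nearly_full` becomes true.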