ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.13k stars 5.79k forks source link

Implement Adaptive Scaling for Distributed Task Scheduling #48536

Open VadisettyRahul opened 2 weeks ago

VadisettyRahul commented 2 weeks ago

Description

Ray currently relies on static configurations for task scheduling, limiting efficiency during dynamically changing workloads. Adding adaptive scaling would allow clusters to automatically expand or contract based on resource demands, improving both utilization and response times.

Proposed Solution:

1. Monitor Resource Usage:

2. Implement Auto-Scaling Logic:

3. Dynamic Task Assignment:

4. Testing & Validation:

Expected Outcome: This feature would enable clusters to dynamically respond to changing loads, improving resource efficiency and overall task execution speed.

Use case

No response

rynewang commented 1 week ago

@jjyao this is similar to the idea we talked about - a more flexible, configurable sched system.

arcyleung commented 1 day ago

Hi friends, this is also a feature needed by our team to perform adaptive SLA scheduling. I had discussed previously with @anyscalesam and there appears to be many requests from different teams for such a feature. As such I submitted a REP and also ran some experiments comparing power of 2 scheduling vs. the adaptive SLA one.

My colleague @Superskyyy suggested to move the discussion here because we want to show adaptivity can also be extended beyond hardware usages to SLA use cases, and an improvement at the Ray Core level can be enjoyed by top level frameworks like Ray Data/ Ray Serve.