Implement Adaptive Scaling for Distributed Task Scheduling

VadisettyRahul commented 2 weeks ago

Description

Ray currently relies on static configurations for task scheduling, which limits efficiency when workloads change dynamically. Adaptive scaling would let clusters expand or contract automatically based on resource demand, improving both utilization and response times.

Proposed Solution:

1. Monitor Resource Usage:

Add a monitoring module to track CPU, GPU, and memory usage across nodes.
Use Ray's existing metrics API to track real-time usage statistics and resource availability.
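
As a rough illustration of the monitoring step, here is a minimal sketch that turns a pair of resource snapshots into per-resource utilization. It assumes totals and free amounts arrive as plain dictionaries; on a live cluster, `ray.cluster_resources()` and `ray.available_resources()` return dictionaries of this shape. The helper name `utilization` and the sample numbers are hypothetical and not part of Ray.

```python
from typing import Dict


def utilization(total: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Return the used fraction (0.0 to 1.0) of each resource listed in `total`."""
    usage = {}
    for name, capacity in total.items():
        if capacity <= 0:
            continue  # skip zero-capacity resources to avoid dividing by zero
        # Assumption: a resource missing from `available` is treated as fully used.
        free = available.get(name, 0.0)
        usage[name] = min(1.0, max(0.0, (capacity - free) / capacity))
    return usage


# Hypothetical snapshot standing in for ray.cluster_resources() and
# ray.available_resources() on a real cluster.
total = {"CPU": 16.0, "GPU": 2.0, "memory": 64e9}
available = {"CPU": 4.0, "GPU": 2.0, "memory": 16e9}
snapshot = utilization(total, available)
print(snapshot)
```

Sampling this periodically would give the scaling logic a single number per resource to compare against thresholds.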

2. Implement Auto-Scaling Logic:

Develop scaling logic that activates when usage exceeds or drops below pre-defined thresholds.
Add configuration options to allow users to set upper and lower limits for scaling.
Use Ray’s autoscaler as a foundation, modifying it to support adaptive responses to real-time metrics.
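
One way the threshold logic could look, as a sketch only: a small config object holds the user-set limits, and a pure function maps current utilization to a target node count. `ScalingConfig`, its fields, its default values, and `desired_node_count` are all hypothetical names for illustration; none of them are existing Ray autoscaler options.

```python
from dataclasses import dataclass


@dataclass
class ScalingConfig:
    # Hypothetical fields and defaults, not existing Ray settings.
    upper_threshold: float = 0.80  # scale up above this utilization
    lower_threshold: float = 0.30  # scale down below this utilization
    min_nodes: int = 1
    max_nodes: int = 10


def desired_node_count(current_nodes: int, usage: float, cfg: ScalingConfig) -> int:
    """Return the node count to aim for, one step at a time, within the limits."""
    if usage > cfg.upper_threshold:
        target = current_nodes + 1
    elif usage < cfg.lower_threshold:
        target = current_nodes - 1
    else:
        target = current_nodes  # inside the band between thresholds: no change
    return max(cfg.min_nodes, min(cfg.max_nodes, target))


cfg = ScalingConfig()
print(desired_node_count(4, 0.92, cfg))  # usage above the upper threshold
print(desired_node_count(4, 0.10, cfg))  # usage below the lower threshold
print(desired_node_count(1, 0.10, cfg))  # already at min_nodes
```

Keeping a gap between the two thresholds means small fluctuations in usage do not cause the cluster to repeatedly grow and shrink.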

3. Dynamic Task Assignment:

Adjust task allocation dynamically based on resource availability, optimizing performance and load balancing.
Allow tasks to prioritize nodes with greater availability or lower load to minimize latency.
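
A minimal sketch of the node-preference idea, assuming the scheduler has a per-node load figure and free CPU count available. The function `pick_node`, the node ids, and the numbers are hypothetical; this is a simple least-loaded policy for illustration, not how Ray's scheduler currently chooses nodes.

```python
from typing import Dict, Optional


def pick_node(
    load: Dict[str, float],
    free_cpus: Dict[str, float],
    required_cpus: float,
) -> Optional[str]:
    """Return the least-loaded node with enough free CPUs, or None if none fits."""
    candidates = [n for n in load if free_cpus.get(n, 0.0) >= required_cpus]
    if not candidates:
        return None
    # Lowest load wins; ties are broken by node id so the result is deterministic.
    return min(candidates, key=lambda n: (load[n], n))


# Hypothetical cluster state.
load = {"node-a": 0.9, "node-b": 0.2, "node-c": 0.4}
free_cpus = {"node-a": 8.0, "node-b": 1.0, "node-c": 4.0}

print(pick_node(load, free_cpus, required_cpus=2.0))
print(pick_node(load, free_cpus, required_cpus=16.0))
```

Here `node-b` has the lowest load but too few free CPUs for a 2-CPU task, so the feasibility filter runs before the load comparison.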

4. Testing & Validation:

Design unit tests for threshold-based scaling, ensuring tasks are allocated efficiently.
Perform integration tests on clusters of varying sizes to confirm adaptive scaling functionality.

Expected Outcome: This feature would enable clusters to dynamically respond to changing loads, improving resource efficiency and overall task execution speed.

Use case

No response

rynewang commented 1 week ago

@jjyao this is similar to the idea we talked about - a more flexible, configurable sched system.

arcyleung commented 1 day ago

Hi friends, our team also needs this feature to perform adaptive SLA scheduling. I previously discussed this with @anyscalesam, and there appear to be many requests from different teams for such a feature. As such, I submitted a REP and also ran some experiments comparing power-of-2 scheduling with the adaptive SLA one.

My colleague @Superskyyy suggested moving the discussion here because we want to show that adaptivity can extend beyond hardware usage to SLA use cases, and that an improvement at the Ray Core level can benefit top-level frameworks like Ray Data and Ray Serve.

ray-project / ray

Implement Adaptive Scaling for Distributed Task Scheduling #48536
