zcin opened 5 months ago
@zcin I think the more general way to think of this is to expose our existing autoscaling algorithm with alternative/pluggable metrics. These could be built-in metrics that we collect (e.g., CPU, GPU consumption) or we could expose a mechanism to plug in user-defined metrics.
This would be quite future proof and avoid users having to "reinvent the wheel" for all of the edge case handing & stability fixes we've added.
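One hypothetical shape the pluggable-metrics idea could take (a sketch, not an existing Serve API): the metric source becomes a user-supplied callable, and the existing compare-against-target scaling logic is reused unchanged for built-in and user-defined metrics alike. All names here (`desired_replicas`, `metric`, the bounds parameters) are illustrative assumptions.

```python
import math
from typing import Callable


def desired_replicas(
    current_replicas: int,
    metric: Callable[[], float],  # pluggable: avg ongoing requests, CPU %, QPS, ...
    target_value: float,          # the target for that metric, per replica
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Scale proportionally to how far the observed metric is from its target.

    Hypothetical sketch of a pluggable-metric autoscaler core; the metric
    callable is the only piece that changes between metric types.
    """
    current_value = metric()
    raw = current_replicas * (current_value / target_value)
    return max(min_replicas, min(max_replicas, math.ceil(raw)))
```

With this shape, a built-in CPU metric and a user-defined metric are both just callables; e.g. `desired_replicas(4, lambda: 10.0, target_value=5.0)` doubles the replica count to 8, while `desired_replicas(4, lambda: 1.0, target_value=5.0)` scales down to the minimum.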
@edoakes Any updates on this? Happy to contribute if required. I'm currently looking for custom resource-based metrics to make autoscaling decisions. For now I just want it to be similar to SageMaker (QPS-based), where I want num instances to be = Total_QPS / some_constant. Curious if there's a `DeploymentHandle` API or something I can call to set `num_replicas`, so my head node can monitor these metrics and set that value.
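The SageMaker-style formula above is just arithmetic, so as a minimal sketch (plain Python, not a Ray Serve API; actually applying the result to `num_replicas` would still need a redeploy or an autoscaling hook, which is the gap this issue is about):

```python
import math


def replicas_from_qps(
    total_qps: float,
    qps_per_replica: float,  # the "some_constant": QPS one replica should absorb
    min_replicas: int = 1,
) -> int:
    """num instances = Total_QPS / some_constant, rounded up and floored at a minimum."""
    return max(min_replicas, math.ceil(total_qps / qps_per_replica))
```

For example, at 1000 QPS with each replica sized for 120 QPS this yields 9 replicas, and at zero traffic it holds the configured minimum.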
Currently, Ray Serve uses request-based autoscaling. The user defines `target_ongoing_requests`, the target number of requests they want to be concurrently executing on any replica at all times. When the average number of ongoing requests per replica exceeds or drops below that target, the algorithm correspondingly upscales or downscales the number of replicas.

One alternative we could support is resource-based autoscaling, which would set a target resource utilization and upscale or downscale based on that. This can be useful when resource utilization is a more accurate measure of how overloaded an application is than the number of ongoing requests.
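As a rough sketch of the request-based decision just described (illustrative only; the real implementation adds smoothing, upscale/downscale delays, and the edge-case handling mentioned earlier in this thread):

```python
import math


def request_based_desired(
    current_replicas: int,
    avg_ongoing_per_replica: float,
    target_ongoing_requests: float,
) -> int:
    """Scale the replica count by the ratio of observed to target ongoing requests.

    If each replica averages twice the target load, the count doubles;
    if it averages half the target, the count is roughly halved.
    """
    ratio = avg_ongoing_per_replica / target_ongoing_requests
    return max(1, math.ceil(current_replicas * ratio))
```

So with 3 replicas, a target of 2 ongoing requests, and an observed average of 4 per replica, the decision is to scale up to 6 replicas.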