zcin opened 5 months ago
@zcin I think the more general way to think of this is to expose our existing autoscaling algorithm with alternative/pluggable metrics. These could be built-in metrics that we collect (e.g., CPU, GPU consumption) or we could expose a mechanism to plug in user-defined metrics.
This would be quite future proof and avoid users having to "reinvent the wheel" for all of the edge case handing & stability fixes we've added.
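One hypothetical shape the pluggable-metrics idea could take (a sketch, not an existing Serve API): the metric source becomes a user-supplied callable, and the existing compare-against-target scaling logic is reused unchanged for built-in and user-defined metrics alike. All names here (`desired_replicas`, `metric`, the bounds parameters) are illustrative assumptions.

```python
import math
from typing import Callable


def desired_replicas(
    current_replicas: int,
    metric: Callable[[], float],  # pluggable: avg ongoing requests, CPU %, QPS, ...
    target_value: float,          # the target for that metric, per replica
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Scale proportionally to how far the observed metric is from its target.

    Hypothetical sketch of a pluggable-metric autoscaler core; the metric
    callable is the only piece that changes between metric types.
    """
    current_value = metric()
    raw = current_replicas * (current_value / target_value)
    return max(min_replicas, min(max_replicas, math.ceil(raw)))
```

With this shape, a built-in CPU metric and a user-defined metric are both just callables; e.g. `desired_replicas(4, lambda: 10.0, target_value=5.0)` doubles the replica count to 8, while `desired_replicas(4, lambda: 1.0, target_value=5.0)` scales down to the minimum.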
@edoakes Any updates on this? Happy to contribute if required. I'm currently looking for custom resource-based metrics to make autoscaling decisions. For now I just want it to be similar to SageMaker (QPS-based), where I want num instances to be = Total_QPS / some_constant. Curious if there's a `DeploymentHandle` API or something I can call to set `num_replicas`, so my head node can monitor these metrics and set that value.
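The SageMaker-style formula above is just arithmetic, so as a minimal sketch (plain Python, not a Ray Serve API; actually applying the result to `num_replicas` would still need a redeploy or an autoscaling hook, which is the gap this issue is about):

```python
import math


def replicas_from_qps(
    total_qps: float,
    qps_per_replica: float,  # the "some_constant": QPS one replica should absorb
    min_replicas: int = 1,
) -> int:
    """num instances = Total_QPS / some_constant, rounded up and floored at a minimum."""
    return max(min_replicas, math.ceil(total_qps / qps_per_replica))
```

For example, at 1000 QPS with each replica sized for 120 QPS this yields 9 replicas, and at zero traffic it holds the configured minimum.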
Currently, Ray Serve uses request-based autoscaling. The user defines `target_ongoing_requests`, the target number of requests they want to be concurrently executing on any replica at all times. When the average number of ongoing requests per replica exceeds or drops below that target, the algorithm correspondingly upscales or downscales the number of replicas.

One alternative we could support is resource-based autoscaling, which would set a target resource utilization and upscale or downscale based on that. This can be useful when resource utilization is a more accurate measure of how overloaded an application is than the number of ongoing requests.
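As a rough sketch of the request-based decision just described (illustrative only; the real implementation adds smoothing, upscale/downscale delays, and the edge-case handling mentioned earlier in this thread):

```python
import math


def request_based_desired(
    current_replicas: int,
    avg_ongoing_per_replica: float,
    target_ongoing_requests: float,
) -> int:
    """Scale the replica count by the ratio of observed to target ongoing requests.

    If each replica averages twice the target load, the count doubles;
    if it averages half the target, the count is roughly halved.
    """
    ratio = avg_ongoing_per_replica / target_ongoing_requests
    return max(1, math.ceil(current_replicas * ratio))
```

So with 3 replicas, a target of 2 ongoing requests, and an observed average of 4 per replica, the decision is to scale up to 6 replicas.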