ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] Calculate autoscaling decisions over whole scale_delay_s period #46497

Open kyle-v6x opened 1 month ago

kyle-v6x commented 1 month ago

Description

Currently, as per https://github.com/ray-project/ray/blob/f4e85381d22b40b45929dde4a5566960ca3f298d/python/ray/serve/autoscaling_policy.py#L85, a scaling decision is only applied if it is consistent over the upscale_delay_s or downscale_delay_s window; however, the final scaling decision is based only on the desired_num_replicas from that single most recent call.

It would be nice to calculate desired_num_replicas over the entire delay period. Even better would be letting users select between the min, max, or average of the desired_num_replicas values calculated over that window.

We can store each call's desired_num_replicas in the policy_state and calculate the min/max/avg once the scaling decision is made.
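
A minimal sketch of how that aggregation could work, assuming the policy appends each check's desired_num_replicas to policy_state (the function names and the decision_history key are illustrative only, not part of the current Ray Serve API):

```python
from statistics import mean

# Sketch only (not the actual Serve policy code): record each check's
# desired_num_replicas in policy_state, then aggregate over the delay
# window once the decision is ready to be applied.

def record_decision(policy_state: dict, desired_num_replicas: int) -> None:
    # "decision_history" is an illustrative key, not an existing field.
    policy_state.setdefault("decision_history", []).append(desired_num_replicas)

def aggregate_decision(policy_state: dict, scaling_function: str = "last") -> int:
    history = policy_state["decision_history"]
    if scaling_function == "last":
        return history[-1]           # current behaviour: only the latest call counts
    if scaling_function == "max":
        return max(history)          # size for the busiest sample in the window
    if scaling_function == "min":
        return min(history)          # size for the quietest sample in the window
    if scaling_function == "average":
        return round(mean(history))  # smooth over the whole window
    raise ValueError(f"Unknown scaling_function: {scaling_function!r}")
```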

Use case

Currently, deployments with highly variable request counts have no clear mechanism to scale appropriately. For example, if downscale_delay_s=60 and there are 6 autoscaler checks during that window with desired_num_replicas => [10, 10, 15, 10, 10, 5], the cluster will scale down to 5. Conversely, if desired_num_replicas => [2, 2, 2, 15, 2, 2], there is a case to be made that the cluster should use 15 in order to accommodate the peak traffic.
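
To make the effect concrete, here is how those two traces resolve under the different aggregation modes (plain Python, independent of Serve):

```python
from statistics import mean

# The two example traces from the use case above.
for trace in ([10, 10, 15, 10, 10, 5], [2, 2, 2, 15, 2, 2]):
    print(
        f"trace={trace} "
        f"last={trace[-1]} "             # current behaviour
        f"average={round(mean(trace))} "
        f"max={max(trace)}"
    )
# trace=[10, 10, 15, 10, 10, 5] last=5 average=10 max=15
# trace=[2, 2, 2, 15, 2, 2] last=2 average=4 max=15
```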

Note: Increasing look_back_period_s slows down all scaling decisions and changes the overall balance of ongoing_requests due to its smoothing effect. upscaling_factor and downscaling_factor can help, but in the example cases above they still completely miss the correct autoscaling decision.

kyle-v6x commented 1 month ago

I'm happy to work on this myself. Ideally this would add a new parameter to the autoscaling_config, e.g. scaling_function: Literal['last', 'average', 'max', 'min'] = 'last'.
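
For illustration, a deployment using the proposed knob might look like this (the scaling_function key is hypothetical and does not exist in today's AutoscalingConfig; the other fields are existing ones):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        "upscale_delay_s": 30,
        "downscale_delay_s": 60,
        # Proposed (hypothetical) knob: how to aggregate the
        # desired_num_replicas values recorded over the delay window.
        "scaling_function": "max",  # one of 'last' | 'average' | 'max' | 'min'
    }
)
class MyModel:
    async def __call__(self, request):
        ...
```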