ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] Calculate autoscaling decisions over whole scale_delay_s period #46497

Open kyle-v6x opened 1 month ago

kyle-v6x commented 1 month ago

Description

Currently, as per https://github.com/ray-project/ray/blob/f4e85381d22b40b45929dde4a5566960ca3f298d/python/ray/serve/autoscaling_policy.py#L85, a scaling decision is only applied if it is consistent over the upscale_delay_s or downscale_delay_s window; however, the final scaling decision is based only on the desired_num_replicas from that single most recent call.

It would be nice to calculate desired_num_replicas over the entire delay period. Even better would be letting users select between the min, max, or average of the desired_num_replicas values calculated over that window.

We can store each call's desired_num_replicas in the policy_state and calculate the min/max/avg once the scaling decision is made.
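
A minimal sketch of how that aggregation could work, assuming the policy appends each check's desired_num_replicas to policy_state (the function names and the decision_history key are illustrative only, not part of the current Ray Serve API):

```python
from statistics import mean

# Sketch only (not the actual Serve policy code): record each check's
# desired_num_replicas in policy_state, then aggregate over the delay
# window once the decision is ready to be applied.

def record_decision(policy_state: dict, desired_num_replicas: int) -> None:
    # "decision_history" is an illustrative key, not an existing field.
    policy_state.setdefault("decision_history", []).append(desired_num_replicas)

def aggregate_decision(policy_state: dict, scaling_function: str = "last") -> int:
    history = policy_state["decision_history"]
    if scaling_function == "last":
        return history[-1]           # current behaviour: only the latest call counts
    if scaling_function == "max":
        return max(history)          # size for the busiest sample in the window
    if scaling_function == "min":
        return min(history)          # size for the quietest sample in the window
    if scaling_function == "average":
        return round(mean(history))  # smooth over the whole window
    raise ValueError(f"Unknown scaling_function: {scaling_function!r}")
```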

Use case

Currently, deployments with highly variable request counts have no clear mechanism to scale appropriately. For example, if downscale_delay_s=60 and there are 6 autoscaler checks during that window with desired_num_replicas => [10, 10, 15, 10, 10, 5], the cluster will scale down to 5. Conversely, if desired_num_replicas => [2, 2, 2, 15, 2, 2], there is a case to be made that the cluster should use 15 in order to accommodate the peak traffic.
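
To make the effect concrete, here is how those two traces resolve under the different aggregation modes (plain Python, independent of Serve):

```python
from statistics import mean

# The two example traces from the use case above.
for trace in ([10, 10, 15, 10, 10, 5], [2, 2, 2, 15, 2, 2]):
    print(
        f"trace={trace} "
        f"last={trace[-1]} "             # current behaviour
        f"average={round(mean(trace))} "
        f"max={max(trace)}"
    )
# trace=[10, 10, 15, 10, 10, 5] last=5 average=10 max=15
# trace=[2, 2, 2, 15, 2, 2] last=2 average=4 max=15
```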

Note: Increasing look_back_period_s slows down all scaling decisions and changes the overall balance of ongoing_requests due to its smoothing effect. upscaling_factor and downscaling_factor can help, but in the example cases above they still completely miss the correct autoscaling decision.

kyle-v6x commented 1 month ago

I'm happy to work on this myself. Ideally this would add a new parameter to the autoscaling_config, e.g. scaling_function: Literal['last', 'average', 'max', 'min'] = 'last'.
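
For illustration, a deployment using the proposed knob might look like this (the scaling_function key is hypothetical and does not exist in today's AutoscalingConfig; the other fields are existing ones):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        "upscale_delay_s": 30,
        "downscale_delay_s": 60,
        # Proposed (hypothetical) knob: how to aggregate the
        # desired_num_replicas values recorded over the delay window.
        "scaling_function": "max",  # one of 'last' | 'average' | 'max' | 'min'
    }
)
class MyModel:
    async def __call__(self, request):
        ...
```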