ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Serve] downscaling_factor logic is broken for no-scaling decisions #46148

Open kyle-v6x opened 2 months ago

kyle-v6x commented 2 months ago

What happened + What you expected to happen

Since 2.10 the downscaling_factor logic has been broken, especially in cases with low traffic. PR https://github.com/ray-project/ray/pull/42394 causes any disagreement between the unscaled and scaled decisions to result in a reduction of exactly 1 replica. This makes the downscaling_factor non-linear and produces downscaling decisions that completely disregard the scaling factor.
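
To illustrate, here is a minimal sketch of the decision arithmetic as described above (not the actual controller code; the `raw_desired` input and the rounding are simplifications):

```python
import math

def downscale_decisions(curr_replicas, raw_desired, downscaling_factor):
    """Simplified sketch (not the Ray code): `raw_desired` is what the load
    metrics ask for before smoothing; the factor scales the downward delta."""
    delta = raw_desired - curr_replicas  # negative when downscaling
    smoothed = curr_replicas + math.ceil(delta * downscaling_factor)

    # Pre-2.10-style behaviour: follow the smoothed value, even if it
    # rounds to "no change".
    old_decision = smoothed

    # Post-#42394 behaviour as described above: if the unscaled metrics want
    # fewer replicas but the smoothed value says "no change", drop 1 anyway.
    new_decision = smoothed
    if raw_desired < curr_replicas and smoothed == curr_replicas:
        new_decision = curr_replicas - 1

    return old_decision, new_decision

# 5 replicas, metrics ask for 4, downscaling_factor=0.15:
# the old behaviour holds at 5, the new behaviour drops to 4 on every downscale window.
print(downscale_decisions(5, 4, 0.15))  # -> (5, 4)
```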

Until this is changed we have had to pin our inference clusters to <2.10.

Note that https://github.com/ray-project/ray/pull/43730 also removes the warning emitted when the downscaling_factor is <0.6. I would recommend reverting https://github.com/ray-project/ray/pull/42394 and re-adding the warning removed in https://github.com/ray-project/ray/pull/43730.

If you would like a PR for this let me know and I'll make the changes.

Versions / Dependencies

ray[serve]>=2.10

Reproduction script

Reproducible in any cluster that downscales. You can see it in the downscaling logic.

Here is a replica graph from a live deployment with:

```
upscaling_factor: 0.7
downscaling_factor: 0.15
downscale_delay_s: 120
```

[image: replica count graph from the live deployment]

Notice that despite the low downscaling_factor, downscaling decisions (-1 replica) happen every 120 seconds. This removes any smoothing effect and results in quick oscillations that cause frequent timeouts for long-running inference tasks.
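
For completeness, here is roughly how such a deployment would be configured (a sketch: the autoscaling_config keys follow the Ray Serve docs, while the model class and the min/max/target values are placeholders):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,              # placeholder
        "max_replicas": 20,             # placeholder
        "target_ongoing_requests": 2,   # placeholder
        "upscaling_factor": 0.7,
        "downscaling_factor": 0.15,
        "downscale_delay_s": 120,
    },
)
class InferenceModel:
    async def __call__(self, request):
        ...  # long-running inference work

app = InferenceModel.bind()
```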

Issue Severity

High: It blocks me from completing my task.

zcin commented 2 months ago

@kyle-v6x Could you describe what behavior you would expect? Setting a small downscaling factor could block downscaling altogether, even if traffic is at 10% of what the current N running replicas are expected to handle, as defined by your own autoscaling config.

The purpose of a small downscaling factor is to prevent the serve controller from making drastic decisions, e.g. dropping from 100 replicas to 10 replicas. However, it should not prevent the number of replicas from dropping from 2 replicas to 1 replica if the current traffic is far below what 2 replicas can handle.
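
To make those two scenarios concrete with the same simplified arithmetic as the sketch in the issue description (the real controller's rounding may differ):

```python
import math

def smoothed_target(curr, raw_desired, factor):
    # Apply the downscaling factor to the downward delta (illustrative only).
    return curr + math.ceil((raw_desired - curr) * factor)

print(smoothed_target(100, 10, 0.15))  # 87: the small factor softens a drastic 100 -> 10 drop
print(smoothed_target(2, 1, 0.15))     # 2: the scaled delta rounds to zero, so 2 -> 1 never happens
```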

kyle-v6x commented 2 months ago

@zcin

So there are a few issues with the merged solution:

  1. It changes the way the scaling factor operates for all values. Say you have a scaling factor of 0.6: the new code drastically changes the behaviour of the autoscaler, as it will always remove one replica regardless of the smoothed value. For a cluster with 5 workers during a low-traffic period, this could cause a 20% mismatch in capacity (see the sketch after this list). There is no easy way to explain this behaviour, and it contradicts the config parameter.
  2. Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.
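
Here is the sketch referenced in point 1, using the same simplified arithmetic as in the issue description (illustrative numbers only):

```python
import math

curr, raw_desired, factor = 5, 4, 0.6

smoothed = curr + math.ceil((raw_desired - curr) * factor)  # ceil(-0.6) == 0 -> 5
forced = curr - 1 if raw_desired < curr and smoothed == curr else smoothed

print(smoothed)  # 5: the smoothed decision says "hold"
print(forced)    # 4: the new behaviour drops a replica anyway, a 20% capacity step
```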

I would argue that the previous behaviour was not an issue, as it followed the expected behaviour of a low multiplicative factor. The new behaviour breaks smoothing for any cluster with a low number of replicas. It would be nice to revert the change, at least until another solution is found that doesn't also break large scaling values (>=0.6).

Regarding potential long-term solutions after reverting the current solution:

From my own testing, the third option seems the most promising.

zcin commented 2 months ago

@kyle-v6x Could you help me better understand your issue?

> For a cluster with 5 workers during a low-traffic period, this could cause a 20% mismatch in capacity.

In this case, Serve will decrease the number of replicas only if traffic is consistently less than 80% of what your 5 workers are expected to handle. If you don't want the number of replicas to decrease in this case, you can tune target_ongoing_requests (if you'd rather have each worker handle less traffic), min_replicas (if you want each worker to handle the same amount of traffic but don't want the number of workers to drop below 5), or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation).
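
For reference, a sketch of where those knobs live in the autoscaling config (illustrative values only, not recommendations):

```python
# Illustrative values only; pick whichever knob matches the behaviour you want.
autoscaling_config = {
    "min_replicas": 5,             # hard floor: never drop below 5 workers
    "max_replicas": 20,
    "target_ongoing_requests": 2,  # lower it if each worker should handle less traffic
    "downscale_delay_s": 300,      # larger value: fewer, slower downscale decisions
}
```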

> Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.

Although this was a side effect of low scaling factors before, I think this is not the original intended purpose. Our motivation for introducing the scaling factor was to prevent Serve from introducing too big of a "delta" in the number of replicas running in the cluster, to prevent oscillation in the replica graph. If "soft min" is your goal (i.e. only decrease replicas if traffic is zero, but otherwise keep some positive minimum of replicas), in my opinion this is more of a feature request. Do you want to file a GitHub issue for this and assign it to the Serve team?

kyle-v6x commented 2 months ago

Of course.

> In this case, Serve will decrease the number of replicas only if traffic is consistently less than 80% of what your 5 workers are expected to handle.

With the change made, it will decrease the workers by one on the first downscale_delay check, completely bypassing the scaling factor. As the warning previously noted, the scaling factor only caused issues below ~0.6, yet the new behaviour applies to all ranges of scaling factors, even those above 0.6. So the new logic will always result in 1 fewer worker than before the change. With 100 workers that's marginal, but at something like 5 workers it's a 20% error. This change fundamentally breaks the downscaling_factor for clusters with a low number of replicas, even when the scaling factor is greater than 0.6.

> or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation)

In this case, downscale_delay is not a solution. As per the logic here, the decision to downscale is only enforced if it has occurred consistently; however, the decision_num_replicas is only calculated from the last decision. For a deployment with spikes in traffic, the downscale_delay therefore only results in large, delayed oscillations. This is exactly why the downscaling factor has been so useful. Alternatively, you could store an average or minimum decision_num_replicas over the downscale_delay_s window, but that is beyond the scope of this issue.
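
A rough sketch of that alternative (a hypothetical helper, not existing Ray code; whether to aggregate by average or minimum is an open design choice):

```python
import math
import time
from collections import deque


class DownscaleDecisionWindow:
    """Hypothetical helper: remember each per-check decision_num_replicas over
    the downscale delay window and act on an aggregate of them (here the
    average, rounded up), rather than only on the most recent decision."""

    def __init__(self, downscale_delay_s):
        self.delay_s = downscale_delay_s
        self.decisions = deque()  # (timestamp, decision_num_replicas)

    def candidate(self, decision_num_replicas, now=None):
        now = time.time() if now is None else now
        self.decisions.append((now, decision_num_replicas))
        # Drop decisions that have fallen out of the delay window.
        while self.decisions and now - self.decisions[0][0] > self.delay_s:
            self.decisions.popleft()
        # Aggregate the window so a brief dip in traffic can't trigger a deep drop.
        values = [d for _, d in self.decisions]
        return math.ceil(sum(values) / len(values))
```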

> Although this was a side effect of low scaling factors before, I think this is not the original intended purpose.

I understand this was not the intended purpose, but it is a natural result of applying a scaler to the decision. If you would like to remove this behaviour, something like what I suggested above does so without breaking scaling at low replica counts.

I notice this issue was created internally, so perhaps it was causing issues with your own deployments?

kyle-v6x commented 2 months ago

One more note:

Since the new changes only apply a maximum of -1 replica per scaling decision, I can try tuning a larger downscale_delay_s to handle low-replica scaling, but this would come at the cost of less agile downscaling during peak time, and seems like a workaround rather than a solution.

kyle-v6x commented 1 month ago

Although I still think this change should be reverted, the feature suggestion here would provide a valid solution/workaround as well.