kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Add easing to balancer #6119

Open · abursavich opened 12 months ago

abursavich commented 12 months ago

Which component are you using?:

Balancer

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

When the balancer distributes replicas, it doesn't ease into the new distribution. It assigns replicas immediately.

This applies to proportional distribution too, but it's more obvious with priority distribution, so I'll use that as an example.

Let's say you have the cluster autoscaler and a workload of 100 replicas spread across two deployments, A and B, each using a different node type, with A prioritized over B. Assume there are no capacity issues, so A has all 100 replicas and B has 0. For some reason you decide to switch your priorities to B over A. When the balancer notices this change, it will set the replicas of A to 0 and B to 100 without any controlled transition. All of the A replicas will be deleted without waiting for any B replicas to become available, which may require waiting for the cluster autoscaler to kick in and provision B nodes. If a FallbackPolicy is configured, it might kick in before the B replicas are available and assign A replicas again, but the previous A nodes may already be gone by then and you'll still have had an outage in the meantime.
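
To make that concrete, here's a rough sketch of how a priority-style assignment behaves when the order flips. The types and function below are made-up stand-ins for illustration, not the balancer's actual code:

```go
package main

import "fmt"

// target is a made-up stand-in for a balancer target with a replica cap.
type target struct {
	name string
	max  int32
}

// assignByPriority fills targets in priority order: the first target gets as
// many replicas as it can take, the next gets the remainder, and so on.
func assignByPriority(total int32, order []target) map[string]int32 {
	out := map[string]int32{}
	remaining := total
	for _, t := range order {
		n := remaining
		if t.max < n {
			n = t.max
		}
		out[t.name] = n
		remaining -= n
	}
	return out
}

func main() {
	a := target{name: "deployment-a", max: 100}
	b := target{name: "deployment-b", max: 100}

	// Priority A over B: map[deployment-a:100 deployment-b:0]
	fmt.Println(assignByPriority(100, []target{a, b}))

	// Flip the priority to B over A: the very next reconciliation yields
	// map[deployment-a:0 deployment-b:100], even though no B replicas
	// are available yet.
	fmt.Println(assignByPriority(100, []target{b, a}))
}
```

The jump from A=100/B=0 to A=0/B=100 happens in a single step; that's the gap this request is about.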

Describe the solution you'd like.:

I haven't fully thought this through, but I think a mechanism similar to a rolling deployment update, with a maxSurge and maxUnavailable, would be appropriate, with the caveat that the targets may have their own things going on that affect their available replicas beyond the balancer's control (e.g. a deployment rollout).
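
Here's a rough sketch of what a single maxSurge/maxUnavailable-bounded step could look like. All of the names and bookkeeping below (easeStep, the budgets, the maps) are invented for illustration; it's the shape of the clamping, not a proposed API:

```go
package easing

// min32 returns the smaller of two int32 values.
func min32(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

// easeStep is a sketch of one easing step. It moves the current per-target
// assignment toward the desired one, but:
//   - scale-downs are capped by maxUnavailable, counting only replicas that
//     are currently available, and
//   - scale-ups are capped by maxSurge, since new replicas start out
//     unavailable and may have to wait for new nodes.
// "available" is whatever the targets report as ready; the balancer can't
// control it directly (e.g. a deployment may be mid-rollout). Repeated
// reconciliations converge on the desired assignment. A real implementation
// would also spend the budgets in a deterministic order (e.g. by priority);
// map iteration order here is arbitrary.
func easeStep(current, desired, available map[string]int32, maxSurge, maxUnavailable int32) map[string]int32 {
	next := make(map[string]int32, len(desired))
	downBudget := maxUnavailable
	upBudget := maxSurge

	for name, want := range desired {
		cur := current[name]
		next[name] = cur

		if want < cur {
			// Dropping replicas that aren't available anyway is free;
			// dropping available ones spends the unavailability budget.
			unready := cur - min32(cur, available[name])
			drop := min32(cur-want, unready+downBudget)
			downBudget -= drop - min32(drop, unready)
			next[name] = cur - drop
		} else if want > cur {
			grow := min32(want-cur, upBudget)
			upBudget -= grow
			next[name] = cur + grow
		}
	}
	return next
}
```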

Limiting scale-down is an easier problem than limiting scale-up. You might want at least one pod pending in each target that's under its desired available replicas, as a probe (assuming the problem is something like nodes being out of quota or stock). But maybe there's a problem specific to that pod or node, and if you tried to schedule more, the others would come up.
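
Continuing the sketch above, the probe idea could gate scale-ups something like this (again, the names and logic are hypothetical):

```go
// probeScaleUp is a hypothetical sketch of the "one pending pod as a probe"
// idea: a target that currently has no available replicas only gets a single
// new replica; once that probe becomes available, the cap is lifted and the
// normal easing budget takes over.
func probeScaleUp(current, desired, available map[string]int32) map[string]int32 {
	next := make(map[string]int32, len(desired))
	for name, want := range desired {
		cur := current[name]
		if want > cur && available[name] == 0 {
			// Nothing is serving on this target yet: keep (at most) one
			// probe replica pending and see whether it can actually
			// schedule and start before piling on more.
			if cur < 1 {
				next[name] = 1
			} else {
				next[name] = cur
			}
			continue
		}
		next[name] = want
	}
	return next
}
```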

jbartosik commented 12 months ago

@mwielgus FYI

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

abursavich commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

abursavich commented 4 months ago

/remove-lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

abursavich commented 1 month ago

/remove-lifecycle stale