kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

clusterapi: Cluster Autoscaler destroys healthy nodes when processing old unregistered nodes when multiple machinesets exist #5465

Open cnmcavoy opened 1 year ago

cnmcavoy commented 1 year ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?:

Component version: v1.2.4

What k8s version are you using (kubectl version)?:

Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.3.428+425ff1c8e2d298-dirty", GitCommit:"425ff1c8e2d29874926abb7da37641cf785c4ff3", GitTreeState:"dirty", BuildDate:"2022-03-10T21:18:29Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7", GitCommit:"e6f35974b08862a23e7f4aad8e5d7f7f2de26c15", GitTreeState:"clean", BuildDate:"2022-10-12T10:50:21Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS

What did you expect to happen?: The cluster autoscaler should manage machine deployments and understand how to handle scale-up and scale-down behavior when a machine deployment is in the midst of a rollout. During a rollout, multiple machinesets exist with non-zero replica counts.

When the cluster autoscaler is managing machine deployments that are undergoing a rollout (they own multiple machinesets with non-zero replicas) and the provisioning nodes are not able to join the cluster in a timely manner, the cluster autoscaler detects these as "long unregistered" nodes. I expected the cluster autoscaler to correctly remove these long unregistered nodes, stop considering them for pod placement, and scale up another node group.

What happened instead?: When the cluster autoscaler is managing machine deployments that are undergoing a rollout (they own multiple machinesets with non-zero replicas) and the provisioning nodes are not able to join the cluster in a timely manner, the cluster autoscaler detects these as "long unregistered" nodes. Once it determines that these nodes have not joined the cluster within the max provision time, the cluster autoscaler marks the corresponding machines for deletion by setting an annotation and decreases the replica count of the machine deployment backing the node group.
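For illustration, the state after that step looks roughly like the sketch below. The object names are made up, and the exact annotation key depends on the autoscaler and Cluster API versions in play (recent Cluster API releases honor a delete-machine annotation on the Machine), so treat this as an assumption rather than a literal dump from our cluster:

```yaml
# Hypothetical state after the max provision time is exceeded (names assumed).
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: md-workers-new-abcde                  # stuck machine from the NEW machineset
  annotations:
    cluster.x-k8s.io/delete-machine: "true"   # assumed key; marks this machine as the preferred deletion candidate
status:
  phase: Provisioned                          # infrastructure exists, but no Node ever registered
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-workers
spec:
  replicas: 1                                 # decremented from 2 by the cluster autoscaler
```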

Scaling the machine deployment at this point is tricky: increasing the replica count raises the desired replicas of the newest machineset, while decreasing it shrinks the desired replicas of the oldest machineset. I think this asymmetry is the core confusion for the cluster autoscaler.

So when the replica count for the machine deployment is decreased, the CAPI controller takes over. Because multiple machinesets exist with non-zero replicas, the oldest machineset loses a replica. That replica is not the machine that was stuck provisioning; the stuck machine belongs to the new machineset that never finished coming up. The result is that a healthy node in the old machineset is scaled down, in addition to the stuck provisioning machine that was annotated.
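To make the replica math concrete, here is a hypothetical walk-through with a 2-replica machine deployment mid-rollout; the names and counts are illustrative, not taken from our cluster:

```yaml
# Hypothetical replica counts while the rollout is in progress:
#   MachineDeployment md-workers   spec.replicas: 2
#   MachineSet md-workers-old      replicas: 2   # two healthy, registered nodes
#   MachineSet md-workers-new      replicas: 1   # one machine stuck "provisioned", never registers
#
# The autoscaler annotates the stuck machine and decrements the machine
# deployment: spec.replicas 2 -> 1. The MachineDeployment controller then
# reconciles by shrinking the oldest machineset first:
#   MachineSet md-workers-old      replicas: 1   # a healthy node is deleted here
#   MachineSet md-workers-new      replicas: 1   # the annotated stuck machine is also deleted (and may be recreated)
#
# Net effect: a healthy node is lost while the pending pods still have
# nowhere to schedule.
```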

How to reproduce it (as minimally and precisely as possible):

  1. Create a machine deployment with 2 replicas, backed by an AWSMachineTemplate and a KubeadmConfigTemplate for the infrastructure, plus a no-op workload that occupies the resources of each of the 2 nodes, in a cluster with the cluster autoscaler configured
  2. Wait for the node group to come up with 2 healthy nodes and the no-op workload to become ready
  3. Configure a new KubeadmConfigTemplate whose machines won't join the cluster (e.g. preKubeadmCommands: ['sleep infinity']) and update the machine deployment to begin a rollout
  4. Observe that the cluster autoscaler marks the stuck "provisioned" machines for deletion with the annotation and scales the node group down even though there is no place for the pods to go.

A brief gist which shows the sort of configuration we used to reproduce this: https://gist.github.com/cnmcavoy/364cb3020017c22fc63c75540cf58923
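For reference, a minimal sketch of the kind of KubeadmConfigTemplate swap used in step 3; the name is illustrative, and the gist above has the actual manifests:

```yaml
# Hypothetical "broken" bootstrap template: machines created from it never run
# kubeadm join, so they never register as Nodes and stay "long unregistered".
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: md-workers-never-joins        # illustrative name
spec:
  template:
    spec:
      preKubeadmCommands:
        - sleep infinity              # blocks forever before kubeadm runs
```

Pointing the machine deployment's spec.template.spec.bootstrap.configRef at this template starts the rollout that reproduces the behavior above.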

Anything else we need to know?: We first encountered this while running a fork of the v1.26 cluster autoscaler, which we run to use the scale-from-zero feature set. We re-tested on the official tagged v1.24 release and reproduced it there as well.

elmiko commented 1 year ago

great report, thank you for the depth of explanation.

The cluster autoscaler should manage machine deployments and understand how to handle scale-up and scale-down behavior when a machine deployment is in the midst of a rollout.

the capi autoscaler provider is very dumb about machinesets and machinedeployments, essentially treating them the same. we could probably add more logic to help distinguish the behaviors between the two types.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 1 year ago

afaik this is still a work in progress

/remove-lifecycle stale

wmgroot commented 1 year ago

This is stalled waiting for maintainer input.

At my company we think

We've been using our fork of CA + CAPI to address this issue internally since February 2023. Given the lack of response, we're planning to seriously evaluate a switch to Karpenter instead of continuing to use CA.

elmiko commented 1 year ago

@wmgroot thanks for raising the concerns again. i tend to agree that this is a potential problem waiting to hit users, especially users with large clusters.

i will raise the issue about pr #5538 at an upcoming SIG meeting; it seems like we are just waiting on a comment from @MaciekPytel. i'm generally ok with the solution there.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 7 months ago

the related PR got closed because we took too long to approve it. i think this bug is worth keeping open though; perhaps someone can pick up @wmgroot's work and carry it through.

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 3 months ago

not sure where this is at, but i do think it's important

/remove-lifecycle stale

songminglong commented 3 weeks ago

In fact, this is a problem that has always existed in the cluster autoscaler, not only in the cluster-api provider but also in the AWS/Alicloud providers, etc. The cluster autoscaler does not manage instances directly; it uses the nodegroup as an intermediary and operates on instances indirectly by adjusting nodegroup replicas, which can cause a normal, healthy instance to be scaled down instead of the intended one.

e.g. https://github.com/kubernetes/autoscaler/issues/6795

elmiko commented 2 weeks ago

thank you for the comment and link @songminglong !