Open cnmcavoy opened 1 year ago
great report, thank you for the depth of explanation.
The cluster autoscaler should manage machine deployments and understand how to manage scale up and scale down behavior when a machine deployment is in the midst of a rollout.
the capi autoscaler provider is very dumb about machinesets and machinedeployments, essentially treating them the same. we could probably add more logic to help distinguish the behaviors between the two types.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
afaik this is still a work in progress
/remove-lifecycle stale
This is stalled waiting for maintainer input.
At my company, we've been using our fork of CA + CAPI to address this issue internally since February 2023. Given the lack of response, we're planning to seriously evaluate a switch to Karpenter instead of continuing to use CA.
@wmgroot thanks for raising the concerns again. i tend to agree that this is a potential problem waiting to hit users, especially users with large clusters.
i will raise the issue about pr #5538 at an upcoming SIG meeting, it seems like we are just waiting on a comment from @MaciekPytel . i'm generally ok with the solution there.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
the related PR got closed as we took too long to approve it. i think this bug is worth keeping open though, perhaps someone can pick up @wmgroot 's work and carry it through.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
not sure where this is at, but i do think it's important
/remove-lifecycle stale
In fact, this problem has always existed in the cluster autoscaler, not only in the cluster-api provider but also in the AWS/Alicloud providers, etc. The cluster autoscaler does not manage instances directly; it uses the nodegroup as an intermediary and operates on instances indirectly by adjusting nodegroup replicas, which can cause a healthy instance to be scaled down instead of the intended one.
thank you for the comment and link @songminglong !
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version: v1.24
What k8s version are you using (kubectl version)?:
What environment is this in?: AWS
What did you expect to happen?: The cluster autoscaler should manage machine deployments and understand how to manage scale up and scale down behavior when a machine deployment is in the midst of a rollout. When a machine deployment is in a rollout, multiple machinesets exist with non-zero replica counts.
When the cluster autoscaler is managing machine deployments that are undergoing a rollout (they own multiple machinesets that have non-zero replicas) and the provisioning nodes are not able to join the cluster in a timely manner, the cluster autoscaler detects these as "long unregistered" nodes. I expected the cluster autoscaler to correctly remove these long unregistered nodes, stop considering them for pod placement, and scale up another node group.
What happened instead?: What we observed was that when the cluster autoscaler is managing machine deployments that are undergoing a rollout (they own multiple machinesets that have non-zero replicas) and the provisioning nodes are not able to join the cluster in a timely manner, the cluster autoscaler detects these as "long unregistered" nodes. When it determines that these nodes have not joined the cluster after the max provision time has been exceeded, the cluster autoscaler marks these machines for deletion by setting an annotation and decreases the replica count of the machine deployment for the nodegroup.
Scaling the machine deployment at this point is tricky: scaling up the replica count causes the newest machineset's desired replicas to increase, while scaling down causes the oldest machineset's desired replicas to shrink. I think this is the core confusion for the cluster autoscaler.
So when the replica count for the machine deployment is decreased, the CAPI controller takes over. When the machine deployment replica count is decreased and multiple machinesets exist with non-zero replicas, the oldest machineset has a replica removed. This is not the machine that was stuck provisioning, because that machine belongs to the new machineset that did not successfully finish coming up. So a healthy node in the old machineset is scaled down, in addition to the stuck provisioning machine that was annotated.
How to reproduce it (as minimally and precisely as possible): Configure the machines so they cannot join the cluster (e.g. preKubeadmCommands: ['sleep infinity']) and update the machine deployment to begin a rollout. A brief gist which shows the sort of configuration we used to reproduce this: https://gist.github.com/cnmcavoy/364cb3020017c22fc63c75540cf58923
Anything else we need to know?: We first encountered this running a fork of the v1.26 cluster autoscaler, which we were running to use the scale from zero feature set. We re-tested on the official tagged v1.24 release and reproduced it there as well.