icelava opened 9 months ago
/area provider/aws
An additional note on the workaround: we essentially restart the autoscaler deployment as the last step of our automation workflow.
kubectl rollout restart deployment/aws-cluster-autoscaler -n kube-system
I was experiencing the same behavior when the Node Termination Handler marked the last node in the node pool with its taint and the ASG reached 0 size. After this, Cluster Autoscaler is unable to scale it back up because of:
2024-01-24T08:58:11+03:00 I0124 05:58:11.046721 1 orchestrator.go:546] Pod gitlab-runner/runner-eucgy1fpg-project-517-concurrent-1-f41lk8jv can't be scheduled on ciq-ci-gitlab-agents2023101009545070710000000e, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"gitlab-agent", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"aws-node-termination-handler/asg-lifecycle-termination", Value:"asg-lifecycle-term-65653465366439612d303362612d373737302d353735", Effect:"NoExecute", TimeAdded:<nil>}}
Restarting the CA pod helps.
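To confirm it is this taint that blocks scheduling, the taints on each node can be dumped first (just a quick diagnostic sketch; any output format works):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'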
Confirming reproduction of this issue on a production cluster: helm chart 9.25.0, cluster-autoscaler 1.24.0.
I have the same issue if I create mixed instance types in the same node group. After keeping only one instance type per node group, the taint feature works when the node group scales up from 0, with the config below.
desiredCapacity: 0
minSize: 0
maxSize: 10
tags:
  k8s.io/cluster-autoscaler/enabled: "true"
  k8s.io/cluster-autoscaler/blocknode: "owned"
  k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name: "true:NoSchedule"
taints:
  - key: xxx.com/name
    value: true
    effect: NoSchedule
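Since CA only learns about the taint of a scaled-to-zero group from the node-template tag on the ASG, it may also be worth verifying that the tag actually reached the ASG and matches the node group's taint exactly (a sketch; <asg-name> is a placeholder):
aws autoscaling describe-tags --filters "Name=auto-scaling-group,Values=<asg-name>" --query "Tags[?contains(Key, 'node-template')]"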
Version 1.30.1 has the same issue. The node was terminated last night and the ASG has desired capacity 0, but CA still "sees" the node in the cluster and doesn't scale up the ASG:
{"ts":1719909365710.1602,"caller":"orchestrator/orchestrator.go:565","msg":"Pod jenkins-aqa/aqa-build-agent-235-qjzfw-cw6lm can't be scheduled on ciq-ci-jenkins-aqa-tests-agents20230808050032196800000003, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:\"aws-node-termination-handler/asg-lifecycle-termination\", Value:\"asg-lifecycle-term-34363838643263332d303665352d326261662d333838\", Effect:\"NoExecute\", TimeAdded:<nil>}}","v":2}
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version: helm chart 9.26.0, cluster-autoscaler 1.28.2
What k8s version are you using (kubectl version)?: 1.28
What environment is this in?: AWS EKS; managed node groups
What did you expect to happen?: The untainted node group should be able to launch nodes and schedule pods again.
What happened instead?: The autoscaler still thinks the node group has the long-gone tainted node and thus won't launch another node instance, even though the taint has been removed from the node group.
How to reproduce it (as minimally and precisely as possible):
Taint node group to evict pods
aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints addOrUpdateTaints={key=cost,value=true,effect=NO_EXECUTE}
Pods get evicted and nodes eventually terminated to zero count.
Untaint node group to re-host pods.
aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints removeTaints={key=cost,value=true,effect=NO_EXECUTE}
The pod remains in a perpetual Pending state, as per the events above, with the autoscaler thinking the old tainted node is still around and refusing to launch another node in its place (without the taint).
Anything else we need to know?: The workaround is to deliberately kill the autoscaler pod, so that a replacement autoscaler pod with no memory of the past can correctly auto-discover the node group and launch a node to host the pod, as per the events above.
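A minimal sketch of that workaround (the pod name is a placeholder; the rollout restart command in the earlier comment achieves the same effect):
kubectl -n kube-system get pods | grep cluster-autoscaler
kubectl -n kube-system delete pod <autoscaler-pod-name>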
It seems the autoscaler hangs on to outdated historical data about terminated nodes in the node group ("1 node(s) had untolerated taint"). It should describe the node groups afresh to determine where to launch a new node.