kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler still thinks node group has taints (after untaint) and refuses to scale back up from zero count #6452

Open icelava opened 9 months ago

icelava commented 9 months ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?: cluster-autoscaler

Component version: Helm chart 9.26.0, cluster-autoscaler 1.28.2

What k8s version are you using (kubectl version)?: 1.28

kubectl version Output
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5-eks-5e0fdde

What environment is this in?: AWS EKS; managed node groups

What did you expect to happen?: The untainted node group should be able to launch nodes and schedule pods again.

What happened instead?: The autoscaler still thinks the node group contains the long-gone tainted node and thus won't launch another node instance, even though the taint has been removed from the node group.

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Normal   NotTriggerScaleUp  22m (x385 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}
  Normal   NotTriggerScaleUp  12m (x179 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector
  Normal   NotTriggerScaleUp  7m11s (x1152 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector, 3 max node group size reached
  Warning  FailedScheduling   5m (x161 over 19h)      default-scheduler   0/8 nodes are available: 8 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  2m10s (x2508 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}, 3 max node group size reached
  Normal   TriggeredScaleUp   50s                     cluster-autoscaler  pod triggered scale-up: [{eks-ZeroNodes-cec688d4-02dd-c7f5-6bd1-be1a14735f61 0->1 (max: 1)}]
  Normal   Scheduled          3s                      default-scheduler   Successfully assigned default/nginx-test to ip-10-0-76-17.ap-southeast-1.compute.internal
  Normal   Pulling            2s                      kubelet             Pulling image "nginx:latest"
How to reproduce it (as minimally and precisely as possible):

Taint node group to evict pods

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints addOrUpdateTaints={key=cost,value=true,effect=NO_EXECUTE}

Pods get evicted and nodes eventually terminated to zero count.
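
A quick way to confirm the group has really drained to zero (assuming the standard eks.amazonaws.com/nodegroup label that EKS applies to managed node group nodes, and the names from the commands above):

# should return "No resources found" once the group is at zero
kubectl get nodes -l eks.amazonaws.com/nodegroup=ZeroNodes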

Untaint node group to re-host pods.

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints removeTaints={key=cost,value=true,effect=NO_EXECUTE}
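
At this point the taint is gone as far as the EKS API is concerned, which can be sanity-checked with the same cluster and node group names as above:

# should return null or an empty list after the taint is removed
aws eks describe-nodegroup --cluster-name eks-cluster --nodegroup-name ZeroNodes --query 'nodegroup.taints'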

The pod remains in a perpetual Pending state, as per the events above: the autoscaler thinks the old tainted node is still around and refuses to launch another node in its place (without the taint).
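
For completeness, the stuck workload is roughly the following shape. The eks.amazonaws.com/nodegroup selector is an assumption based on the "node affinity/selector" messages in the events, and the pod deliberately has no toleration for cost=true, which is why the NO_EXECUTE taint evicts it and why the stale taint in the autoscaler's cached template blocks the scale-up:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
spec:
  # Pin the pod to the ZeroNodes managed node group. No toleration for the
  # cost=true taint is set on purpose.
  nodeSelector:
    eks.amazonaws.com/nodegroup: ZeroNodes
  containers:
    - name: nginx
      image: nginx:latest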

Anything else we need to know?: The workaround is to deliberately kill the autoscaler pod, so that a replacement autoscaler pod with no memory of the past correctly auto-discovers the node group and launches a node to host the pod, as shown in the final events above.

It seems the autoscaler hangs on to outdated historical data about terminated nodes in the node group ("1 node(s) had untolerated taint"). It should describe the node groups afresh to determine where it can launch a new node.
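
For anyone else hitting this before restarting: the autoscaler writes a status ConfigMap by default (into the namespace it runs in), which shows per-node-group health and registered vs. cloud-provider target counts. It does not expose the cached node template or its taints, but it is a quick way to see what the autoscaler currently tracks without restarting it. The names below assume the Helm chart defaults used here.

# dump the autoscaler's self-reported status
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml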

Shubham82 commented 9 months ago

/area provider/aws

icelava commented 9 months ago

Additional note on the workaround. We essentially restart the autoscaler deployment as the last step of our automation workflow.

kubectl rollout restart deployment/aws-cluster-autoscaler -n kube-system
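
If the restart is part of an automation workflow, it may also help to block until the replacement pod is ready before scheduling anything, e.g.:

# wait for the restarted autoscaler to become available (timeout is arbitrary)
kubectl rollout status deployment/aws-cluster-autoscaler -n kube-system --timeout=120s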

ivan-morhun commented 9 months ago

I was experiencing the same behavior when the Node Termination Handler marked the last node in the node pool with its taint and the ASG reached size 0. After this the Cluster Autoscaler is not able to scale it up because of:

2024-01-24T08:58:11+03:00 I0124 05:58:11.046721       1 orchestrator.go:546] Pod gitlab-runner/runner-eucgy1fpg-project-517-concurrent-1-f41lk8jv can't be scheduled on ciq-ci-gitlab-agents2023101009545070710000000e, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"gitlab-agent", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"aws-node-termination-handler/asg-lifecycle-termination", Value:"asg-lifecycle-term-65653465366439612d303362612d373737302d353735", Effect:"NoExecute", TimeAdded:<nil>}}

Restarting the CA pod helps.

linxcat commented 8 months ago

Confirming replication of this issue on a production cluster: Helm chart 9.25.0, version 1.24.0.

nooperpudd commented 7 months ago

I have the same issue when I create mixed instance types in the same node group. Once I kept only one instance type per node group, the taint feature worked and the group could scale up from 0.

desiredCapacity: 0
minSize: 0
maxSize: 10 
tags:
  k8s.io/cluster-autoscaler/enabled: "true"
  k8s.io/cluster-autoscaler/blocknode: "owned"
  k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name: "true:NoSchedule"

taints:
  - key: xxx.com/name
    value: "true"
    effect: NoSchedule
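
For what it's worth, with the node-template taint tag above the autoscaler assumes any node it would create in this group carries xxx.com/name=true:NoSchedule, so only pods that tolerate that taint can trigger the scale-up from 0. A minimal sketch of such a pod (name and image are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: blocknode-test
spec:
  # Must tolerate the taint declared via the node-template tag, otherwise the
  # autoscaler reports "node(s) had untolerated taint" and skips this group.
  tolerations:
    - key: xxx.com/name
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9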

ivan-morhun commented 4 months ago

Version 1.30.1 has the same issue. The node was terminated last night and the ASG has a desired capacity of 0, but CA still "sees" the node in the cluster and doesn't scale up the ASG:

{"ts":1719909365710.1602,"caller":"orchestrator/orchestrator.go:565","msg":"Pod jenkins-aqa/aqa-build-agent-235-qjzfw-cw6lm can't be scheduled on ciq-ci-jenkins-aqa-tests-agents20230808050032196800000003, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:\"aws-node-termination-handler/asg-lifecycle-termination\", Value:\"asg-lifecycle-term-34363838643263332d303665352d326261662d333838\", Effect:\"NoExecute\", TimeAdded:<nil>}}","v":2}

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

ivan-morhun commented 1 week ago

/remove-lifecycle rotten

Shubham82 commented 1 week ago

/lifecycle frozen