Open · makzzz1986 opened this issue 7 months ago
Looks like shrinking is OK because of: https://github.com/kubernetes/autoscaler/blob/3fd892a37b50a885eaceaa9619a1a3e153548dc9/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L338 but I can't explain how shrinking the desired capacity by the number of unregistered nodes brings the desired capacity to almost zero. Can it be that Cluster-Autoscaler requests removal of the same instance a few times and drops the desired capacity by more than one? 🤔
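To illustrate what I suspect, here is a minimal sketch (not the actual provider code; instance IDs and sizes are made up): if each DeleteInstances call decrements a locally cached desired capacity once per instance, and the same long-unregistered instances are submitted again in a later loop before the cache or the ASG catches up, the capacity is lowered more than once for a single node.

```go
// Minimal sketch of the suspected failure mode, not the real AWS provider code.
package main

import "fmt"

type asg struct {
	name    string
	curSize int // locally cached desired capacity
}

// deleteInstances mimics the decrement around auto_scaling_groups.go#L338:
// one unit of desired capacity is removed per instance in the request.
func (a *asg) deleteInstances(ids []string) {
	for _, id := range ids {
		fmt.Printf("%s: terminating %s, desired capacity %d -> %d\n", a.name, id, a.curSize, a.curSize-1)
		a.curSize--
	}
}

func main() {
	a := &asg{name: "example-asg", curSize: 18} // hypothetical ASG and size
	// Loop 1: four long-unregistered nodes are removed.
	a.deleteInstances([]string{"i-aaa", "i-bbb", "i-ccc", "i-ddd"})
	// Loop 2: the same nodes are still reported as unregistered, so they are
	// submitted for removal again and the cached desired capacity keeps dropping.
	a.deleteInstances([]string{"i-aaa", "i-bbb", "i-ccc", "i-ddd"})
	fmt.Println("final cached desired capacity:", a.curSize) // 10 instead of 14
}
```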
/area cluster-autoscaler
/cc
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Which component are you using?: cluster-autoscaler
What version of the component are you using?: v2.7.2
Component version: Kubernetes server v1.26.14
What environment is this in?: AWS EKS
What did you expect to happen?: When Cluster-Autoscaler removes an unregistered node, it should not decrease the desired capacity of the AWS Auto Scaling group.
What happened instead?: When a broken kubelet configuration is introduced and an EC2 instance can't register as a node, Cluster-Autoscaler terminates the instance and decreases the desired capacity of the Auto Scaling group by the number of terminated instances. This causes a rapid drop of instances and healthy nodes towards zero, especially if the Auto Scaling group removes the oldest instances during shrinking.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: We experienced it a few times; some logs below. [Chart of the Auto Scaling group monitoring while reproducing the behavior]
ASG activity log, with entries like these appearing repeatedly:
At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17.
At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15.
At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15.
At 2024-05-06T13:38:47Z instance i-08259e65446b2711a was selected for termination.
At 2024-05-06T13:38:47Z instance i-03a8da6f7f73f80f7 was selected for termination.
At 2024-05-06T13:38:47Z instance i-0946a707c1d4d62f5 was selected for termination.
Then it repeats, dropping the desired capacity very fast every few seconds.
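To make sense of the "taken out of service in response to a difference between desired and actual capacity" entries, here is a toy model (my assumption about the ASG's reconciliation under an oldest-instance-style termination policy, not AWS's actual algorithm; instance IDs and counts are made up): once the desired capacity has been pushed below the number of in-service instances, the ASG terminates instances on its own, and under that policy it picks long-running healthy nodes rather than the broken unregistered ones.

```go
// Toy model of the ASG reconciling desired vs. actual capacity (an assumption,
// not AWS code).
package main

import (
	"fmt"
	"sort"
)

type instance struct {
	id        string
	launchSeq int  // lower = older
	healthy   bool // registered as a Kubernetes node
}

// reconcile terminates the oldest instances until the fleet matches desired.
func reconcile(fleet []instance, desired int) []instance {
	sort.Slice(fleet, func(i, j int) bool { return fleet[i].launchSeq < fleet[j].launchSeq })
	for len(fleet) > desired {
		victim := fleet[0]
		fmt.Printf("ASG terminating %s (healthy=%v), shrinking from %d to %d\n",
			victim.id, victim.healthy, len(fleet), len(fleet)-1)
		fleet = fleet[1:]
	}
	return fleet
}

func main() {
	var fleet []instance
	for i := 0; i < 14; i++ { // older, healthy nodes (hypothetical IDs)
		fleet = append(fleet, instance{id: fmt.Sprintf("i-healthy-%02d", i), launchSeq: i, healthy: true})
	}
	for i := 0; i < 4; i++ { // newer instances that never registered
		fleet = append(fleet, instance{id: fmt.Sprintf("i-broken-%02d", i), launchSeq: 100 + i, healthy: false})
	}
	// Cluster-Autoscaler decremented the desired capacity to 14 while 18
	// instances are still in service, so the ASG shrinks the fleet itself and
	// the four oldest, healthy nodes are the ones taken out of service.
	reconcile(fleet, 14)
}
```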
I could find in the source code that decreasing the desired capacity is logged, but I could not find it in the Cluster-Autoscaler logs, only logs of instance removal:
I0506 13:38:32.981481 1 static_autoscaler.go:289] Starting main loop
I0506 13:38:32.981659 1 auto_scaling_groups.go:367] Regenerating instance to ASG map for ASG names: []
I0506 13:38:32.981674 1 auto_scaling_groups.go:374] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/eks-01-shared-staging: k8s.io/cluster-autoscaler/enabled:]
I0506 13:38:33.089424 1 auto_scaling_groups.go:140] Updating ASG ASG_NAME_IS_REMOVED
I0506 13:38:33.089603 1 aws_wrapper.go:693] 0 launch configurations to query
I0506 13:38:33.089613 1 aws_wrapper.go:694] 0 launch templates to query
I0506 13:38:33.089618 1 aws_wrapper.go:714] Successfully queried 0 launch configurations
I0506 13:38:33.089622 1 aws_wrapper.go:725] Successfully queried 0 launch templates
I0506 13:38:33.089627 1 aws_wrapper.go:736] Successfully queried instance requirements for 0 ASGs
I0506 13:38:33.089637 1 aws_manager.go:129] Refreshed ASG list, next refresh after 2024-05-06 13:39:33.089634918 +0000 UTC m=+260083.881492346
I0506 13:38:33.093156 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093444 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093720 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.094109 1 clusterstate.go:623] Found longUnregistered Nodes [aws:///eu-west-1c/i-0356e7154743c269d aws:///eu-west-1b/i-0739a0918fbfaeff1 aws:///eu-west-1c/i-0d52e62f88b44e06f aws:///eu-west-1a/i-024ee7cacf172923e]
I0506 13:38:33.094144 1 static_autoscaler.go:405] 13 unregistered nodes present
I0506 13:38:33.094170 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1a/i-024ee7cacf172923e
I0506 13:38:33.273533 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-024ee7cacf172923e
I0506 13:38:33.273546 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.273594 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1b/i-0739a0918fbfaeff1
I0506 13:38:33.396748 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0739a0918fbfaeff1
I0506 13:38:33.396764 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.396811 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0d52e62f88b44e06f
I0506 13:38:33.599688 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0d52e62f88b44e06f
I0506 13:38:33.599706 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.599755 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0356e7154743c269d
I0506 13:38:33.835324 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0356e7154743c269d
I0506 13:38:33.835341 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.835382 1 static_autoscaler.go:413] Some unregistered nodes were removed
I0506 13:38:33.835513 1 filter_out_schedulable.go:63] Filtering out schedulables
I0506 13:38:33.835665 1 klogx.go:87] failed to find place for XXXX: cannot put pod XXXX on any node
I0506 13:38:33.835818 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.835913 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836019 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836028 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0506 13:38:33.836043 1 filter_out_schedulable.go:83] No schedulable pods
I0506 13:38:33.836048 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I0506 13:38:33.836053 1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 4 unschedulable pods left
I0506 13:38:33.836067 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836072 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836078 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836084 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836362 1 orchestrator.go:109] Upcoming 0 nodes
I0506 13:38:33.837357 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {worker-type: criticalservices}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {worker-type: criticalservices}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"worker-type", Value:"criticalservices", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837372 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837381 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837485 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {app: solr}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {app: solr}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"app", Value:"solr", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837500 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837510 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837522 1 orchestrator.go:193] Best option to resize: ASG_NAME_IS_REMOVED
I0506 13:38:33.837536 1 orchestrator.go:197] Estimated 2 nodes needed in ASG_NAME_IS_REMOVED
I0506 13:38:33.837559 1 orchestrator.go:310] Final scale-up plan: [{ASG_NAME_IS_REMOVED 15->17 (max: 80)}]
I0506 13:38:33.837574 1 orchestrator.go:582] Scale-up: setting group ASG_NAME_IS_REMOVED size to 17
I0506 13:38:33.837590 1 auto_scaling_groups.go:255] Setting asg ASG_NAME_IS_REMOVED size to 17
I0506 13:38:34.001090 1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
In the logs I can see the scale-up plan constantly trying to increase the desired capacity, but it doesn't help:
I0506 13:33:23.132521 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 14->16 (max: 80)}]
I0506 13:33:33.769051 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:33:44.438663 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 12->14 (max: 80)}]
I0506 13:33:54.988557 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:34:05.685369 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:34:16.256407 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:34:26.822806 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:37.435677 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 7->9 (max: 80)}]
I0506 13:34:47.884433 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:58.303376 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:35:09.222123 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:35:19.832029 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:35:30.142655 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:35:40.288624 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 15->17 (max: 80)}]
I0506 13:35:50.673319 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]
I0506 13:36:00.991247 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 19->20 (max: 80)}]
I0506 13:38:12.098179 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 18->20 (max: 80)}]
I0506 13:38:22.838214 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]