Open · makzzz1986 opened this issue 7 months ago
Looks like shrinking is OK because of: https://github.com/kubernetes/autoscaler/blob/3fd892a37b50a885eaceaa9619a1a3e153548dc9/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L338 but I can't explain how shrinking the desired capacity by the number of unregistered nodes brings the desired capacity to almost zero. Can it be that Cluster-Autoscaler requests removal of the same instance a few times and drops the desired capacity by more than one? 🤔
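To illustrate what I suspect, here is a minimal sketch (not the actual provider code; instance IDs and sizes are made up): if each DeleteInstances call decrements a locally cached desired capacity once per instance, and the same long-unregistered instances are submitted again in a later loop before the cache or the ASG catches up, the capacity is lowered more than once for a single node.

```go
// Minimal sketch of the suspected failure mode, not the real AWS provider code.
package main

import "fmt"

type asg struct {
	name    string
	curSize int // locally cached desired capacity
}

// deleteInstances mimics the decrement around auto_scaling_groups.go#L338:
// one unit of desired capacity is removed per instance in the request.
func (a *asg) deleteInstances(ids []string) {
	for _, id := range ids {
		fmt.Printf("%s: terminating %s, desired capacity %d -> %d\n", a.name, id, a.curSize, a.curSize-1)
		a.curSize--
	}
}

func main() {
	a := &asg{name: "example-asg", curSize: 18} // hypothetical ASG and size
	// Loop 1: four long-unregistered nodes are removed.
	a.deleteInstances([]string{"i-aaa", "i-bbb", "i-ccc", "i-ddd"})
	// Loop 2: the same nodes are still reported as unregistered, so they are
	// submitted for removal again and the cached desired capacity keeps dropping.
	a.deleteInstances([]string{"i-aaa", "i-bbb", "i-ccc", "i-ddd"})
	fmt.Println("final cached desired capacity:", a.curSize) // 10 instead of 14
}
```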
/area cluster-autoscaler
/cc
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Which component are you using?: cluster-autoscaler
What version of the component are you using?: v2.7.2
Component version: Kubernetes server v1.26.14
What environment is this in?: AWS EKS
What did you expect to happen?: When Cluster-Autoscaler removes an unregistered node, it should not decrease the desired capacity of the AWS Auto Scaling group.
What happened instead?: When a broken kubelet configuration is introduced and an EC2 instance can't register as a node, Cluster-Autoscaler terminates the instance and decreases the desired capacity of the Auto Scaling group by the number of terminated instances. This causes a rapid drop of instances and healthy nodes towards zero, especially if the Auto Scaling group removes the oldest instances during shrinking.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: We experienced it a few times; some logs below. [Chart of the Auto Scaling group monitoring while reproducing the behavior]
ASG activity log, with entries like these appearing repeatedly:
At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17.
At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15.
At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15.
At 2024-05-06T13:38:47Z instance i-08259e65446b2711a was selected for termination.
At 2024-05-06T13:38:47Z instance i-03a8da6f7f73f80f7 was selected for termination.
At 2024-05-06T13:38:47Z instance i-0946a707c1d4d62f5 was selected for termination.
Then it repeats, dropping the desired capacity very fast every few seconds.
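To make sense of the "taken out of service in response to a difference between desired and actual capacity" entries, here is a toy model (my assumption about the ASG's reconciliation under an oldest-instance-style termination policy, not AWS's actual algorithm; instance IDs and counts are made up): once the desired capacity has been pushed below the number of in-service instances, the ASG terminates instances on its own, and under that policy it picks long-running healthy nodes rather than the broken unregistered ones.

```go
// Toy model of the ASG reconciling desired vs. actual capacity (an assumption,
// not AWS code).
package main

import (
	"fmt"
	"sort"
)

type instance struct {
	id        string
	launchSeq int  // lower = older
	healthy   bool // registered as a Kubernetes node
}

// reconcile terminates the oldest instances until the fleet matches desired.
func reconcile(fleet []instance, desired int) []instance {
	sort.Slice(fleet, func(i, j int) bool { return fleet[i].launchSeq < fleet[j].launchSeq })
	for len(fleet) > desired {
		victim := fleet[0]
		fmt.Printf("ASG terminating %s (healthy=%v), shrinking from %d to %d\n",
			victim.id, victim.healthy, len(fleet), len(fleet)-1)
		fleet = fleet[1:]
	}
	return fleet
}

func main() {
	var fleet []instance
	for i := 0; i < 14; i++ { // older, healthy nodes (hypothetical IDs)
		fleet = append(fleet, instance{id: fmt.Sprintf("i-healthy-%02d", i), launchSeq: i, healthy: true})
	}
	for i := 0; i < 4; i++ { // newer instances that never registered
		fleet = append(fleet, instance{id: fmt.Sprintf("i-broken-%02d", i), launchSeq: 100 + i, healthy: false})
	}
	// Cluster-Autoscaler decremented the desired capacity to 14 while 18
	// instances are still in service, so the ASG shrinks the fleet itself and
	// the four oldest, healthy nodes are the ones taken out of service.
	reconcile(fleet, 14)
}
```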
I could find in the source code that decreasing the desired capacity is logged, but I could not find it in the Cluster-Autoscaler logs, only logs of instance removal:
I0506 13:38:32.981481 1 static_autoscaler.go:289] Starting main loop
I0506 13:38:32.981659 1 auto_scaling_groups.go:367] Regenerating instance to ASG map for ASG names: []
I0506 13:38:32.981674 1 auto_scaling_groups.go:374] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/eks-01-shared-staging: k8s.io/cluster-autoscaler/enabled:]
I0506 13:38:33.089424 1 auto_scaling_groups.go:140] Updating ASG ASG_NAME_IS_REMOVED
I0506 13:38:33.089603 1 aws_wrapper.go:693] 0 launch configurations to query
I0506 13:38:33.089613 1 aws_wrapper.go:694] 0 launch templates to query
I0506 13:38:33.089618 1 aws_wrapper.go:714] Successfully queried 0 launch configurations
I0506 13:38:33.089622 1 aws_wrapper.go:725] Successfully queried 0 launch templates
I0506 13:38:33.089627 1 aws_wrapper.go:736] Successfully queried instance requirements for 0 ASGs
I0506 13:38:33.089637 1 aws_manager.go:129] Refreshed ASG list, next refresh after 2024-05-06 13:39:33.089634918 +0000 UTC m=+260083.881492346
I0506 13:38:33.093156 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093444 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093720 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.094109 1 clusterstate.go:623] Found longUnregistered Nodes [aws:///eu-west-1c/i-0356e7154743c269d aws:///eu-west-1b/i-0739a0918fbfaeff1 aws:///eu-west-1c/i-0d52e62f88b44e06f aws:///eu-west-1a/i-024ee7cacf172923e]
I0506 13:38:33.094144 1 static_autoscaler.go:405] 13 unregistered nodes present
I0506 13:38:33.094170 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1a/i-024ee7cacf172923e
I0506 13:38:33.273533 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-024ee7cacf172923e
I0506 13:38:33.273546 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.273594 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1b/i-0739a0918fbfaeff1
I0506 13:38:33.396748 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0739a0918fbfaeff1
I0506 13:38:33.396764 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.396811 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0d52e62f88b44e06f
I0506 13:38:33.599688 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0d52e62f88b44e06f
I0506 13:38:33.599706 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.599755 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0356e7154743c269d
I0506 13:38:33.835324 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0356e7154743c269d
I0506 13:38:33.835341 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.835382 1 static_autoscaler.go:413] Some unregistered nodes were removed
I0506 13:38:33.835513 1 filter_out_schedulable.go:63] Filtering out schedulables
I0506 13:38:33.835665 1 klogx.go:87] failed to find place for XXXX: cannot put pod XXXX on any node
I0506 13:38:33.835818 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.835913 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836019 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836028 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0506 13:38:33.836043 1 filter_out_schedulable.go:83] No schedulable pods
I0506 13:38:33.836048 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I0506 13:38:33.836053 1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 4 unschedulable pods left
I0506 13:38:33.836067 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836072 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836078 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836084 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836362 1 orchestrator.go:109] Upcoming 0 nodes
I0506 13:38:33.837357 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {worker-type: criticalservices}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {worker-type: criticalservices}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"worker-type", Value:"criticalservices", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837372 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837381 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837485 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {app: solr}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {app: solr}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"app", Value:"solr", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837500 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837510 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837522 1 orchestrator.go:193] Best option to resize: ASG_NAME_IS_REMOVED
I0506 13:38:33.837536 1 orchestrator.go:197] Estimated 2 nodes needed in ASG_NAME_IS_REMOVED
I0506 13:38:33.837559 1 orchestrator.go:310] Final scale-up plan: [{ASG_NAME_IS_REMOVED 15->17 (max: 80)}]
I0506 13:38:33.837574 1 orchestrator.go:582] Scale-up: setting group ASG_NAME_IS_REMOVED size to 17
I0506 13:38:33.837590 1 auto_scaling_groups.go:255] Setting asg ASG_NAME_IS_REMOVED size to 17
I0506 13:38:34.001090 1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
In the logs I can see the scale-up plan constantly trying to increase the desired capacity, but it doesn't help:
I0506 13:33:23.132521 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 14->16 (max: 80)}]
I0506 13:33:33.769051 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:33:44.438663 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 12->14 (max: 80)}]
I0506 13:33:54.988557 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:34:05.685369 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:34:16.256407 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:34:26.822806 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:37.435677 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 7->9 (max: 80)}]
I0506 13:34:47.884433 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:58.303376 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:35:09.222123 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:35:19.832029 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:35:30.142655 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:35:40.288624 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 15->17 (max: 80)}]
I0506 13:35:50.673319 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]
I0506 13:36:00.991247 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 19->20 (max: 80)}]
I0506 13:38:12.098179 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 18->20 (max: 80)}]
I0506 13:38:22.838214 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]