giantswarm / aws-operator

Manages Kubernetes clusters running on AWS (before Cluster API)
https://www.giantswarm.io/
Apache License 2.0

Tag disable autoscaler #3657

Closed: whites11 closed this 11 months ago

whites11 commented 11 months ago

Towards: https://github.com/giantswarm/giantswarm/issues/28720

The goal of this PR is to disable the cluster autoscaler on a node pool while the ASG is rolling nodes.

Cluster creation:

The CloudFormation (CF) stack for a node pool (NP) does not contain the k8s.io/cluster-autoscaler/enabled tag, as desired:

Screenshot_20231114_140752

The ASG is also created correctly without the tag:

Screenshot_20231114_141039

aws-operator then runs another reconciliation loop and adds the tag:

Screenshot_20231114_141345

Upgrade:

The tag is removed as expected, and CF does not add it back.

Once the CF stack has been updated successfully, the tag is added back. Works nicely!
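
The tag lifecycle described above (no tag while the ASG is being created or is rolling nodes, tag present once the stack has settled) can be sketched as a small pure function. This is only an illustration and not the code from this PR: the function name desiredTags, the rolling flag, and the tag value "true" are assumptions; only the tag key comes from the description.

```go
package main

// autoscalerEnabledTag is the tag key named in the PR description.
const autoscalerEnabledTag = "k8s.io/cluster-autoscaler/enabled"

// desiredTags returns a copy of base in which the autoscaler tag is present
// only when the ASG is not rolling nodes.
//
// Hypothetical helper: the rolling flag stands in for whatever signal the
// operator uses to detect an in-progress stack creation or update, and the
// tag value "true" is an assumption.
func desiredTags(base map[string]string, rolling bool) map[string]string {
	out := make(map[string]string, len(base)+1)
	for k, v := range base {
		if k == autoscalerEnabledTag {
			continue // decided below, independent of the input
		}
		out[k] = v
	}
	if !rolling {
		out[autoscalerEnabledTag] = "true"
	}
	return out
}
```

Calling desiredTags(tags, true) while the stack is updating yields a tag set without the autoscaler key; calling it with false on a later reconciliation loop yields the same set plus the key. The input map is never modified.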


paurosello commented 11 months ago
I1120 15:52:53.833531       1 static_autoscaler.go:235] Starting main loop
I1120 15:52:53.834103       1 filter_out_schedulable.go:65] Filtering out schedulables
I1120 15:52:53.834126       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I1120 15:52:53.834137       1 filter_out_schedulable.go:176] 0 pods were kept as unschedulable based on caching
I1120 15:52:53.834145       1 filter_out_schedulable.go:177] 0 pods marked as unschedulable can be scheduled.
I1120 15:52:53.834154       1 filter_out_schedulable.go:87] No schedulable pods
I1120 15:52:53.834169       1 static_autoscaler.go:437] No unschedulable pods
I1120 15:52:53.834197       1 static_autoscaler.go:484] Calculating unneeded nodes
I1120 15:52:53.834211       1 pre_filtering_processor.go:57] Node ip-10-1-2-221.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834224       1 pre_filtering_processor.go:57] Node ip-10-1-1-57.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834233       1 pre_filtering_processor.go:57] Node ip-10-1-2-110.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834238       1 pre_filtering_processor.go:57] Node ip-10-1-2-102.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834243       1 pre_filtering_processor.go:57] Node ip-10-1-2-14.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834272       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 15:52:53.834309       1 static_autoscaler.go:551] Starting scale down
I1120 15:52:53.834369       1 scale_down.go:917] No candidates for scale down

It works: as the log shows, the nodes are skipped ("no node group config") and are not considered for scale-down.
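
For a longer log, the same check can be done mechanically. The following is a minimal sketch that extracts the node names from the "should not be processed by cluster autoscaler (no node group config)" lines shown above. The helper name skippedNodes is hypothetical, and the regular expression assumes the exact message wording of the log excerpt in this comment.

```go
package main

import (
	"regexp"
	"strings"
)

// skippedNodeRe matches the pre-filtering message seen in the log above and
// captures the node name. The wording is taken from that excerpt and may
// differ in other cluster-autoscaler versions.
var skippedNodeRe = regexp.MustCompile(
	`Node (\S+) should not be processed by cluster autoscaler \(no node group config\)`)

// skippedNodes returns the distinct node names, in order of first
// appearance, that the autoscaler reported as not belonging to any
// node group it manages.
func skippedNodes(log string) []string {
	var nodes []string
	seen := make(map[string]bool)
	for _, line := range strings.Split(log, "\n") {
		m := skippedNodeRe.FindStringSubmatch(line)
		if m == nil || seen[m[1]] {
			continue
		}
		seen[m[1]] = true
		nodes = append(nodes, m[1])
	}
	return nodes
}
```

Run against the excerpt above, this would return the five skipped nodes; comparing that list with the node pool's members confirms that the autoscaler ignores the whole pool while the tag is absent.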

paurosello commented 11 months ago

And after the upgrade, the autoscaler considers the nodes for scale-down again:

I1120 16:14:26.574106       1 cluster.go:167] node ip-10-1-2-110.eu-central-1.compute.internal may be removed
I1120 16:14:26.574114       1 cluster.go:139] ip-10-1-2-102.eu-central-1.compute.internal for removal
I1120 16:14:26.574206       1 cluster.go:150] node ip-10-1-2-102.eu-central-1.compute.internal cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: hubble-relay-77f95d8cdf-pcpzw
I1120 16:14:26.574229       1 scale_down.go:612] 1 nodes found to be unremovable in simulation, will re-check them at 2023-11-20 16:19:26.365123138 +0000 UTC m=+2247.772568188
I1120 16:14:26.574280       1 static_autoscaler.go:527] ip-10-1-2-130.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 0s
I1120 16:14:26.574303       1 static_autoscaler.go:527] ip-10-1-2-110.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 0s
I1120 16:14:26.574331       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 16:14:26.574365       1 static_autoscaler.go:551] Starting scale down
I1120 16:14:26.574405       1 scale_down.go:828] ip-10-1-2-130.eu-central-1.compute.internal was unneeded for 0s
I1120 16:14:26.574422       1 scale_down.go:828] ip-10-1-2-110.eu-central-1.compute.internal was unneeded for 0s
I1120 16:14:26.574444       1 scale_down.go:917] No candidates for scale down
I1120 16:14:26.584675       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:26.594633       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:38.240438       1 reflector.go:536] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320: Watch close - *v1.DaemonSet total 40 items received
I1120 16:14:48.808921       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 19 items received
I1120 16:14:54.236736       1 reflector.go:536] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246: Watch close - *v1.Node total 20 items received
I1120 16:14:56.602084       1 static_autoscaler.go:235] Starting main loop
I1120 16:14:56.602556       1 taints.go:77] Removing autoscaler soft taint when creating template from node
I1120 16:14:56.624196       1 filter_out_schedulable.go:65] Filtering out schedulables
I1120 16:14:56.624213       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I1120 16:14:56.624221       1 filter_out_schedulable.go:176] 0 pods were kept as unschedulable based on caching
I1120 16:14:56.624227       1 filter_out_schedulable.go:177] 0 pods marked as unschedulable can be scheduled.
I1120 16:14:56.624236       1 filter_out_schedulable.go:87] No schedulable pods
I1120 16:14:56.624256       1 static_autoscaler.go:437] No unschedulable pods
I1120 16:14:56.624285       1 static_autoscaler.go:484] Calculating unneeded nodes
I1120 16:14:56.624308       1 pre_filtering_processor.go:57] Node ip-10-1-1-57.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 16:14:56.624345       1 scale_down.go:448] Node ip-10-1-2-110.eu-central-1.compute.internal - cpu utilization 0.446571
I1120 16:14:56.624386       1 scale_down.go:448] Node ip-10-1-2-130.eu-central-1.compute.internal - cpu utilization 0.386571
I1120 16:14:56.624400       1 scale_down.go:509] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I1120 16:14:56.624440       1 cluster.go:139] ip-10-1-2-110.eu-central-1.compute.internal for removal
I1120 16:14:56.624719       1 cluster.go:322] Pod security-bundle/exception-recommender-866fd4dc6-nn97s can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.624785       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-recommender-5488676cf9-p8zj7 can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624834       1 cluster.go:322] Pod kube-system/external-dns-665f9b69df-h67mb can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.624876       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-updater-7548f4f59d-4ls4k can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624918       1 cluster.go:322] Pod kube-system/aws-pod-identity-webhook-app-76d7ccbf76-q7xr9 can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624935       1 cluster.go:167] node ip-10-1-2-110.eu-central-1.compute.internal may be removed
I1120 16:14:56.624942       1 cluster.go:139] ip-10-1-2-130.eu-central-1.compute.internal for removal
I1120 16:14:56.625293       1 cluster.go:322] Pod kube-system/cert-manager-app-cainjector-6554fdb9b6-5vkjs can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625340       1 cluster.go:322] Pod kube-system/cert-exporter-deployment-866d987dff-5wr67 can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625378       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-admission-controller-6987f7bcff-jjx2f can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625426       1 cluster.go:322] Pod kube-system/prometheus-operator-app-kube-state-metrics-947999bf6-gwqkm can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625552       1 cluster.go:322] Pod kube-system/cert-manager-app-webhook-78d4f6464d-b2k4w can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625640       1 cluster.go:322] Pod kube-system/aws-pod-identity-webhook-app-76d7ccbf76-8p94z can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625700       1 cluster.go:322] Pod kube-system/cert-manager-app-778c9d78f6-g4cvj can be moved to ip-10-1-1-57.eu-central-1.compute.internal
I1120 16:14:56.625760       1 cluster.go:322] Pod kube-system/coredns-workers-7c7f49dcf6-fgkv8 can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625804       1 cluster.go:322] Pod kube-system/prometheus-operator-app-operator-f9fd58cdb-l9h8k can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625854       1 cluster.go:322] Pod security-bundle/kyverno-policy-operator-5646d858cf-4vmff can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625954       1 cluster.go:322] Pod kube-system/cilium-operator-85dd5884bb-hbr4p can be moved to ip-10-1-1-57.eu-central-1.compute.internal
I1120 16:14:56.626024       1 cluster.go:322] Pod kube-system/metrics-server-7f6744c45-hgzz6 can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.626067       1 cluster.go:167] node ip-10-1-2-130.eu-central-1.compute.internal may be removed
I1120 16:14:56.626106       1 static_autoscaler.go:527] ip-10-1-2-110.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 30.236939674s
I1120 16:14:56.626131       1 static_autoscaler.go:527] ip-10-1-2-130.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 30.236939674s
I1120 16:14:56.626152       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 16:14:56.626175       1 static_autoscaler.go:551] Starting scale down
I1120 16:14:56.626211       1 scale_down.go:828] ip-10-1-2-110.eu-central-1.compute.internal was unneeded for 30.236939674s
I1120 16:14:56.626225       1 scale_down.go:828] ip-10-1-2-130.eu-central-1.compute.internal was unneeded for 30.236939674s
I1120 16:14:56.626246       1 scale_down.go:917] No candidates for scale down