whites11 commented 11 months ago

Towards: https://github.com/giantswarm/giantswarm/issues/28720

The goal of this PR is to disable the cluster autoscaler on a node pool while the ASG is rolling nodes.

Cluster creation:

The CF stack for a node pool (NP) does not contain the k8s.io/cluster-autoscaler/enabled tag, as desired:

Screenshot_20231114_140752

The ASG is also created correctly without the tag:

Screenshot_20231114_141039

AWS operator does another reconciliation loop and adds the tag:

Screenshot_20231114_141345

Upgrade

The tag is removed correctly and CF does not add it back.

After CF has been updated successfully, the tag is added back. Works nicely!
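
The behaviour described above can be sketched as a small decision function. This is only an illustration, not the code in this PR: the function name, the `rolling` flag, and the tag value "true" are assumptions; only the tag key comes from the description.

```python
from typing import Dict

# Tag key taken from the PR description.
AUTOSCALER_TAG = "k8s.io/cluster-autoscaler/enabled"


def desired_asg_tags(base_tags: Dict[str, str], rolling: bool) -> Dict[str, str]:
    """Return the tags a node pool ASG should carry (hypothetical helper).

    While the ASG is rolling nodes (or the stack is still being created or
    updated), the autoscaler tag is left out so the cluster autoscaler
    ignores the node pool. Once the rollout is done, the tag is added back.
    """
    # Always start from a copy without the autoscaler tag.
    tags = {k: v for k, v in base_tags.items() if k != AUTOSCALER_TAG}
    if not rolling:
        # The value "true" is an assumption; the PR only names the key.
        tags[AUTOSCALER_TAG] = "true"
    return tags
```

For example, `desired_asg_tags({"Name": "np-1"}, rolling=True)` keeps only the `Name` tag, while the same call with `rolling=False` also includes the autoscaler tag.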

Checklist

[x] Update changelog in CHANGELOG.md.

paurosello commented 11 months ago

I1120 15:52:53.833531       1 static_autoscaler.go:235] Starting main loop
I1120 15:52:53.834103       1 filter_out_schedulable.go:65] Filtering out schedulables
I1120 15:52:53.834126       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I1120 15:52:53.834137       1 filter_out_schedulable.go:176] 0 pods were kept as unschedulable based on caching
I1120 15:52:53.834145       1 filter_out_schedulable.go:177] 0 pods marked as unschedulable can be scheduled.
I1120 15:52:53.834154       1 filter_out_schedulable.go:87] No schedulable pods
I1120 15:52:53.834169       1 static_autoscaler.go:437] No unschedulable pods
I1120 15:52:53.834197       1 static_autoscaler.go:484] Calculating unneeded nodes
I1120 15:52:53.834211       1 pre_filtering_processor.go:57] Node ip-10-1-2-221.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834224       1 pre_filtering_processor.go:57] Node ip-10-1-1-57.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834233       1 pre_filtering_processor.go:57] Node ip-10-1-2-110.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834238       1 pre_filtering_processor.go:57] Node ip-10-1-2-102.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834243       1 pre_filtering_processor.go:57] Node ip-10-1-2-14.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 15:52:53.834272       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 15:52:53.834309       1 static_autoscaler.go:551] Starting scale down
I1120 15:52:53.834369       1 scale_down.go:917] No candidates for scale down

It works: as the log shows, the nodes are not considered for scale-down.
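
To check this without reading the log by eye, a short script can pull out the relevant node names. This is a rough sketch that assumes the message wording shown in the logs in this thread; the function name `summarize` is made up for illustration, and the patterns may need adjusting for other cluster-autoscaler versions.

```python
import re

# Patterns based on the log lines pasted in this thread.
SKIPPED = re.compile(
    r"Node (\S+) should not be processed by cluster autoscaler "
    r"\(no node group config\)"
)
UNNEEDED = re.compile(r"\] (\S+) is unneeded since")


def summarize(log_text):
    """Return (skipped, unneeded) node name sets found in autoscaler logs.

    skipped:  nodes the autoscaler ignores because their ASG is not tagged.
    unneeded: nodes the autoscaler is evaluating for scale-down.
    """
    skipped = set(SKIPPED.findall(log_text))
    unneeded = set(UNNEEDED.findall(log_text))
    return skipped, unneeded
```

While the tag is absent, the nodes of the node pool should all show up in `skipped` and none in `unneeded`.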

paurosello commented 11 months ago

And after the upgrade, the autoscaler considers the nodes for scale-down again:

I1120 16:14:26.574106       1 cluster.go:167] node ip-10-1-2-110.eu-central-1.compute.internal may be removed
I1120 16:14:26.574114       1 cluster.go:139] ip-10-1-2-102.eu-central-1.compute.internal for removal
I1120 16:14:26.574206       1 cluster.go:150] node ip-10-1-2-102.eu-central-1.compute.internal cannot be removed: non-daemonset, non-mirrored, non-pdb-assigned kube-system pod present: hubble-relay-77f95d8cdf-pcpzw
I1120 16:14:26.574229       1 scale_down.go:612] 1 nodes found to be unremovable in simulation, will re-check them at 2023-11-20 16:19:26.365123138 +0000 UTC m=+2247.772568188
I1120 16:14:26.574280       1 static_autoscaler.go:527] ip-10-1-2-130.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 0s
I1120 16:14:26.574303       1 static_autoscaler.go:527] ip-10-1-2-110.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 0s
I1120 16:14:26.574331       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 16:14:26.574365       1 static_autoscaler.go:551] Starting scale down
I1120 16:14:26.574405       1 scale_down.go:828] ip-10-1-2-130.eu-central-1.compute.internal was unneeded for 0s
I1120 16:14:26.574422       1 scale_down.go:828] ip-10-1-2-110.eu-central-1.compute.internal was unneeded for 0s
I1120 16:14:26.574444       1 scale_down.go:917] No candidates for scale down
I1120 16:14:26.584675       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:26.594633       1 delete.go:103] Successfully added DeletionCandidateTaint on node ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:38.240438       1 reflector.go:536] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320: Watch close - *v1.DaemonSet total 40 items received
I1120 16:14:48.808921       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 19 items received
I1120 16:14:54.236736       1 reflector.go:536] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246: Watch close - *v1.Node total 20 items received
I1120 16:14:56.602084       1 static_autoscaler.go:235] Starting main loop
I1120 16:14:56.602556       1 taints.go:77] Removing autoscaler soft taint when creating template from node
I1120 16:14:56.624196       1 filter_out_schedulable.go:65] Filtering out schedulables
I1120 16:14:56.624213       1 filter_out_schedulable.go:137] Filtered out 0 pods using hints
I1120 16:14:56.624221       1 filter_out_schedulable.go:176] 0 pods were kept as unschedulable based on caching
I1120 16:14:56.624227       1 filter_out_schedulable.go:177] 0 pods marked as unschedulable can be scheduled.
I1120 16:14:56.624236       1 filter_out_schedulable.go:87] No schedulable pods
I1120 16:14:56.624256       1 static_autoscaler.go:437] No unschedulable pods
I1120 16:14:56.624285       1 static_autoscaler.go:484] Calculating unneeded nodes
I1120 16:14:56.624308       1 pre_filtering_processor.go:57] Node ip-10-1-1-57.eu-central-1.compute.internal should not be processed by cluster autoscaler (no node group config)
I1120 16:14:56.624345       1 scale_down.go:448] Node ip-10-1-2-110.eu-central-1.compute.internal - cpu utilization 0.446571
I1120 16:14:56.624386       1 scale_down.go:448] Node ip-10-1-2-130.eu-central-1.compute.internal - cpu utilization 0.386571
I1120 16:14:56.624400       1 scale_down.go:509] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I1120 16:14:56.624440       1 cluster.go:139] ip-10-1-2-110.eu-central-1.compute.internal for removal
I1120 16:14:56.624719       1 cluster.go:322] Pod security-bundle/exception-recommender-866fd4dc6-nn97s can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.624785       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-recommender-5488676cf9-p8zj7 can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624834       1 cluster.go:322] Pod kube-system/external-dns-665f9b69df-h67mb can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.624876       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-updater-7548f4f59d-4ls4k can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624918       1 cluster.go:322] Pod kube-system/aws-pod-identity-webhook-app-76d7ccbf76-q7xr9 can be moved to ip-10-1-2-130.eu-central-1.compute.internal
I1120 16:14:56.624935       1 cluster.go:167] node ip-10-1-2-110.eu-central-1.compute.internal may be removed
I1120 16:14:56.624942       1 cluster.go:139] ip-10-1-2-130.eu-central-1.compute.internal for removal
I1120 16:14:56.625293       1 cluster.go:322] Pod kube-system/cert-manager-app-cainjector-6554fdb9b6-5vkjs can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625340       1 cluster.go:322] Pod kube-system/cert-exporter-deployment-866d987dff-5wr67 can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625378       1 cluster.go:322] Pod kube-system/vertical-pod-autoscaler-admission-controller-6987f7bcff-jjx2f can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625426       1 cluster.go:322] Pod kube-system/prometheus-operator-app-kube-state-metrics-947999bf6-gwqkm can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625552       1 cluster.go:322] Pod kube-system/cert-manager-app-webhook-78d4f6464d-b2k4w can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625640       1 cluster.go:322] Pod kube-system/aws-pod-identity-webhook-app-76d7ccbf76-8p94z can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625700       1 cluster.go:322] Pod kube-system/cert-manager-app-778c9d78f6-g4cvj can be moved to ip-10-1-1-57.eu-central-1.compute.internal
I1120 16:14:56.625760       1 cluster.go:322] Pod kube-system/coredns-workers-7c7f49dcf6-fgkv8 can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625804       1 cluster.go:322] Pod kube-system/prometheus-operator-app-operator-f9fd58cdb-l9h8k can be moved to ip-10-1-2-102.eu-central-1.compute.internal
I1120 16:14:56.625854       1 cluster.go:322] Pod security-bundle/kyverno-policy-operator-5646d858cf-4vmff can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.625954       1 cluster.go:322] Pod kube-system/cilium-operator-85dd5884bb-hbr4p can be moved to ip-10-1-1-57.eu-central-1.compute.internal
I1120 16:14:56.626024       1 cluster.go:322] Pod kube-system/metrics-server-7f6744c45-hgzz6 can be moved to ip-10-1-2-110.eu-central-1.compute.internal
I1120 16:14:56.626067       1 cluster.go:167] node ip-10-1-2-130.eu-central-1.compute.internal may be removed
I1120 16:14:56.626106       1 static_autoscaler.go:527] ip-10-1-2-110.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 30.236939674s
I1120 16:14:56.626131       1 static_autoscaler.go:527] ip-10-1-2-130.eu-central-1.compute.internal is unneeded since 2023-11-20 16:14:26.365123138 +0000 UTC m=+1947.772568188 duration 30.236939674s
I1120 16:14:56.626152       1 static_autoscaler.go:538] Scale down status: unneededOnly=false lastScaleUpTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownDeleteTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 lastScaleDownFailTime=2023-11-20 14:42:22.632355361 +0000 UTC m=-3575.960199665 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I1120 16:14:56.626175       1 static_autoscaler.go:551] Starting scale down
I1120 16:14:56.626211       1 scale_down.go:828] ip-10-1-2-110.eu-central-1.compute.internal was unneeded for 30.236939674s
I1120 16:14:56.626225       1 scale_down.go:828] ip-10-1-2-130.eu-central-1.compute.internal was unneeded for 30.236939674s
I1120 16:14:56.626246       1 scale_down.go:917] No candidates for scale down

giantswarm / aws-operator

Tag disable autoscaler #3657
