Which component are you using?:
Cluster Autoscaler
What version of the component are you using?:
Component version:
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.*********
Server Version: v1.30.4-eks-a737599
What environment is this in?:
It's in the Dev environment, on AWS (EKS).
What did you expect to happen?:
So, basically, I am using Cluster Autoscaler to autoscale the nodes in two node groups (an on-demand node group and a Spot node group). I have implemented NTH (aws-node-termination-handler) and the priority expander to give preference to Spot instances; since Spot instances can be reclaimed at any time, NTH handles those interruptions. I expected CA to scale nodes up and down within a few minutes (e.g. 5-10 minutes), and in particular to replace an on-demand node quickly if it goes down.
What happened instead?:
The on-demand node goes down for no apparent reason and sits in Unknown status for 6-8 hours, and CA does not create a replacement on-demand node promptly; that also takes more than 5-6 hours. Because the node is in Unknown status, all the pods that were running on it stay stuck in Terminating for 4-5 hours, which is very frustrating because it causes downtime with the RollingUpdate strategy.
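For reference, the priority-expander setup mentioned above is driven by a ConfigMap along the lines of the one below. This is an illustrative sketch, not my exact config: the regexes and priority values are placeholders standing in for the real on-demand and Spot ASG names.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # name the priority expander looks for
  namespace: kube-system
data:
  priorities: |-
    # higher number = higher priority, so CA tries the Spot group first
    50:
      - .*spot.*
    10:
      - .*on-demand.*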
How to reproduce it (as minimally and precisely as possible):
Here are the CA deployment and the related Kubernetes components that I am using (a sketch of the relevant container spec follows).
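The following is only an illustrative sketch of the CA container spec for this kind of setup, not my exact manifest; the image tag, cluster name, and ASG discovery tags are placeholders.

# fragment of the cluster-autoscaler Deployment's pod spec (placeholders marked)
spec:
  serviceAccountName: cluster-autoscaler
  containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # placeholder image tag
    command:
    - ./cluster-autoscaler
    - --cloud-provider=aws
    - --expander=priority                     # use the priority-expander ConfigMap shown earlier
    - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME>
    - --balance-similar-node-groups
    - --skip-nodes-with-system-pods=false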
Anything else we need to know?:
I1101 04:02:57.320439 1 aws_manager.go:188] Found multiple availability zones for ASG "-e2c94f70-6b1c-a9af-47eb-fea9a5915955"; using ap-south-1c for failure-domain.beta.kubernetes.io/zone label
I1101 04:02:57.320612 1 filter_out_schedulable.go:66] Filtering out schedulables
I1101 04:02:57.320714 1 klogx.go:87] failed to find place for logging/fluentd-jswwh: cannot put pod fluentd-jswwh on any node
I1101 04:02:57.320729 1 filter_out_schedulable.go:123] 0 pods marked as unschedulable can be scheduled.
I1101 04:02:57.320738 1 filter_out_schedulable.go:86] No schedulable pods
I1101 04:02:57.320743 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I1101 04:02:57.320748 1 filter_out_daemon_sets.go:49] Filtered out 1 daemon set pods, 0 unschedulable pods left
I1101 04:02:57.320766 1 static_autoscaler.go:557] No unschedulable pods
I1101 04:02:57.320797 1 static_autoscaler.go:580] Calculating unneeded nodes
I1101 04:02:57.320812 1 pre_filtering_processor.go:67] Skipping ip-10-1-137-190.ap-south-1.compute.internal - node group min size reached (current: 1, min: 1)
I1101 04:02:57.320898 1 eligibility.go:104] Scale-down calculation: ignoring 5 nodes unremovable in the last 5m0s
I1101 04:02:57.320940 1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2024-11-01 03:42:50.431875787 +0000 UTC m=+127994.930106250 lastScaleDownDeleteTime=2024-10-31 06:18:29.370821589 +0000 UTC m=+50933.869052042 lastScaleDownFailTime=2024-10-30 15:09:57.022381669 +0000 UTC m=-3578.479387878 scaleDownForbidden=false scaleDownInCooldown=false
I1101 04:02:57.320969 1 static_autoscaler.go:644] Starting scale down
Node Status
ip-10-1-137-190.ap-south-1.compute.internal NotReady 23h v1.30.4-eks-a737599
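For context on the pods stuck in Terminating: assuming the workloads only carry the default tolerations injected by the DefaultTolerationSeconds admission plugin (sketched below), pods on an unreachable node should be marked for eviction after about 300 seconds, but they remain in Terminating until the kubelet confirms the deletion or the Node object is removed, which matches what I am seeing while the node stays in Unknown/NotReady.

# default tolerations added to pods (assumption: my workloads do not override these)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300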