denihot opened 1 month ago
Hi,
I attempted to troubleshoot the issue by performing the following steps:
We have the same issue!

Hi @hakman, @johngmyers, sorry for the direct ping, but last time you helped solve an issue quickly :). We rely heavily on kOps (40+ clusters) and use warm pools. In the recent 1.29 releases the warm pool behaviour was changed by the following PRs, which introduced the issue described here. We would appreciate it if you could take a look and fix it! If there is any way we can support you in making that happen quickly, please let us know.
/kind bug
1. What `kops` version are you running? The command `kops version` will display this information.
1.29.2 (git-v1.29.2)

2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
v1.29.6

3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
After editing the kops config with the new k8s version I ran the following commands:

```
kops get assets --copy --state $KOPS_REMOTE_STATE
kops update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE --allow-kops-downgrade
kops update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --state $KOPS_REMOTE_STATE
kops rolling-update cluster $CLUSTER_NAME --yes --state $KOPS_REMOTE_STATE --post-drain-delay 75s --drain-timeout 30m
```
5. What happened after the commands executed?
The upgrade started smoothly and the master nodes were updated successfully; however, an issue arose while updating the autoscaling groups that use a warm pool. The rolling update became stuck because warm pool instances were joining the cluster instead of simply warming up and then powering off.
The following error kept appearing in the kops rolling-update logs:

```
I1002 12:02:19.415658 31 instancegroups.go:565] Cluster did not pass validation, will retry in "30s": node "i-04b854ec78e845f96" of role "node" is not ready, system-node-critical pod "aws-node-4chll" is pending, system-node-critical pod "ebs-csi-node-wcz74" is pending, system-node-critical pod "efs-csi-node-7q2j8" is pending, system-node-critical pod "kube-proxy-i-04b854ec78e845f96" is pending, system-node-critical pod "node-local-dns-mdvq7" is pending.
```
Those nodes showed up as 'NotReady,SchedulingDisabled' in `kubectl get nodes`. I waited 10 minutes with no progress, then manually deleted the problematic nodes. That resolved the issue and allowed the cluster upgrade to resume smoothly.
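For anyone hitting the same symptom, the manual workaround can be scripted. A minimal sketch, run here against sample `kubectl get nodes --no-headers` output (the first instance ID is taken from the report, the second is illustrative; in a live cluster the list would come from kubectl and the final `echo` would be a real `kubectl delete node`):

```shell
# Sample `kubectl get nodes --no-headers` output (second column is STATUS):
nodes='i-04b854ec78e845f96   NotReady,SchedulingDisabled   node   10m   v1.29.6
i-0123456789abcdef0   Ready                         node   30d   v1.29.6'

# Select the nodes stuck in NotReady,SchedulingDisabled:
stuck=$(printf '%s\n' "$nodes" | awk '$2 == "NotReady,SchedulingDisabled" {print $1}')

# In a live cluster this would be: kubectl delete node $stuck
echo "$stuck"
```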
After completing the upgrade, I ran another test by manually removing warmed-up instances from the AWS console. This triggered the creation of new warm pool instances, which were again added to the k8s cluster and remained in the 'NotReady,SchedulingDisabled' state until I removed them manually.
Cluster autoscaler logs for one of those nodes:

```
I1002 13:02:34.149584 1 pre_filtering_processor.go:57] Node i-0cfcda3548f955e05 should not be processed by cluster autoscaler (no node group config)
```
And the relevant log line from the kops-controller:

```
E1002 13:02:10.796429 1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\"" "Node"={"name":"i-0cfcda3548f955e05"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0cfcda3548f955e05" "namespace"="" "reconcileID"="b532008b-db8f-4273-90ad-f0bf9d40858c"
```
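All affected instances can be pulled out of the kops-controller logs with a simple filter. A sketch over the single log line quoted above (the ID pattern `i-` plus 17 hex characters is an assumption about EC2 instance ID format, not something kops guarantees):

```shell
# The kops-controller log line from the report:
log='E1002 13:02:10.796429 1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0cfcda3548f955e05\": found instance \"i-0cfcda3548f955e05\", but state is \"stopped\""'

# Keep only lines reporting a stopped instance, then extract unique instance IDs:
stopped=$(printf '%s\n' "$log" | grep 'state is .*stopped' | grep -o 'i-[0-9a-f]\{17\}' | sort -u)
echo "$stopped"
```

In a live cluster the same filter would be fed from `kubectl logs` on the kops-controller pods instead of a shell variable.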
kube-system pods also stay pending on those nodes for some reason.
6. What did you expect to happen?
I expected the warm pool nodes to start, warm up, and then be powered off without being added to the cluster.
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?