stl-victor-sudakov opened 2 weeks ago
I have tried killing the running pod, and now I again have one pod running and one pending:
$ kubectl -n kube-system get pods -l k8s-app=aws-node-termination-handler
NAME READY STATUS RESTARTS AGE
aws-node-termination-handler-577f866468-bj4gd 0/1 Pending 0 41h
aws-node-termination-handler-6c9c8d7948-vt7hh 1/1 Running 0 3m30s
Hi @stl-victor-sudakov
You should find out the reason for the Pending state by running something like:
kubectl describe pod aws-node-termination-handler-577f866468-bj4gd -n kube-system
That can happen when the Kubernetes scheduler cannot find anywhere to place the second pod. Most of the time it means the new controller nodes have not joined the cluster properly, which is why the scheduler cannot place the pod on the target nodes.
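For example, to compare the pod's scheduling constraints against the labels on the nodes (a minimal sketch, using the pod name from above):

kubectl -n kube-system get pod aws-node-termination-handler-577f866468-bj4gd -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity}{"\n"}'
kubectl get nodes --show-labels

If the selector/affinity only matches the control-plane node, the pod can never land anywhere else.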
@nuved I think I have already posted the error message above, but I don't mind repeating it. The relevant part of "kubectl -n kube-system describe pod aws-node-termination-handler-577f866468-bj4gd" is:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13m (x6180 over 2d3h) default-scheduler 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 4 Preemption is not helpful for scheduling.
There is actually only one control node in the cluster. Is there any additional information I could provide?
UPD the complete "describe pod" output can be seen here: https://termbin.com/0sy6 (not to clutter the conversation with excessive output).
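For the record, the "didn't have free ports for the requested pod ports" part usually points at a hostNetwork/hostPort conflict: another pod (here, presumably the old replica) is already holding the port on the only node that matches the affinity. One way to check which ports the deployment requests (a sketch, not verified against the kops-managed manifest):

kubectl -n kube-system get deployment aws-node-termination-handler -o jsonpath='{.spec.template.spec.hostNetwork}{"\n"}'
kubectl -n kube-system get deployment aws-node-termination-handler -o jsonpath='{.spec.template.spec.containers[*].ports}{"\n"}'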
Well, that means there are not enough suitable nodes. You should make sure all nodes are up and Ready: kubectl get nodes -o wide. I guess one of the controllers has an issue.
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
i-01dbd1dccc0e30845 Ready node 91d v1.29.3 172.22.43.73 35.90.140.78 Ubuntu 22.04.4 LTS 6.5.0-1018-aws containerd://1.7.16
i-02cf4b0fed779eb54 Ready control-plane 135d v1.29.3 172.22.48.131 34.222.92.123 Ubuntu 22.04.4 LTS 6.5.0-1018-aws containerd://1.7.16
i-05569e161b2556a75 Ready node 91d v1.29.3 172.22.35.18 34.213.33.180 Ubuntu 22.04.4 LTS 6.5.0-1018-aws containerd://1.7.16
i-06c219f4c3404e207 Ready node 91d v1.29.3 172.22.56.240 54.203.143.227 Ubuntu 22.04.4 LTS 6.5.0-1018-aws containerd://1.7.16
i-0d1c604064d671d98 Ready node 91d v1.29.3 172.22.61.60 18.237.56.79 Ubuntu 22.04.4 LTS 6.5.0-1018-aws containerd://1.7.16
$
It is a single-control-plane cluster. Also:
$ kops get instances
Using cluster from kubectl context: devXXXXXXX
ID NODE-NAME STATUS ROLES STATE INTERNAL-IP EXTERNAL-IP INSTANCE-GROUP MACHINE-TYPE
i-01dbd1dccc0e30845 i-01dbd1dccc0e30845 NeedsUpdate node 172.22.43.73 nodes-us-west-2c.YYYY t3a.xlarge
i-02cf4b0fed779eb54 i-02cf4b0fed779eb54 NeedsUpdate control-plane, control-plane 172.22.48.131 control-plane-us-west-2c.masters.YYYY t3a.medium
i-05569e161b2556a75 i-05569e161b2556a75 NeedsUpdate node 172.22.35.18 nodes-us-west-2c.YYYY t3a.xlarge
i-06c219f4c3404e207 i-06c219f4c3404e207 NeedsUpdate node 172.22.56.240 nodes-us-west-2c.YYYY t3a.xlarge
i-0d1c604064d671d98 i-0d1c604064d671d98 NeedsUpdate node 172.22.61.60 nodes-us-west-2c.YYYY t3a.xlarge
$
There is exactly one instance (i-02cf4b0fed779eb54) in the control-plane-us-west-2c.masters.dev2XXXXX AWS autoscaling group, and it is healthy according to AWS.
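For reference, one way to confirm the ASG state from the AWS CLI (a sketch; the group name is the redacted one above):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names control-plane-us-west-2c.masters.dev2XXXXX \
  --query 'AutoScalingGroups[0].Instances[*].[InstanceId,HealthStatus,LifecycleState]' \
  --output table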
Probably you just need to adjust the replica count manually and set it to 1:
kubectl edit deployment aws-node-termination-handler -n kube-system
I'm not sure how you can change the replica count via kops, but this should work.
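Equivalently, as a one-liner (note this edits the live object only; kops may reconcile a managed addon back to its manifest on the next update):

kubectl -n kube-system scale deployment aws-node-termination-handler --replicas=1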
Manually deleting the ReplicaSet which had contained the old aws-node-termination-handler pod did the trick (the pod was finally replaced), but this should happen automatically and not prevent the "kops rolling-update cluster" command from running smoothly.
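For anyone else hitting this, the workaround was roughly the following (a sketch; the old ReplicaSet's name is the prefix of the old Running pod's name, 6c9c8d7948 in the output above):

kubectl -n kube-system get rs -l k8s-app=aws-node-termination-handler
kubectl -n kube-system delete rs aws-node-termination-handler-6c9c8d7948

Once the old ReplicaSet and its pod (which was holding the host port) were gone, the Pending pod could finally be scheduled.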
/kind bug
1. What kops version are you running? The command kops version will display this information.
1.30.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
v1.29.3
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops upgrade cluster --name XXX --kubernetes-version 1.29.9 --yes
kops --name XXX update cluster --yes --admin
kops --name XXX rolling-update cluster --yes
5. What happened after the commands executed?
Cluster did not pass validation at the very beginning of the upgrade procedure:
When I looked up why the pod was pending, I found the following in "describe pod aws-node-termination-handler-577f866468-mmlx7":
There is another aws-node-termination-handler- pod running at the moment (the old one):
6. What did you expect to happen?
I expected the cluster to be upgraded to Kubernetes 1.29.9.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Please see above the validation log.
9. Anything else do we need to know?
Now I would like to know how to recover from this situation and how to get rid of the aws-node-termination-handler-577f866468-mmlx7 pod, which is now left in the Pending state.