Closed kamilgregorczyk closed 1 year ago
I restarted everything and booted only the master, with the same result.
I waited 30 minutes but nothing happened, so I drained the nodes manually with this script:
#!/bin/bash
KUBECTL="/usr/local/bin/kubectl"

# Drain any node currently reported as NotReady.
NOT_READY_NODES=$($KUBECTL get nodes | grep 'NotReady' | awk '{print $1}')
while IFS= read -r line; do
  if [[ ! $line =~ [^[:space:]] ]]; then
    continue
  fi
  echo "Found $line node to be dead, draining..."
  $KUBECTL drain --ignore-daemonsets --force "$line"
done <<< "$NOT_READY_NODES"

# Uncordon any node that is Ready again but still cordoned.
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}')
while IFS= read -r line; do
  if [[ ! $line =~ [^[:space:]] ]]; then
    continue
  fi
  echo "Found $line node to be online again, undraining..."
  $KUBECTL uncordon "$line"
done <<< "$READY_NODES"
This script should never be needed, though; the whole point of Kubernetes is its ability to self-heal.
I found that when you delete a NotReady node, its pods do get reassigned, but the worker gets added back to the cluster only after the k3s-agent service is restarted.
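A minimal sketch of that manual recovery (the node name is a placeholder):
# Delete the NotReady node so its pods are rescheduled elsewhere...
kubectl delete node <node-name>
# ...then, once the worker host is reachable again, restart the agent so the
# node re-registers with the cluster.
ssh <node-name> 'sudo systemctl restart k3s-agent'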
I powered off a worker node (worker2) on my 3-node Raspberry Pi 4 cluster running Rook/Ceph some 3.5 hours ago, and my cluster still has not really recovered. Even if we overlook the WordPress failure (the new instance cannot bind to the PVC because Kubernetes still thinks there is a claim from the terminating instance on the powered-off node), the k3s-provisioned Traefik LB instance is still listed as Terminating and hanging there.
The things that have recovered are the ones (mostly Rook) that do not have a PVC, so even though their instances on the failed node are still listed as Terminating, that does not stop the new instances from coming up.
Am I missing something here regarding Kubernetes node failure?
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/helm-install-traefik-2zd8t 0/1 Completed 0 11d 10.42.0.3 master <none> <none>
kubernetes-dashboard pod/kubernetes-dashboard-544f4d6b8c-4bmbm 1/1 Running 1 2d 10.42.1.127 worker1 <none> <none>
kubernetes-dashboard pod/dashboard-metrics-scraper-744c77948-n2z5w 1/1 Running 1 2d 10.42.1.126 worker1 <none> <none>
kube-system pod/svclb-traefik-zq5sw 3/3 Running 30 11d 10.42.1.128 worker1 <none> <none>
cert-manager pod/cert-manager-5c47f46f57-ww4ql 1/1 Running 1 45h 10.42.0.114 master <none> <none>
kube-system pod/metrics-server-6d684c7b5-pgmtf 1/1 Running 9 11d 10.42.0.117 master <none> <none>
kube-system pod/local-path-provisioner-58fb86bdfd-xxkr9 1/1 Running 9 11d 10.42.0.118 master <none> <none>
kube-system pod/svclb-traefik-q6tx6 3/3 Running 27 11d 10.42.0.115 master <none> <none>
cert-manager pod/cert-manager-webhook-547567b88f-4nhx9 1/1 Running 1 45h 10.42.0.112 master <none> <none>
kube-system pod/coredns-d798c9dd-b5h2l 1/1 Running 9 11d 10.42.0.119 master <none> <none>
kube-system pod/traefik-65bccdc4bd-2qglj 1/1 Running 9 11d 10.42.0.116 master <none> <none>
rook-ceph pod/rook-discover-dthqw 1/1 Running 0 17h 10.42.0.120 master <none> <none>
rook-ceph pod/rook-discover-jb5gm 1/1 Running 0 17h 10.42.1.129 worker1 <none> <none>
rook-ceph pod/rook-ceph-agent-fhct7 1/1 Running 0 17h 192.168.10.107 worker1 <none> <none>
rook-ceph pod/rook-ceph-agent-wkl5s 1/1 Running 0 17h 192.168.10.102 master <none> <none>
rook-ceph pod/rook-ceph-mon-a-7987b7749c-dqhv9 1/1 Running 0 17h 10.42.1.132 worker1 <none> <none>
rook-ceph pod/rook-ceph-mon-c-59d7b8fb4d-7sqjj 1/1 Running 0 17h 10.42.0.122 master <none> <none>
rook-ceph pod/rook-ceph-crashcollector-worker1-6bbbbf6696-zxzqc 1/1 Running 0 17h 10.42.1.133 worker1 <none> <none>
rook-ceph pod/rook-ceph-crashcollector-master-8cf749cdc-zw6ph 1/1 Running 0 17h 10.42.0.123 master <none> <none>
rook-ceph pod/rook-ceph-osd-1-dbb578859-6rv64 1/1 Running 0 17h 10.42.1.135 worker1 <none> <none>
rook-ceph pod/rook-ceph-osd-2-6c7d9966cd-56ggs 1/1 Running 0 17h 10.42.0.125 master <none> <none>
rook-ceph pod/rook-ceph-tools-57d8bd875b-nzmdh 1/1 Running 0 17h 192.168.10.107 worker1 <none> <none>
default pod/adminer-69bcfb4764-bngsb 1/1 Running 0 15h 10.42.0.129 master <none> <none>
rook-cockroachdb-system pod/rook-cockroachdb-operator-784f89dcc5-hgzq7 1/1 Running 0 5h59m 10.42.0.130 master <none> <none>
default pod/mariadb-0 1/1 Running 0 4h4m 10.42.1.143 worker1 <none> <none>
kube-system pod/svclb-traefik-lxfjb 3/3 Running 21 10d 10.42.2.117 worker2 <none> <none>
rook-ceph pod/rook-discover-grnvm 1/1 Running 0 17h 10.42.2.119 worker2 <none> <none>
rook-ceph pod/rook-ceph-agent-5nz5d 1/1 Running 0 17h 192.168.10.95 worker2 <none> <none>
rook-ceph pod/rook-ceph-mgr-a-7f65b8f79f-kqzvw 1/1 Terminating 2 17h 10.42.2.122 worker2 <none> <none>
rook-ceph pod/rook-ceph-mgr-a-7f65b8f79f-p7vrh 1/1 Running 0 3h32m 10.42.1.144 worker1 <none> <none>
default pod/wordpress-6c7c6fcccf-8hsvc 1/1 Terminating 0 4h8m 10.42.2.134 worker2 <none> <none>
rook-ceph pod/rook-ceph-osd-0-6786789854-6qzd5 1/1 Terminating 0 17h 10.42.2.125 worker2 <none> <none>
rook-ceph pod/rook-ceph-mon-b-565bc66f97-64q84 1/1 Terminating 0 17h 10.42.2.121 worker2 <none> <none>
rook-ceph pod/rook-ceph-crashcollector-worker2-67895bf8df-f8cqr 1/1 Terminating 0 17h 10.42.2.126 worker2 <none> <none>
cert-manager pod/cert-manager-cainjector-6659d6844d-krnhk 1/1 Terminating 2 45h 10.42.2.116 worker2 <none> <none>
rook-ceph pod/rook-ceph-operator-6d794bf987-plntb 1/1 Terminating 0 17h 10.42.2.118 worker2 <none> <none>
rook-ceph pod/rook-ceph-mon-b-565bc66f97-gs8h5 0/1 Pending 0 3h27m <none> <none> <none> <none>
rook-ceph pod/rook-ceph-osd-0-6786789854-6v765 0/1 Pending 0 3h27m <none> <none> <none> <none>
default pod/wordpress-6c7c6fcccf-8mhdd 0/1 ContainerCreating 0 3h27m <none> worker1 <none> <none>
rook-ceph pod/rook-ceph-crashcollector-worker2-67895bf8df-5sksv 0/1 Pending 0 3h27m <none> <none> <none> <none>
rook-ceph pod/rook-ceph-operator-6d794bf987-bq6zm 1/1 Running 0 3h27m 10.42.0.133 master <none> <none>
cert-manager pod/cert-manager-cainjector-6659d6844d-7p7p5 1/1 Running 0 3h27m 10.42.1.145 worker1 <none> <none>
rook-ceph pod/rook-ceph-osd-prepare-master-bphzw 0/1 Completed 0 3h5m 10.42.0.135 master <none> <none>
rook-ceph pod/rook-ceph-osd-prepare-worker1-mdhmt 0/1 Completed 0 3h5m 10.42.1.146 worker1 <none> <none>
rook-ceph pod/rook-ceph-mon-d-canary-666965574c-62b2f 0/1 Pending 0 15m <none> <none> <none> <none>
NAMESPACE NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/worker2 NotReady <none> 10d v1.16.3-k3s.2 192.168.15.9 <none> Ubuntu 19.10 5.3.0-1014-raspi2 containerd://1.3.0-k3s.5
node/master Ready master 11d v1.16.3-k3s.2 192.168.15.10 <none> Ubuntu 19.10 5.3.0-1014-raspi2 containerd://1.3.0-k3s.5
node/worker1 Ready <none> 11d v1.16.3-k3s.2 192.168.15.11 <none> Ubuntu 19.10 5.3.0-1014-raspi2 containerd://1.3.0-k3s.5
So powering up the 'failed' node allowed all the 'terminating' instances to finally end; the Rook config sorted itself out, and my WordPress instance finally came back along with cert-manager, once the PVC (on WordPress) was finally released.
I learned that there's a difference between having a node in NotReady state and deleting the node. When a node goes into NotReady state, Kubernetes will not reschedule its running pods onto other nodes, because Kubernetes cannot distinguish between a node restart, a network error, or a kubelet error. Kubernetes will reschedule pods only when it's sure that they are not running, and just because a node is NotReady does not mean its pods are not running; they might still be running, and the fact that Kubernetes cannot reach the kubelet does not prove otherwise. It's really a bummer for me, as the pods still show 1/1 Running via kubectl. Although that's just my point of view, it's really weird that k3s on its own does not seem to support the --pod-eviction-timeout flag, which is 5 minutes by default.
The script that I published cordons the faulty nodes, drains them, and eventually deletes them; it uncordons a node once it's back in Ready state. K3s seems to rejoin the master only when it restarts, though.
Please see https://kubernetes.io/docs/concepts/architecture/nodes/, from that link:
In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the apiserver, and frees up their names.
So pods stuck in a Terminating state but with a duplicate running on another node look to be expected. The --pod-eviction-timeout flag should be able to be set like:
k3s server --kube-controller-manager-arg pod-eviction-timeout=1m
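For reference, a hedged way to pass the same flag at install time is through the install script's INSTALL_K3S_EXEC variable (the value here is illustrative):
# Illustrative only: set the controller-manager flag via the k3s install script.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --kube-controller-manager-arg pod-eviction-timeout=1m" sh -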
The key in the original issue is "Controller detected that all Nodes are not-Ready. Entering master disruption mode.", which looks to be related to https://github.com/kubernetes/kubernetes/issues/42733. If all of the nodes become NotReady, the controller manager may refuse to evict.
@erikwilson in my case none of the pods was in Terminating/Unknown state (it was the same when only one node was NotReady), and that issue was fixed? 🤔 Will set that --kube-controller-manager-arg pod-eviction-timeout=1m flag and see what happens.
It looks like the expected behavior, also see from that docs link:
The corner case is when all zones are completely unhealthy (i.e. there are no healthy nodes in the cluster). In such case, the node controller assumes that there’s some problem with master connectivity and stops all evictions until some connectivity is restored.
@erikwilson
OK, thanks. That would tie in with the last time I tried this, as it was on an earlier version of Kubernetes and I was not aware of that change. I also think I previously did this on a Rancher-managed cluster with some node-management options set, so I never hit the issue.
Hi. I’m experiencing the same issue and mitigated it with the following script in my launch template user data:
kubectl get nodes |
awk -v "host=$(hostname)" '$1 != host && $2 == "NotReady" { print $1 }' |
xargs --no-run-if-empty kubectl delete node
So when one node goes down, the autoscaling group creates a new instance that will run the above script when booting.
I advise you to triple check that hostname returns the correct hostname for your nodes, otherwise you risk deleting the current node...
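A quick hedged sanity check (not from the original comment) before relying on that filter:
# Verify the kubelet registered this host under the same name that `hostname`
# returns; if this errors, the delete filter above is unsafe on this node.
kubectl get node "$(hostname)" -o name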
Node draining was not working either; it just got stuck forever since the target node was dead. So much for HA!
We're hitting this issue consistently as well, even when trying to drain the node (which is NotReady and disabled for scheduling):
NAME    STATUS                        ROLES    AGE   VERSION
node1   NotReady,SchedulingDisabled   master   43m   v1.17.3+k3s1
node2   Ready                         master   43m   v1.17.3+k3s1
The pods from node1 stay in "Terminating" state forever, until the node comes back up.
This is a real problem for us: we have one DaemonSet (RabbitMQ) and its pod doesn't terminate or get deleted, which causes other services to keep trying to connect to it, which in turn prevents those pods from coming up correctly.
I noticed the same thing; I had to drain the nodes and force-delete the pods to get rid of them.
Same issue here, with only master nodes running rancher-server. Pods are stuck in Running even though their nodes have been NotReady for more than 15 minutes.
Does this happen when using 3+ nodes?
I've only tested this on 2 or 3 nodes, and it happens for both setups.
Does it happen with HA when using 3 master nodes and taking 1 of the nodes down? What type of database are you using?
I was using a Postgres DB as the backend when 1 node was taken down. My main use case is a 2-node k3s cluster, and it's very easy to see this there.
I don't think Kubernetes supports 2-node clusters with 1 node taken down very well, as cited in the messages above.
Also having this issue with nodes not going away after they've been replaced:
ip-10-12-82-234 NotReady <none> 12d v1.17.9+k3s1
ip-10-12-65-201 NotReady <none> 15d v1.17.9+k3s1
ip-10-12-90-123 NotReady <none> 15d v1.17.9+k3s1
ip-10-12-48-200 NotReady <none> 12d v1.17.9+k3s1
ip-10-12-78-179 NotReady <none> 12d v1.17.9+k3s1
ip-10-12-52-75 NotReady <none> 15d v1.17.9+k3s1
ip-10-12-67-220 NotReady master 29d v1.17.9+k3s1
ip-10-12-81-212 NotReady master 29d v1.17.9+k3s1
ip-10-12-55-185 NotReady master 14d v1.17.9+k3s1
ip-10-12-83-151 NotReady master 7d3h v1.17.9+k3s1
ip-10-12-49-50 NotReady master 7d3h v1.17.9+k3s1
ip-10-12-48-195 NotReady <none> 5d1h v1.17.9+k3s1
ip-10-12-68-212 NotReady <none> 5d1h v1.17.9+k3s1
ip-10-12-94-45 NotReady <none> 5d1h v1.17.9+k3s1
ip-10-12-95-46 NotReady master 3h10m v1.17.9+k3s1
ip-10-12-56-63 NotReady master 4h1m v1.17.9+k3s1
ip-10-12-79-230 NotReady master 4h13m v1.17.9+k3s1
ip-10-12-79-118 NotReady <none> 3h17m v1.17.9+k3s1
ip-10-12-88-104 NotReady <none> 3h17m v1.17.9+k3s1
ip-10-12-53-206 NotReady <none> 3h17m v1.17.9+k3s1
ip-10-12-90-16 NotReady <none> 3h1m v1.17.9+k3s1
ip-10-12-54-163 NotReady master 3h10m v1.17.9+k3s1
ip-10-12-53-78 NotReady <none> 3h1m v1.17.9+k3s1
ip-10-12-71-230 NotReady master 3h10m v1.17.9+k3s1
ip-10-12-86-199 NotReady master 4h7m v1.17.9+k3s1
ip-10-12-79-37 NotReady <none> 3h1m v1.17.9+k3s1
ip-10-12-91-161 NotReady master 146m v1.17.4+k3s1
ip-10-12-68-68 NotReady master 146m v1.17.4+k3s1
ip-10-12-57-50 NotReady master 146m v1.17.4+k3s1
ip-10-12-52-91 NotReady <none> 147m v1.17.4+k3s1
ip-10-12-84-159 NotReady <none> 146m v1.17.4+k3s1
ip-10-12-73-9 NotReady <none> 146m v1.17.4+k3s1
ip-10-12-49-200 Ready master 29m v1.17.9+k3s1
ip-10-12-70-140 Ready <none> 27m v1.17.9+k3s1
ip-10-12-84-215 Ready <none> 27m v1.17.9+k3s1
ip-10-12-55-103 Ready <none> 27m v1.17.9+k3s1
ip-10-12-83-6 Ready master 27m v1.17.9+k3s1
ip-10-12-76-62 Ready master 27m v1.17.9+k3s1
@rogersd k3s does not delete nodes on its own. It has no way of knowing if the nodes are just temporarily offline, or if they are gone forever.
If you install an out-of-tree cloud provider (such as https://github.com/kubernetes/cloud-provider-aws) it has the necessary hooks to talk to your cloud provider API, and delete nodes that have been terminated. You could also just script this manually using the Kubernetes API or kubectl, deleting nodes that have been offline (NotReady) for a period of time.
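For anyone who wants the scripted route, here is a minimal hedged sketch under assumed conditions (kubectl configured, jq and GNU date available, a 10-minute threshold) that deletes nodes whose Ready condition has not been True for longer than the threshold:
#!/bin/bash
# Hedged sketch: delete nodes that have been NotReady longer than THRESHOLD_SECONDS.
THRESHOLD_SECONDS=600
NOW=$(date +%s)
kubectl get nodes -o json | jq -r '
  .items[]
  | .metadata.name as $name
  | .status.conditions[]
  | select(.type == "Ready" and .status != "True")
  | [$name, .lastTransitionTime] | @tsv' |
while IFS=$'\t' read -r node since; do
  age=$(( NOW - $(date -d "$since" +%s) ))
  if [ "$age" -gt "$THRESHOLD_SECONDS" ]; then
    echo "Node $node has been NotReady for ${age}s, deleting..."
    kubectl delete node "$node"
  fi
done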
Does it happen with HA when using 3 master nodes and taking 1 of the nodes down? What type of database are you using?
@erikwilson Same here with 3 ODroid H2 nodes and etcd with the latest k3s version.
@brandond I'm kind of late to the party here, sorry. I'm confused by your comment. Are you talking about STONITH or some variation of that? Using a cloud provider API doesn't work if you do this on actual bare-metal nodes.
Shouldn't 2 out of 3 nodes suffice to establish quorum? It doesn't matter if k8s doesn't know what's up with the misbehaving node, for all intents and purposes it's dead and it should act accordingly. It doesn't, and the question is "Why and how can I make it work?"
This hasn't been answered so far. If I got things wrong, please explain.
@fuero I was specifically replying to the comment about Kubernetes not deleting EC2 nodes that no longer exist. If you are autoscaling or otherwise dynamically provisioning cluster nodes, you need some mechanism to remove terminated nodes from the cluster.
With regards to the node being 'dead' but not gone, and how pods previously running on it are handled, there are tunable timeouts in the core Kubernetes code that you can alter via CLI flags to change how long a node can be NotReady before pods on it will be rescheduled onto a different node.
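A hedged sketch of what that tuning can look like on a k3s server; the specific values below are illustrative, not recommendations:
# Shorten how quickly a node is marked NotReady and how long pods tolerate
# an unreachable node before taint-based eviction kicks in.
k3s server \
  --kube-controller-manager-arg node-monitor-grace-period=16s \
  --kube-apiserver-arg default-not-ready-toleration-seconds=20 \
  --kube-apiserver-arg default-unreachable-toleration-seconds=20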
I'm still experiencing this issue.
k3s version: v1.19.7+k3s1
I applied the pod eviction timeout (https://github.com/k3s-io/k3s/issues/1264#issuecomment-571237831)
--kube-controller-manager-arg pod-eviction-timeout=10s
When I shut down a node, nothing happens for 5 minutes; the pods on the powered-off node are still in Running state. After 5 minutes, the pods on the powered-off node go into a Terminating state forever, until I boot the node back up.
I suspect my 10-second eviction timeout is not being taken into account... and the 5-minute default is what applies in my case (https://github.com/k3s-io/k3s/issues/1264#issuecomment-571225390).
After the pod eviction timeout, shouldn't my pods be rescheduled to another node? Because in that case it's not HA at all...
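One hedged way to check whether the flag actually reached the embedded controller-manager (assuming k3s runs as the k3s systemd unit, which logs the component arguments at startup):
# Look for the flag in the k3s service logs.
sudo journalctl -u k3s | grep -i "pod-eviction-timeout"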
any updates? @kamilgregorczyk @erikwilson
Hey @brandond @erikwilson
I'm able to reproduce this consistently in v1.20.4+k3s1
Start k3s with the following flags (any number of nodes)
"--kubelet-arg 'node-status-update-frequency=4s'",
"--kube-controller-manager-arg 'node-monitor-period=2s'",
"--kube-controller-manager-arg 'node-monitor-grace-period=16s'",
"--kube-apiserver-arg 'default-not-ready-toleration-seconds=20'",
"--kube-apiserver-arg 'default-unreachable-toleration-seconds=20'"
Power off a node; it is marked as NotReady, as expected.
Wait for pods on that node to be rescheduled. This does not happen: pods stay in Running state indefinitely.
Tested v1.21.1+k3s1 and it works as expected.
For anyone coming across this, please note that pod-eviction-timeout is not used post 1.13.
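Since eviction is taint-driven on these versions, a hedged way to see which timeouts actually apply to a given pod (the pod name is a placeholder) is to inspect the tolerations the apiserver injected:
# The tolerationSeconds on node.kubernetes.io/not-ready and
# node.kubernetes.io/unreachable control how long the pod survives on a
# NotReady node (300s by default unless overridden).
kubectl get pod <pod-name> -o json | jq '.spec.tolerations'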
I am seeing this issue when using kube-vip in a DaemonSet; more information about my issue is here.
k3s version: v1.21.4+k3s1
Ubuntu version: 21.04
My masters config:
cluster-init: true
cluster-cidr: 10.69.0.0/16
disable:
- flannel
- traefik
- servicelb
- metrics-server
- local-storage
disable-cloud-controller: true
disable-network-policy: true
docker: false
flannel-backend: none
kubelet-arg:
- "feature-gates=GracefulNodeShutdown=true"
- "feature-gates=MixedProtocolLBService=true"
node-ip: 192.168.42.10
service-cidr: 10.96.0.0/16
tls-san:
- 192.168.69.5
write-kubeconfig-mode: '644'
kube-controller-manager-arg:
- "address=0.0.0.0"
- "bind-address=0.0.0.0"
kube-proxy-arg:
- "metrics-bind-address=0.0.0.0"
kube-scheduler-arg:
- "address=0.0.0.0"
- "bind-address=0.0.0.0"
etcd-expose-metrics: true
My worker nodes:
kubelet-arg:
- "feature-gates=GracefulNodeShutdown=true"
- "feature-gates=MixedProtocolLBService=true"
node-ip: 192.168.42.13
I can see the taints were added to my k8s-0 node, but the pods are not being evicted:
ubuntu@k8s-1:~$ sudo k3s kubectl get ds/kube-vip -n kube-system -o yaml
...
taints:
- effect: NoSchedule
key: node.kubernetes.io/unreachable
timeAdded: "2021-08-23T13:48:30Z"
- effect: NoExecute
key: node.kubernetes.io/unreachable
timeAdded: "2021-08-23T13:48:36Z"
...
ubuntu@k8s-1:~$ sudo k3s kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-0 NotReady control-plane,etcd,master 65d v1.21.4+k3s1
k8s-1 Ready control-plane,etcd,master 65d v1.21.4+k3s1
k8s-2 Ready control-plane,etcd,master 65d v1.21.4+k3s1
k8s-3 Ready worker 65d v1.21.4+k3s1
k8s-4 Ready worker 65d v1.21.4+k3s1
k8s-5 Ready worker 65d v1.21.4+k3s1
ubuntu@k8s-1:~$ sudo k3s kubectl get po -n kube-system -l "app.kubernetes.io/instance=kube-vip" -o wide
kube-vip-jk96t 1/1 Running 4 30d 192.168.42.12 k8s-2 <none> <none>
kube-vip-kdg8x 1/1 Running 4 30d 192.168.42.11 k8s-1 <none> <none>
kube-vip-r9vhx 1/1 Running 5 30d 192.168.42.10 k8s-0 <none> <none>
What's managing those pods? DaemonSet/Deployment/etc.? Whatever's going on here is core Kubernetes behavior; I suspect it's just not doing what you expected.
Hi, is there a workaround to use pod-eviction-timeout on k3s 1.21.4?
@jawabuu Is there any document I can refer to about the arguments mentioned in your notes?
Hey @brandond @erikwilson I'm able to reproduce this consistently in v1.20.4+k3s1. Start k3s with the following flags (any number of nodes):
"--kubelet-arg 'node-status-update-frequency=4s'",
"--kube-controller-manager-arg 'node-monitor-period=2s'",
"--kube-controller-manager-arg 'node-monitor-grace-period=16s'",
"--kube-apiserver-arg 'default-not-ready-toleration-seconds=20'",
"--kube-apiserver-arg 'default-unreachable-toleration-seconds=20'"
Power off a node; it is marked as NotReady, as expected. Wait for pods on that node to be rescheduled. This does not happen: pods stay in Running state indefinitely. Tested v1.21.1+k3s1 and it works as expected. For anyone coming across this, please note that pod-eviction-timeout is not used post 1.13.
Any updates? I am experiencing the same behavior.
This would be the responsibility of the Kubernetes controller-manager. Can you show the output of kubectl get node,lease -n kube-system -o wide?
Hi, may I check whether there is any solution for this problem? I am using v1.21.4 and I also see the problem.
NAME STATUS ROLES AGE VERSION
k3-slave3 Ready <none> 118d v1.21.5+k3s2
k3s-slave2 Ready <none> 139d v1.21.4+k3s1
k3s-slave4 Ready <none> 6d23h v1.22.6+k3s1
k3-master Ready control-plane,master 139d v1.21.4+k3s1
k3s-slave1 Ready <none> 139d v1.21.4+k3s1
@cwalterhk you appear to have an agent that is running a newer version of Kubernetes than the server. This is not supported; please upgrade your servers if you are going to have agents running 1.22
I just created a new cluster using the latest version. However, I still see the same problem. Even when s1 is not available, the pods do not restart on other nodes.
walter@k3s-m1-mark3:~$ k get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3s-s1-mark3 NotReady <none> 169m v1.22.6+k3s1 192.168.1.91 <none> Ubuntu 20.04.3 LTS 5.4.0-99-generic containerd://1.5.9-k3s1
k3s-m1-mark3 Ready control-plane,master 171m v1.22.6+k3s1 192.168.1.90 <none> Ubuntu 20.04.3 LTS 5.4.0-99-generic containerd://1.5.9-k3s1
k3s-s2-mark3 Ready <none> 169m v1.22.6+k3s1 192.168.1.92 <none> Ubuntu 20.04.3 LTS 5.4.0-99-generic containerd://1.5.9-k3s1
walter@k3s-m1-mark3:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-world-7884c6997d-h9nwx 1/1 Running 0 6m42s 10.42.0.10 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-vxlsl 1/1 Running 0 6m42s 10.42.0.9 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-gbx8x 1/1 Running 0 6m42s 10.42.0.11 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-cdfp8 1/1 Running 0 6m42s 10.42.2.6 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-2ksws 1/1 Running 0 6m42s 10.42.2.4 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-2gflm 1/1 Running 0 6m42s 10.42.2.5 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-hsnct 1/1 Running 0 6m42s 10.42.1.6 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-5xhf7 1/1 Running 0 6m42s 10.42.1.4 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-gzbvq 1/1 Running 0 6m42s 10.42.1.5 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-gh5qc 1/1 Running 0 6m42s 10.42.1.3 k3s-s1-mark3 <none> <none>
walter@k3s-m1-mark3:~$
After waiting for about 8 minutes, they are terminating. Thank you very much. Can I check how to detect failure faster and restart the pods on other nodes?
walter@k3s-m1-mark3:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-world-7884c6997d-h9nwx 1/1 Running 0 10m 10.42.0.10 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-vxlsl 1/1 Running 0 10m 10.42.0.9 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-gbx8x 1/1 Running 0 10m 10.42.0.11 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-cdfp8 1/1 Running 0 10m 10.42.2.6 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-2ksws 1/1 Running 0 10m 10.42.2.4 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-2gflm 1/1 Running 0 10m 10.42.2.5 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-gzbvq 1/1 Terminating 0 10m 10.42.1.5 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-5xhf7 1/1 Terminating 0 10m 10.42.1.4 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-gh5qc 1/1 Terminating 0 10m 10.42.1.3 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-hsnct 1/1 Terminating 0 10m 10.42.1.6 k3s-s1-mark3 <none> <none>
hello-world-7884c6997d-wqzfq 1/1 Running 0 2m47s 10.42.0.12 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-6bsx4 1/1 Running 0 2m47s 10.42.0.13 k3s-m1-mark3 <none> <none>
hello-world-7884c6997d-njgdd 1/1 Running 0 2m47s 10.42.2.8 k3s-s2-mark3 <none> <none>
hello-world-7884c6997d-9w8vh 1/1 Running 0 2m47s 10.42.2.7 k3s-s2-mark3 <none> <none>
walter@k3s-m1-mark3:~$
With these options I was able to reduce your mentioned 8 minutes to ~20 seconds:
--kubelet-arg "node-status-update-frequency=4s" \
--kube-controller-manager-arg "node-monitor-period=4s" \
--kube-controller-manager-arg "node-monitor-grace-period=16s" \
--kube-controller-manager-arg "pod-eviction-timeout=20s" \
--kube-apiserver-arg "default-not-ready-toleration-seconds=20" \
--kube-apiserver-arg "default-unreachable-toleration-seconds=20" \
With these options I was able to reduce your mentioned 8 minutes to ~20 seconds:
And where did you put these parameters? On the master node(s)? Or on the workers as well?
same problem here:
k3s-agent-large-ilg          Ready   17m    v1.23.8+k3s2
k3s-agent-large-kmf          Ready   6d8h   v1.23.8+k3s2
k3s-agent-small-uui          Ready   32m    v1.23.8+k3s2
k3s-control-plane-fsn1-dke   Ready   control-plane,etcd,master   6d8h   v1.23.8+k3s2
With these options I was able to reduce your mentioned 8 minutes to ~20 seconds:
--kubelet-arg "node-status-update-frequency=4s" \ --kube-controller-manager-arg "node-monitor-period=4s" \ --kube-controller-manager-arg "node-monitor-grace-period=16s" \ --kube-controller-manager-arg "pod-eviction-timeout=20s" \ --kube-apiserver-arg "default-not-ready-toleration-seconds=20" \ --kube-apiserver-arg "default-unreachable-toleration-seconds=20" \
I had the same problem. After I added those to the systemd service, the settings are applied to every new pod, so I had to terminate the old ones by hand, and it worked like a charm on the new ones. My k3s version is v1.24.3+k3s1.
--kubelet-arg
Hey, where exactly did you pass these arguments?
to the ExecStart in the systemd service: ExecStart=/usr/local/bin/k3s server --https-listen-port '7443' '--kubelet-arg' "node-status-update-frequency=4s" etc
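For anyone else wondering, a hedged sketch of doing the same thing with a systemd drop-in instead of editing the packaged unit (paths and values are illustrative):
# Create a drop-in that overrides ExecStart with the extra flags, then restart.
sudo mkdir -p /etc/systemd/system/k3s.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/k3s.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/local/bin/k3s server \
    --kubelet-arg node-status-update-frequency=4s \
    --kube-controller-manager-arg node-monitor-period=4s \
    --kube-controller-manager-arg node-monitor-grace-period=16s \
    --kube-apiserver-arg default-not-ready-toleration-seconds=20 \
    --kube-apiserver-arg default-unreachable-toleration-seconds=20
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s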
Closing as this appears to be expected upstream behavior with a valid workaround
With these options I was able to reduce your mentioned 8 minutes to ~20 seconds:
--kubelet-arg "node-status-update-frequency=4s" \ --kube-controller-manager-arg "node-monitor-period=4s" \ --kube-controller-manager-arg "node-monitor-grace-period=16s" \ --kube-controller-manager-arg "pod-eviction-timeout=20s" \ --kube-apiserver-arg "default-not-ready-toleration-seconds=20" \ --kube-apiserver-arg "default-unreachable-toleration-seconds=20" \
Unfortunately, many of these parameters are removed from Kubernetes v1.27. See for example the node-status-update-frequency
argument on the official Kubernetes docs.
They have not been removed. They've been listed as deprecated for ages, but I am not aware of any actual work to remove them and force use of a config file.
I have a deployment with a PVC attached in ReadWriteOnce mode. To test this, I turned off the k3s service on one of the nodes; after waiting some time the pods did go into Terminating state, but now the deployment with the PVC won't start because the volume is still attached to the older pod.
Is it possible to delete or evict the pods instead of them being stuck in the Terminating stage?
@anshuman852 I think this "Terminating" state means Kubernetes tries to perform the eviction or delete but is not able to, either because the kubelet is not responding or because there is a finalizer on the pod. You can try checking the pod manifests and the kube-controller-manager logs to see what is happening and what the issue is.
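A hedged workaround for the stuck ReadWriteOnce volume described above (pod and namespace names are placeholders); force-removing the pod and clearing the stale attachment lets the volume move, at the cost of bypassing the normal safety checks:
# Force-remove the pod stuck in Terminating on the dead node.
kubectl delete pod <pod> -n <namespace> --grace-period=0 --force
# Check whether a VolumeAttachment still pins the volume to the dead node;
# deleting that attachment (or the node object itself) frees the volume so it
# can be attached on the node where the replacement pod is scheduled.
kubectl get volumeattachments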
Version: k3s version v1.0.0 (18bd921c)
Describe the bug: I have a cluster that consists of 1 master and 3 workers. After I unplugged the 3 workers, none of the running pods were reassigned from the workers to the master, and kubectl claims that they are alive:
I believe that self-healing should kick in and run all those pods on the master. I plugged one worker back in, and the pods from the two other workers were not assigned to it.
journalctl from last 20 minutes: