k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

K3S Claims that pods are running but hosts (nodes) are dead #1264

Closed kamilgregorczyk closed 1 year ago

kamilgregorczyk commented 4 years ago

Version: k3s version v1.0.0 (18bd921c)

Describe the bug
I have a cluster that consists of 1 master and 3 workers. After I unplugged the 3 workers, none of the running pods were reassigned from the dead workers to the master, and kubectl claims the pods are still alive:

➜  ~ kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
worker2   NotReady   node     15d   v1.16.3-k3s.2
worker1   NotReady   node     15d   v1.16.3-k3s.2
worker3   NotReady   node     15d   v1.16.3-k3s.2
master    Ready      master   16d   v1.16.3-k3s.2
➜  ~ kubectl get pods --all-namespaces -o wide
NAMESPACE              NAME                                                   READY   STATUS    RESTARTS   AGE    IP              NODE      NOMINATED NODE   READINESS GATES
kube-system            metrics-server-6d684c7b5-8fzld                         1/1     Running   29         16d    10.42.0.139     master    <none>           <none>
metallb-system         speaker-lv7cq                                          1/1     Running   7          3d2h   192.168.0.201   master    <none>           <none>
default                nginx-1-775985c86-4q5xq                                1/1     Running   18         5d7h   10.42.0.142     master    <none>           <none>
kube-system            coredns-d798c9dd-f2wrb                                 1/1     Running   28         16d    10.42.0.140     master    <none>           <none>
kube-system            local-path-provisioner-58fb86bdfd-8sbzq                1/1     Running   4          32h    10.42.0.141     master    <none>           <none>
kubernetes-dashboard   kubernetes-dashboard-5996555fd8-k684f                  1/1     Running   23         15d    10.42.2.59      worker2   <none>           <none>
metallb-system         speaker-hdq7h                                          1/1     Running   5          3d2h   192.168.0.203   worker2   <none>           <none>
kube-system            nginx-nginx-ingress-controller-595c6b856c-m6997        1/1     Running   3          31h    10.42.2.58      worker2   <none>           <none>
kube-system            nginx-nginx-ingress-default-backend-6595d9d88b-vff2c   1/1     Running   2          30h    10.42.1.59      worker1   <none>           <none>
metallb-system         speaker-54h22                                          1/1     Running   5          3d2h   192.168.0.202   worker1   <none>           <none>
metallb-system         controller-57967b9448-mjgcb                            1/1     Running   5          3d2h   10.42.1.60      worker1   <none>           <none>
kubernetes-dashboard   dashboard-metrics-scraper-76585494d8-bzccd             1/1     Running   31         15d    10.42.1.58      worker1   <none>           <none>
metallb-system         speaker-grzfq                                          1/1     Running   6          3d2h   192.168.0.204   worker3   <none>           <none>

I believe self-healing should kick in and run all those pods on the master. I also plugged one worker back in, and the pods from the two other workers were not reassigned to it.

journalctl output from the last 20 minutes:

pi@master:~ $ sudo journalctl -u k3s --since "20 minutes ago"
-- Logs begin at Wed 2020-01-01 22:17:01 CET, end at Thu 2020-01-02 21:30:46 CET. --
Jan 02 21:11:02 master k3s[550]: time="2020-01-02T21:11:02.232431780+01:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.43.0.1:10.43
Jan 02 21:11:02 master k3s[550]: E0102 21:11:02.273803     550 controller.go:117] error syncing 'kube-system/k3s-serving': handler tls-storage: Secret "k3s-serving" is invalid: meta
Jan 02 21:11:07 master k3s[550]: I0102 21:11:07.220756     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:12:07 master k3s[550]: I0102 21:12:07.249238     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:12:24 master k3s[550]: time="2020-01-02T21:12:24.244186839+01:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.43.0.1:10.43
Jan 02 21:12:24 master k3s[550]: E0102 21:12:24.293070     550 controller.go:117] error syncing 'kube-system/k3s-serving': handler tls-storage: Secret "k3s-serving" is invalid: meta
Jan 02 21:13:07 master k3s[550]: I0102 21:13:07.278093     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:14:07 master k3s[550]: I0102 21:14:07.292650     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:14:14 master k3s[550]: I0102 21:14:14.490492     550 controller.go:606] quota admission added evaluator for: replicasets.apps
Jan 02 21:14:14 master k3s[550]: I0102 21:14:14.505901     550 event.go:274] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"default", Name:"nginx-1", UID:"9f62d494-264e-43af
Jan 02 21:14:14 master k3s[550]: I0102 21:14:14.557512     550 event.go:274] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"default", Name:"nginx-1-775985c86", UID:"4984c18f
Jan 02 21:14:14 master k3s[550]: I0102 21:14:14.570165     550 event.go:274] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"default", Name:"nginx-1-775985c86", UID:"4984c18f
Jan 02 21:14:14 master k3s[550]: I0102 21:14:14.662089     550 event.go:274] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"default", Name:"nginx-1", UID:"55f3f3a9-ba7c-44de-
Jan 02 21:14:42 master k3s[550]: E0102 21:14:42.365014     550 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or d
Jan 02 21:15:07 master k3s[550]: I0102 21:15:07.319919     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:15:08 master k3s[550]: time="2020-01-02T21:15:08.153599154+01:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.43.0.1:10.43
Jan 02 21:15:08 master k3s[550]: E0102 21:15:08.180379     550 controller.go:117] error syncing 'kube-system/k3s-serving': handler tls-storage: Secret "k3s-serving" is invalid: meta
Jan 02 21:16:07 master k3s[550]: I0102 21:16:07.334593     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:17:04 master k3s[550]: time="2020-01-02T21:17:04.958137785+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.203:49916: i/o
Jan 02 21:17:07 master k3s[550]: I0102 21:17:07.362715     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:17:29 master k3s[550]: time="2020-01-02T21:17:29.298245766+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.202:35906: i/o
Jan 02 21:17:30 master k3s[550]: time="2020-01-02T21:17:30.580422869+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.201:59938: i/o
Jan 02 21:17:30 master k3s[550]: time="2020-01-02T21:17:30.580988438+01:00" level=error msg="Remotedialer proxy error" error="read tcp 192.168.0.201:59938->192.168.0.201:6443: i/o t
Jan 02 21:17:31 master k3s[550]: time="2020-01-02T21:17:31.605583377+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.204:52256: i/o
Jan 02 21:17:34 master k3s[550]: I0102 21:17:34.106828     550 event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"worker2", UID:"b2288a80-db5b-43be-9c00-4330be9
Jan 02 21:17:35 master k3s[550]: time="2020-01-02T21:17:35.587707038+01:00" level=info msg="Connecting to proxy" url="wss://192.168.0.201:6443/v1-k3s/connect"
Jan 02 21:17:37 master k3s[550]: E0102 21:17:37.476561     550 pod_workers.go:191] Error syncing pod ed9ae66f-71eb-49b0-b0d3-05dedd447d5f ("local-path-provisioner-58fb86bdfd-8sbzq_k
Jan 02 21:17:38 master k3s[550]: time="2020-01-02T21:17:38.200829872+01:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.0.201:6443: connect: no route to hos
Jan 02 21:17:38 master k3s[550]: time="2020-01-02T21:17:38.200932464+01:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.0.201:6443: connect: no route to host"
Jan 02 21:17:39 master k3s[550]: E0102 21:17:39.318397     550 resource_quota_controller.go:407] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the ser
Jan 02 21:17:39 master k3s[550]: E0102 21:17:39.823064     550 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.233.67
Jan 02 21:17:42 master k3s[550]: W0102 21:17:42.211175     550 garbagecollector.go:640] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to
Jan 02 21:17:43 master k3s[550]: time="2020-01-02T21:17:43.201207406+01:00" level=info msg="Connecting to proxy" url="wss://192.168.0.201:6443/v1-k3s/connect"
Jan 02 21:17:44 master k3s[550]: time="2020-01-02T21:17:44.521597124+01:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.0.201:6443: connect: no route to hos
Jan 02 21:17:44 master k3s[550]: time="2020-01-02T21:17:44.521784474+01:00" level=error msg="Remotedialer proxy error" error="dial tcp 192.168.0.201:6443: connect: no route to host"
Jan 02 21:17:44 master k3s[550]: E0102 21:17:44.827117     550 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.233.67
Jan 02 21:17:45 master k3s[550]: E0102 21:17:45.545554     550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Jan 02 21:17:47 master k3s[550]: E0102 21:17:47.649744     550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Jan 02 21:17:47 master k3s[550]: time="2020-01-02T21:17:47.843063318+01:00" level=info msg="Handling backend connection request [worker2]"
Jan 02 21:17:49 master k3s[550]: I0102 21:17:49.252599     550 event.go:274] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"nginx-nginx-ingress-controller-595c6
Jan 02 21:17:49 master k3s[550]: I0102 21:17:49.252704     550 event.go:274] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"helm-install-traefik-7k7qv", UID:"",
Jan 02 21:17:49 master k3s[550]: I0102 21:17:49.252742     550 event.go:274] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kubernetes-dashboard", Name:"kubernetes-dashboard-599655
Jan 02 21:17:49 master k3s[550]: I0102 21:17:49.252776     550 event.go:274] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"helm-install-traefik-vgn7b", UID:"",
Jan 02 21:17:49 master k3s[550]: time="2020-01-02T21:17:49.522014337+01:00" level=info msg="Connecting to proxy" url="wss://192.168.0.201:6443/v1-k3s/connect"
Jan 02 21:17:49 master k3s[550]: time="2020-01-02T21:17:49.557293978+01:00" level=info msg="Handling backend connection request [master]"
Jan 02 21:17:49 master k3s[550]: time="2020-01-02T21:17:49.767827024+01:00" level=info msg="Handling backend connection request [worker1]"
Jan 02 21:17:52 master k3s[550]: time="2020-01-02T21:17:52.492071111+01:00" level=info msg="Handling backend connection request [worker3]"
Jan 02 21:17:56 master k3s[550]: E0102 21:17:56.271504     550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Jan 02 21:17:58 master k3s[550]: E0102 21:17:58.666991     550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Jan 02 21:18:02 master k3s[550]: time="2020-01-02T21:18:02.494515040+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.204:52364: i/o
Jan 02 21:18:04 master k3s[550]: time="2020-01-02T21:18:04.771525434+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.202:36488: i/o
Jan 02 21:18:07 master k3s[550]: I0102 21:18:07.390554     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:18:07 master k3s[550]: time="2020-01-02T21:18:07.844839055+01:00" level=info msg="error in remotedialer server [400]: read tcp 192.168.0.201:6443->192.168.0.203:51336: i/o
Jan 02 21:18:39 master k3s[550]: I0102 21:18:39.293129     550 event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"worker2", UID:"b2288a80-db5b-43be-9c00-4330be9
Jan 02 21:18:39 master k3s[550]: I0102 21:18:39.442367     550 event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"worker1", UID:"f4226dde-8c79-473b-88c3-9d65ffa
Jan 02 21:18:39 master k3s[550]: I0102 21:18:39.782884     550 event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"worker3", UID:"23d267c0-0917-4852-82da-830b9e9
Jan 02 21:18:39 master k3s[550]: I0102 21:18:39.927947     550 node_lifecycle_controller.go:1058] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
Jan 02 21:18:40 master k3s[550]: E0102 21:18:40.072909     550 daemon_controller.go:302] metallb-system/speaker failed with : error storing status for daemon set &v1.DaemonSet{TypeM
Jan 02 21:18:40 master k3s[550]: E0102 21:18:40.156328     550 daemon_controller.go:302] metallb-system/speaker failed with : error storing status for daemon set &v1.DaemonSet{TypeM
Jan 02 21:19:07 master k3s[550]: I0102 21:19:07.418583     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:19:42 master k3s[550]: E0102 21:19:42.370539     550 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or d
Jan 02 21:20:07 master k3s[550]: I0102 21:20:07.434826     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:20:35 master k3s[550]: time="2020-01-02T21:20:35.907270451+01:00" level=info msg="Updating TLS secret for k3s-serving (count: 8): map[listener.cattle.io/cn-10.43.0.1:10.43
Jan 02 21:20:35 master k3s[550]: E0102 21:20:35.950947     550 controller.go:117] error syncing 'kube-system/k3s-serving': handler tls-storage: Secret "k3s-serving" is invalid: meta
Jan 02 21:21:07 master k3s[550]: I0102 21:21:07.463278     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:22:07 master k3s[550]: I0102 21:22:07.512476     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:23:07 master k3s[550]: I0102 21:23:07.549374     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:24:07 master k3s[550]: I0102 21:24:07.564454     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:24:42 master k3s[550]: E0102 21:24:42.365024     550 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or d
Jan 02 21:25:07 master k3s[550]: I0102 21:25:07.579445     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:26:07 master k3s[550]: I0102 21:26:07.594475     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:27:07 master k3s[550]: I0102 21:27:07.623520     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:27:30 master k3s[550]: I0102 21:27:30.102377     550 node_lifecycle_controller.go:1085] Controller detected that some Nodes are Ready. Exiting master disruption mode.
Jan 02 21:27:30 master k3s[550]: time="2020-01-02T21:27:30.185484725+01:00" level=info msg="Handling backend connection request [worker2]"
Jan 02 21:28:07 master k3s[550]: I0102 21:28:07.664740     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:29:07 master k3s[550]: I0102 21:29:07.699334     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Jan 02 21:29:42 master k3s[550]: E0102 21:29:42.365156     550 machine.go:288] failed to get cache information for node 0: open /sys/devices/system/cpu/cpu0/cache: no such file or d
Jan 02 21:30:07 master k3s[550]: I0102 21:30:07.713346     550 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
kamilgregorczyk commented 4 years ago

I restarted everything and booted only the master, with the same result.

kamilgregorczyk commented 4 years ago

I waited 30 minutes but nothing happened. I managed to drain the nodes manually with this script:

#!/bin/bash

KUBECTL="/usr/local/bin/kubectl"

# Drain every node currently reported as NotReady.
NOT_READY_NODES=$($KUBECTL get nodes | grep 'NotReady' | awk '{print $1}')

while IFS= read -r line; do
    if [[ ! $line =~ [^[:space:]] ]] ; then
        continue
    fi
    echo "Found $line node to be dead, draining..."
    $KUBECTL drain --ignore-daemonsets --force $line
done <<< "$NOT_READY_NODES"

# Uncordon nodes that have come back Ready but are still cordoned from the earlier drain.
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}')

while IFS= read -r line; do
    if [[ ! $line =~ [^[:space:]] ]] ; then
        continue
    fi
    echo "Found $line node to be online again, undraining..."
    $KUBECTL uncordon $line
done <<< "$READY_NODES"

This script should never be needed, though; the whole point of Kubernetes is the ability to self-heal.

kamilgregorczyk commented 4 years ago

I found that when you delete a NotReady node, its pods do get reassigned, but the worker is re-added to the cluster only after the k3s-agent service is restarted.
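
For reference, a hedged sketch of that manual recovery (node name taken from the output above). Deleting the Node object is what frees the pods; the agent then has to be restarted on the worker so it re-registers:

kubectl delete node worker2        # its pods are removed from the apiserver and rescheduled elsewhere
sudo systemctl restart k3s-agent   # run on the worker once it is reachable again, so it re-joins the cluster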

serverbaboon commented 4 years ago

I powered off a worker node (worker2) on my 3-node Raspberry Pi 4 cluster running Rook/Ceph some 3.5 hours ago, and my cluster still has not really recovered. Setting aside the WordPress failure (the new instance cannot bind to the PVC because Kubernetes still thinks there is a claim from the terminating instance on the powered-off node), the k3s-provisioned Traefik LB instance is still listed as Terminating and hanging there.

The things that have recovered are the ones (mostly Rook) that do not have a PVC: even though the old instances on the failed node are still listed as Terminating, that does not stop the new instances from coming up.

Am I missing something here regarding Kubernetes node failure?

NAMESPACE                 NAME                                                    READY   STATUS              RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
kube-system               pod/helm-install-traefik-2zd8t                          0/1     Completed           0          11d     10.42.0.3        master    <none>           <none>
kubernetes-dashboard      pod/kubernetes-dashboard-544f4d6b8c-4bmbm               1/1     Running             1          2d      10.42.1.127      worker1   <none>           <none>
kubernetes-dashboard      pod/dashboard-metrics-scraper-744c77948-n2z5w           1/1     Running             1          2d      10.42.1.126      worker1   <none>           <none>
kube-system               pod/svclb-traefik-zq5sw                                 3/3     Running             30         11d     10.42.1.128      worker1   <none>           <none>
cert-manager              pod/cert-manager-5c47f46f57-ww4ql                       1/1     Running             1          45h     10.42.0.114      master    <none>           <none>
kube-system               pod/metrics-server-6d684c7b5-pgmtf                      1/1     Running             9          11d     10.42.0.117      master    <none>           <none>
kube-system               pod/local-path-provisioner-58fb86bdfd-xxkr9             1/1     Running             9          11d     10.42.0.118      master    <none>           <none>
kube-system               pod/svclb-traefik-q6tx6                                 3/3     Running             27         11d     10.42.0.115      master    <none>           <none>
cert-manager              pod/cert-manager-webhook-547567b88f-4nhx9               1/1     Running             1          45h     10.42.0.112      master    <none>           <none>
kube-system               pod/coredns-d798c9dd-b5h2l                              1/1     Running             9          11d     10.42.0.119      master    <none>           <none>
kube-system               pod/traefik-65bccdc4bd-2qglj                            1/1     Running             9          11d     10.42.0.116      master    <none>           <none>
rook-ceph                 pod/rook-discover-dthqw                                 1/1     Running             0          17h     10.42.0.120      master    <none>           <none>
rook-ceph                 pod/rook-discover-jb5gm                                 1/1     Running             0          17h     10.42.1.129      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-agent-fhct7                               1/1     Running             0          17h     192.168.10.107   worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-agent-wkl5s                               1/1     Running             0          17h     192.168.10.102   master    <none>           <none>
rook-ceph                 pod/rook-ceph-mon-a-7987b7749c-dqhv9                    1/1     Running             0          17h     10.42.1.132      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-mon-c-59d7b8fb4d-7sqjj                    1/1     Running             0          17h     10.42.0.122      master    <none>           <none>
rook-ceph                 pod/rook-ceph-crashcollector-worker1-6bbbbf6696-zxzqc   1/1     Running             0          17h     10.42.1.133      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-crashcollector-master-8cf749cdc-zw6ph     1/1     Running             0          17h     10.42.0.123      master    <none>           <none>
rook-ceph                 pod/rook-ceph-osd-1-dbb578859-6rv64                     1/1     Running             0          17h     10.42.1.135      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-osd-2-6c7d9966cd-56ggs                    1/1     Running             0          17h     10.42.0.125      master    <none>           <none>
rook-ceph                 pod/rook-ceph-tools-57d8bd875b-nzmdh                    1/1     Running             0          17h     192.168.10.107   worker1   <none>           <none>
default                   pod/adminer-69bcfb4764-bngsb                            1/1     Running             0          15h     10.42.0.129      master    <none>           <none>
rook-cockroachdb-system   pod/rook-cockroachdb-operator-784f89dcc5-hgzq7          1/1     Running             0          5h59m   10.42.0.130      master    <none>           <none>
default                   pod/mariadb-0                                           1/1     Running             0          4h4m    10.42.1.143      worker1   <none>           <none>
kube-system               pod/svclb-traefik-lxfjb                                 3/3     Running             21         10d     10.42.2.117      worker2   <none>           <none>
rook-ceph                 pod/rook-discover-grnvm                                 1/1     Running             0          17h     10.42.2.119      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-agent-5nz5d                               1/1     Running             0          17h     192.168.10.95    worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-mgr-a-7f65b8f79f-kqzvw                    1/1     Terminating         2          17h     10.42.2.122      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-mgr-a-7f65b8f79f-p7vrh                    1/1     Running             0          3h32m   10.42.1.144      worker1   <none>           <none>
default                   pod/wordpress-6c7c6fcccf-8hsvc                          1/1     Terminating         0          4h8m    10.42.2.134      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-osd-0-6786789854-6qzd5                    1/1     Terminating         0          17h     10.42.2.125      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-mon-b-565bc66f97-64q84                    1/1     Terminating         0          17h     10.42.2.121      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-crashcollector-worker2-67895bf8df-f8cqr   1/1     Terminating         0          17h     10.42.2.126      worker2   <none>           <none>
cert-manager              pod/cert-manager-cainjector-6659d6844d-krnhk            1/1     Terminating         2          45h     10.42.2.116      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-operator-6d794bf987-plntb                 1/1     Terminating         0          17h     10.42.2.118      worker2   <none>           <none>
rook-ceph                 pod/rook-ceph-mon-b-565bc66f97-gs8h5                    0/1     Pending             0          3h27m   <none>           <none>    <none>           <none>
rook-ceph                 pod/rook-ceph-osd-0-6786789854-6v765                    0/1     Pending             0          3h27m   <none>           <none>    <none>           <none>
default                   pod/wordpress-6c7c6fcccf-8mhdd                          0/1     ContainerCreating   0          3h27m   <none>           worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-crashcollector-worker2-67895bf8df-5sksv   0/1     Pending             0          3h27m   <none>           <none>    <none>           <none>
rook-ceph                 pod/rook-ceph-operator-6d794bf987-bq6zm                 1/1     Running             0          3h27m   10.42.0.133      master    <none>           <none>
cert-manager              pod/cert-manager-cainjector-6659d6844d-7p7p5            1/1     Running             0          3h27m   10.42.1.145      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-osd-prepare-master-bphzw                  0/1     Completed           0          3h5m    10.42.0.135      master    <none>           <none>
rook-ceph                 pod/rook-ceph-osd-prepare-worker1-mdhmt                 0/1     Completed           0          3h5m    10.42.1.146      worker1   <none>           <none>
rook-ceph                 pod/rook-ceph-mon-d-canary-666965574c-62b2f             0/1     Pending             0          15m     <none>           <none>    <none>           <none>

NAMESPACE   NAME           STATUS     ROLES    AGE   VERSION         INTERNAL-IP      EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION      CONTAINER-RUNTIME
            node/worker2   NotReady   <none>   10d   v1.16.3-k3s.2   192.168.15.9    <none>        Ubuntu 19.10   5.3.0-1014-raspi2   containerd://1.3.0-k3s.5
            node/master    Ready      master   11d   v1.16.3-k3s.2   192.168.15.10   <none>        Ubuntu 19.10   5.3.0-1014-raspi2   containerd://1.3.0-k3s.5
            node/worker1   Ready      <none>   11d   v1.16.3-k3s.2   192.168.15.11   <none>        Ubuntu 19.10   5.3.0-1014-raspi2   containerd://1.3.0-k3s.5
serverbaboon commented 4 years ago

So powering up the 'failed' node allowed all the Terminating instances to finally end; the Rook config sorted itself out, and my WordPress instance finally came back along with cert-manager, as the PVC (on WordPress) was finally released.

kamilgregorczyk commented 4 years ago

I learned that there's a difference between a node being in the NotReady state and deleting the node. When a node goes NotReady, Kubernetes will not reschedule its running pods onto other nodes, because it cannot distinguish between a node restart, a network error, or a kubelet error. Kubernetes reschedules pods only when it is sure they are no longer running, and a node being NotReady does not mean its pods are not running; they might still be running, and the fact that Kubernetes cannot reach the kubelet proves nothing either way. It's really a bummer for me, because:

  1. There should be a deadline: if a node is NotReady for, say, 5 minutes, it should be drained with force, no matter whether something might still be running on it.
  2. Pods that are potentially running on NotReady nodes should be marked somehow, and definitely not shown as 1/1 Running by kubectl.

That's just my point of view, but it's really weird that k3s on its own does not seem to support the --pod-eviction-timeout flag, which defaults to 5 minutes.

The script I published cordons the faulty nodes, drains them, and eventually deletes them; it uncordons a node once it's back in the Ready state. K3s seems to re-join a worker to the master only when the agent restarts, though.

erikwilson commented 4 years ago

Please see https://kubernetes.io/docs/concepts/architecture/nodes/, from that link:

In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the apiserver, and frees up their names.

So pods stuck in a Terminating state, with a duplicate running on another node, look to be expected. The --pod-eviction-timeout flag should be settable like: k3s server --kube-controller-manager-arg pod-eviction-timeout=1m.
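
For example, a minimal sketch of passing that argument at install time, assuming the standard get.k3s.io install script (it simply appends the flag to the generated k3s server command):

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --kube-controller-manager-arg pod-eviction-timeout=1m" sh -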

The key line in the original issue, "Controller detected that all Nodes are not-Ready. Entering master disruption mode.", looks to be related to https://github.com/kubernetes/kubernetes/issues/42733. If all of the nodes become NotReady, the controller manager may refuse to evict.

kamilgregorczyk commented 4 years ago

@erikwilson in my case none of the pods was in the Terminating/Unknown state (it was the same when only one node was NotReady), and wasn't that issue fixed? 🤔 I will set the --kube-controller-manager-arg pod-eviction-timeout=1m flag and see what happens.

erikwilson commented 4 years ago

It looks like the expected behavior; also see, from that docs link:

The corner case is when all zones are completely unhealthy (i.e. there are no healthy nodes in the cluster). In such case, the node controller assumes that there’s some problem with master connectivity and stops all evictions until some connectivity is restored.

serverbaboon commented 4 years ago

@erikwilson

Ok, thanks. That would tie in with the last time I tried this, which was on an earlier version of Kubernetes, so I was not aware of that change. I also think I previously did this on a Rancher-managed cluster with some node-management options set, so I never had an issue.

izeau commented 4 years ago

Hi. I’m experiencing the same issue and mitigated it with the following script in my launch template user data:

# Delete every node that is NotReady, except the node this instance itself is becoming.
kubectl get nodes |
  awk -v "host=$(hostname)" '$1 != host && $2 == "NotReady" { print $1 }' |
  xargs --no-run-if-empty kubectl delete node

So when one node goes down, the autoscaling group creates a new instance that will run the above script when booting.

I advise you to triple-check that hostname returns the correct hostname for your nodes; otherwise you risk deleting the current node...

The node drain was not working and got stuck forever, since the target node was dead. So much for HA!

samirsss commented 4 years ago

We're hitting this issue consistently as well, even when trying to drain the node (which is NotReady and disabled for scheduling):

NAME    STATUS                        ROLES    AGE   VERSION
node1   NotReady,SchedulingDisabled   master   43m   v1.17.3+k3s1
node2   Ready                         master   43m   v1.17.3+k3s1

The pods from node1 stay in "Terminating" mode forever, until the node comes back up.

And this has real consequences: we have one DaemonSet (rabbitmq), and its pod doesn't terminate or get deleted, which causes other services to keep trying to connect to it, which in turn keeps those pods from coming up correctly.

kamilgregorczyk commented 4 years ago

I noticed the same thing; I had to drain the nodes and force-delete the pods to get rid of them.

jaimehrubiks commented 4 years ago

Same issue here. Only masters, running rancher-server. Pods are stuck in Running even though those nodes have been NotReady for more than 15 minutes.

erikwilson commented 4 years ago

Does this happen when using 3+ nodes?

samirsss commented 4 years ago

I've only tested this with 2 or 3 nodes, and it happens in both setups.

erikwilson commented 4 years ago

It happens with HA when using 3 master nodes and taking 1 of the nodes down? Using what type of database?

samirsss commented 4 years ago

We were using a Postgres DB as the backend when 1 node was taken down. My main use case is a 2-node k3s cluster, and it's very easy to see this there.

erikwilson commented 4 years ago

I don't think Kubernetes supports 2-node clusters with 1 node taken down very well, as cited in the messages above.

rogersd commented 3 years ago

Also having this issue with nodes not going away after they've been replaced:

ip-10-12-82-234   NotReady   <none>   12d     v1.17.9+k3s1
ip-10-12-65-201   NotReady   <none>   15d     v1.17.9+k3s1
ip-10-12-90-123   NotReady   <none>   15d     v1.17.9+k3s1
ip-10-12-48-200   NotReady   <none>   12d     v1.17.9+k3s1
ip-10-12-78-179   NotReady   <none>   12d     v1.17.9+k3s1
ip-10-12-52-75    NotReady   <none>   15d     v1.17.9+k3s1
ip-10-12-67-220   NotReady   master   29d     v1.17.9+k3s1
ip-10-12-81-212   NotReady   master   29d     v1.17.9+k3s1
ip-10-12-55-185   NotReady   master   14d     v1.17.9+k3s1
ip-10-12-83-151   NotReady   master   7d3h    v1.17.9+k3s1
ip-10-12-49-50    NotReady   master   7d3h    v1.17.9+k3s1
ip-10-12-48-195   NotReady   <none>   5d1h    v1.17.9+k3s1
ip-10-12-68-212   NotReady   <none>   5d1h    v1.17.9+k3s1
ip-10-12-94-45    NotReady   <none>   5d1h    v1.17.9+k3s1
ip-10-12-95-46    NotReady   master   3h10m   v1.17.9+k3s1
ip-10-12-56-63    NotReady   master   4h1m    v1.17.9+k3s1
ip-10-12-79-230   NotReady   master   4h13m   v1.17.9+k3s1
ip-10-12-79-118   NotReady   <none>   3h17m   v1.17.9+k3s1
ip-10-12-88-104   NotReady   <none>   3h17m   v1.17.9+k3s1
ip-10-12-53-206   NotReady   <none>   3h17m   v1.17.9+k3s1
ip-10-12-90-16    NotReady   <none>   3h1m    v1.17.9+k3s1
ip-10-12-54-163   NotReady   master   3h10m   v1.17.9+k3s1
ip-10-12-53-78    NotReady   <none>   3h1m    v1.17.9+k3s1
ip-10-12-71-230   NotReady   master   3h10m   v1.17.9+k3s1
ip-10-12-86-199   NotReady   master   4h7m    v1.17.9+k3s1
ip-10-12-79-37    NotReady   <none>   3h1m    v1.17.9+k3s1
ip-10-12-91-161   NotReady   master   146m    v1.17.4+k3s1
ip-10-12-68-68    NotReady   master   146m    v1.17.4+k3s1
ip-10-12-57-50    NotReady   master   146m    v1.17.4+k3s1
ip-10-12-52-91    NotReady   <none>   147m    v1.17.4+k3s1
ip-10-12-84-159   NotReady   <none>   146m    v1.17.4+k3s1
ip-10-12-73-9     NotReady   <none>   146m    v1.17.4+k3s1
ip-10-12-49-200   Ready      master   29m     v1.17.9+k3s1
ip-10-12-70-140   Ready      <none>   27m     v1.17.9+k3s1
ip-10-12-84-215   Ready      <none>   27m     v1.17.9+k3s1
ip-10-12-55-103   Ready      <none>   27m     v1.17.9+k3s1
ip-10-12-83-6     Ready      master   27m     v1.17.9+k3s1
ip-10-12-76-62    Ready      master   27m     v1.17.9+k3s1
brandond commented 3 years ago

@rogersd k3s does not delete nodes on its own. It has no way of knowing if the nodes are just temporarily offline, or if they are gone forever.

If you install an out-of-tree cloud provider (such as https://github.com/kubernetes/cloud-provider-aws) it has the necessary hooks to talk to your cloud provider API, and delete nodes that have been terminated. You could also just script this manually using the Kubernetes API or kubectl, deleting nodes that have been offline (NotReady) for a period of time.
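
A rough sketch of such a cleanup script (the 30-minute threshold is illustrative and GNU date is assumed; none of this ships with k3s):

#!/bin/bash
# Delete any node that has been NotReady for more than 30 minutes.
for node in $(kubectl get nodes --no-headers | awk '$2 ~ /^NotReady/ {print $1}'); do
  # lastHeartbeatTime of the Ready condition records when the kubelet last reported in.
  last_seen=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}')
  if [ $(( $(date +%s) - $(date -d "$last_seen" +%s) )) -gt 1800 ]; then
    echo "Deleting $node (last heartbeat: $last_seen)"
    kubectl delete node "$node"
  fi
done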

fuero commented 3 years ago

It happens with HA when using 3 master nodes and taking 1 of the nodes down? Using what type of database?

@erikwilson Same here with 3 ODroid H2 nodes and etcd with the latest k3s version.

@brandond I'm kind of late to the party here, sorry, but I'm confused by your comment. Are you talking about STONITH or some variation of that? Using a cloud provider API doesn't work when you run on actual bare-metal nodes.

Shouldn't 2 out of 3 nodes suffice to establish quorum? It doesn't matter that k8s doesn't know what's up with the misbehaving node; for all intents and purposes it's dead, and k8s should act accordingly. It doesn't, and the question is: why, and how can I make it work?

This hasn't been answered so far. If I got things wrong, please explain.

brandond commented 3 years ago

@fuero I was specifically replying to the comment about Kubernetes not deleting EC2 nodes that no longer exist. If you are autoscaling or otherwise dynamically provisioning cluster nodes, you need some mechanism to remove terminated nodes from the cluster.

With regards to the node being 'dead' but not gone, and how pods previously running on it are handled, there are tunable timeouts in the core Kubernetes code that you can alter via CLI flags to change how long a node can be NotReady before pods on it will be rescheduled onto a different node.
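
For example (a hedged sketch; the values are illustrative, not recommendations):

k3s server --kube-controller-manager-arg node-monitor-grace-period=20s --kube-apiserver-arg default-unreachable-toleration-seconds=60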

Platou commented 3 years ago

I'm still experiencing this issue.

k3s version: v1.19.7+k3s1

I applied the pod eviction timeout (https://github.com/k3s-io/k3s/issues/1264#issuecomment-571237831): --kube-controller-manager-arg pod-eviction-timeout=10s

When I shut down a node, nothing happens for 5 minutes; the pods on the powered-off node are still in the Running state. After 5 minutes, the pods on the powered-off node go into a Terminating state forever, until I boot the node back up.

I suspect my 10-second eviction timeout is not being taken into account, and the 5-minute default is what applies in my case (https://github.com/k3s-io/k3s/issues/1264#issuecomment-571225390).

After the pod eviction timeout, shouldn't my pods be rescheduled to another node? Because in this case it's not HA at all...

any updates? @kamilgregorczyk @erikwilson

jawabuu commented 3 years ago

Hey @brandond @erikwilson, I'm able to reproduce this consistently in v1.20.4+k3s1. Start k3s with the following flags (any number of nodes):

 "--kubelet-arg 'node-status-update-frequency=4s'",
    "--kube-controller-manager-arg 'node-monitor-period=2s'",
    "--kube-controller-manager-arg 'node-monitor-grace-period=16s'",
    "--kube-apiserver-arg 'default-not-ready-toleration-seconds=20'",
    "--kube-apiserver-arg 'default-unreachable-toleration-seconds=20'"

Power off a node, it is marked as NotReady as expected Wait for pods on that node to be rescheduled. This does not happen. Pods stay in Running state indefinitely.

jawabuu commented 3 years ago

Tested v1.21.1+k3s1 and it works as expected. For anyone coming across this, please note that pod-eviction-timeout is not used post 1.13.
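
The replacement mechanism is taint-based eviction: the default-not-ready-toleration-seconds / default-unreachable-toleration-seconds apiserver flags above set a default tolerationSeconds on every pod, and it can also be overridden per workload. A hedged sketch (the deployment name is hypothetical):

kubectl patch deployment my-app --type merge -p '
{"spec":{"template":{"spec":{"tolerations":[
  {"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":30},
  {"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":30}
]}}}}'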

onedr0p commented 2 years ago

I am seeing this issue when using kube-vip in a DaemonSet; more information about my issue is here.

k3s version: v1.21.4+k3s1
Ubuntu version: 21.04

My masters' config:

cluster-init: true
cluster-cidr: 10.69.0.0/16
disable:
- flannel
- traefik
- servicelb
- metrics-server
- local-storage
disable-cloud-controller: true
disable-network-policy: true
docker: false
flannel-backend: none
kubelet-arg:
- "feature-gates=GracefulNodeShutdown=true"
- "feature-gates=MixedProtocolLBService=true"
node-ip: 192.168.42.10
service-cidr: 10.96.0.0/16
tls-san:
- 192.168.69.5
write-kubeconfig-mode: '644'
kube-controller-manager-arg:
- "address=0.0.0.0"
- "bind-address=0.0.0.0"
kube-proxy-arg:
- "metrics-bind-address=0.0.0.0"
kube-scheduler-arg:
- "address=0.0.0.0"
- "bind-address=0.0.0.0"
etcd-expose-metrics: true

My worker nodes:

kubelet-arg:
- "feature-gates=GracefulNodeShutdown=true"
- "feature-gates=MixedProtocolLBService=true"
node-ip: 192.168.42.13

I can see the taints were added to my k8s-0 node but the pods are not being evicted:

ubuntu@k8s-1:~$ sudo k3s kubectl get ds/kube-vip -n kube-system -o yaml
...
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-23T13:48:30Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2021-08-23T13:48:36Z"
...
ubuntu@k8s-1:~$ sudo k3s kubectl get nodes
NAME    STATUS     ROLES                       AGE   VERSION
k8s-0   NotReady   control-plane,etcd,master   65d   v1.21.4+k3s1
k8s-1   Ready      control-plane,etcd,master   65d   v1.21.4+k3s1
k8s-2   Ready      control-plane,etcd,master   65d   v1.21.4+k3s1
k8s-3   Ready      worker                      65d   v1.21.4+k3s1
k8s-4   Ready      worker                      65d   v1.21.4+k3s1
k8s-5   Ready      worker                      65d   v1.21.4+k3s1
ubuntu@k8s-1:~$ sudo k3s kubectl get po -n kube-system -l "app.kubernetes.io/instance=kube-vip" -o wide
kube-vip-jk96t                                  1/1     Running     4          30d     192.168.42.12   k8s-2   <none>           <none>
kube-vip-kdg8x                                  1/1     Running     4          30d     192.168.42.11   k8s-1   <none>           <none>
kube-vip-r9vhx                                  1/1     Running     5          30d     192.168.42.10   k8s-0   <none>           <none>
brandond commented 2 years ago

What's managing those pods? DaemonSet/Deployment/etc.? Whatever's going on here is core Kubernetes behavior; I suspect it's just not doing what you expected.
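
In particular, if those kube-vip pods come from a DaemonSet, the DaemonSet controller adds node.kubernetes.io/unreachable and node.kubernetes.io/not-ready NoExecute tolerations without a tolerationSeconds, so they are intentionally never evicted from a NotReady node. A hedged way to confirm (label taken from the output above):

kubectl get pod -n kube-system -l "app.kubernetes.io/instance=kube-vip" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations}{"\n"}{end}'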

piellick commented 2 years ago

Hi, is there a workaround to use pod-eviction-timeout on K3s 1.21.4?

Cryingmouse commented 2 years ago

@jawabuu Is there any document I can refer to about the arguments mentioned in your notes?

Hey @brandond @erikwilson, I'm able to reproduce this consistently in v1.20.4+k3s1. Start k3s with the following flags (any number of nodes):

"--kubelet-arg 'node-status-update-frequency=4s'",
"--kube-controller-manager-arg 'node-monitor-period=2s'",
"--kube-controller-manager-arg 'node-monitor-grace-period=16s'",
"--kube-apiserver-arg 'default-not-ready-toleration-seconds=20'",
"--kube-apiserver-arg 'default-unreachable-toleration-seconds=20'"

Power off a node; it is marked NotReady as expected. Wait for the pods on that node to be rescheduled: this does not happen. The pods stay in the Running state indefinitely.

Tested v1.21.1+k3s1 and it works as expected. For anyone coming across this, please note that pod-eviction-timeout is not used post 1.13.

bufo333 commented 2 years ago

@jawabuu Is there any document I can refer to about the arguments mentioned in your notes?

Hey @brandond @erikwilson, I'm able to reproduce this consistently in v1.20.4+k3s1. Start k3s with the following flags (any number of nodes):

"--kubelet-arg 'node-status-update-frequency=4s'",
"--kube-controller-manager-arg 'node-monitor-period=2s'",
"--kube-controller-manager-arg 'node-monitor-grace-period=16s'",
"--kube-apiserver-arg 'default-not-ready-toleration-seconds=20'",
"--kube-apiserver-arg 'default-unreachable-toleration-seconds=20'"

Power off a node; it is marked NotReady as expected. Wait for the pods on that node to be rescheduled: this does not happen. The pods stay in the Running state indefinitely.

Tested v1.21.1+k3s1 and it works as expected. For anyone coming across this, please note that pod-eviction-timeout is not used post 1.13.

Any updates? I am experiencing the same behavior.

brandond commented 2 years ago

This would be the responsibility of the Kubernetes controller-manager. Can you show the output of kubectl get node,lease -n kube-system -o wide?

ccwalterhk commented 2 years ago

Hi, may I check whether there is any solution for this problem? I am using v1.21.4 and also see the problem.

NAME         STATUS   ROLES                  AGE     VERSION
k3-slave3    Ready    <none>                 118d    v1.21.5+k3s2
k3s-slave2   Ready    <none>                 139d    v1.21.4+k3s1
k3s-slave4   Ready    <none>                 6d23h   v1.22.6+k3s1
k3-master    Ready    control-plane,master   139d    v1.21.4+k3s1
k3s-slave1   Ready    <none>                 139d    v1.21.4+k3s1
brandond commented 2 years ago

@ccwalterhk you appear to have an agent running a newer version of Kubernetes than the server. This is not supported; please upgrade your servers if you are going to have agents running 1.22.

ccwalterhk commented 2 years ago

I just created a new cluster using the latest version. However, I still see the same problem. Even when s1 is not available, the pods do not get restarted on other nodes.

walter@k3s-m1-mark3:~$ k get node -o wide
NAME           STATUS     ROLES                  AGE    VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k3s-s1-mark3   NotReady   <none>                 169m   v1.22.6+k3s1   192.168.1.91   <none>        Ubuntu 20.04.3 LTS   5.4.0-99-generic   containerd://1.5.9-k3s1
k3s-m1-mark3   Ready      control-plane,master   171m   v1.22.6+k3s1   192.168.1.90   <none>        Ubuntu 20.04.3 LTS   5.4.0-99-generic   containerd://1.5.9-k3s1
k3s-s2-mark3   Ready      <none>                 169m   v1.22.6+k3s1   192.168.1.92   <none>        Ubuntu 20.04.3 LTS   5.4.0-99-generic   containerd://1.5.9-k3s1
walter@k3s-m1-mark3:~$ k get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP           NODE           NOMINATED NODE   READINESS GATES
hello-world-7884c6997d-h9nwx   1/1     Running   0          6m42s   10.42.0.10   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-vxlsl   1/1     Running   0          6m42s   10.42.0.9    k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-gbx8x   1/1     Running   0          6m42s   10.42.0.11   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-cdfp8   1/1     Running   0          6m42s   10.42.2.6    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-2ksws   1/1     Running   0          6m42s   10.42.2.4    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-2gflm   1/1     Running   0          6m42s   10.42.2.5    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-hsnct   1/1     Running   0          6m42s   10.42.1.6    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-5xhf7   1/1     Running   0          6m42s   10.42.1.4    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-gzbvq   1/1     Running   0          6m42s   10.42.1.5    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-gh5qc   1/1     Running   0          6m42s   10.42.1.3    k3s-s1-mark3   <none>           <none>
walter@k3s-m1-mark3:~$ 
ccwalterhk commented 2 years ago

After waiting about 8 minutes, the pods are Terminating. Thank you very much. Can I check how to detect the failure faster and restart the pods on other nodes?

walter@k3s-m1-mark3:~$ k get pod -o wide
NAME                           READY   STATUS        RESTARTS   AGE     IP           NODE           NOMINATED NODE   READINESS GATES
hello-world-7884c6997d-h9nwx   1/1     Running       0          10m     10.42.0.10   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-vxlsl   1/1     Running       0          10m     10.42.0.9    k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-gbx8x   1/1     Running       0          10m     10.42.0.11   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-cdfp8   1/1     Running       0          10m     10.42.2.6    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-2ksws   1/1     Running       0          10m     10.42.2.4    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-2gflm   1/1     Running       0          10m     10.42.2.5    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-gzbvq   1/1     Terminating   0          10m     10.42.1.5    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-5xhf7   1/1     Terminating   0          10m     10.42.1.4    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-gh5qc   1/1     Terminating   0          10m     10.42.1.3    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-hsnct   1/1     Terminating   0          10m     10.42.1.6    k3s-s1-mark3   <none>           <none>
hello-world-7884c6997d-wqzfq   1/1     Running       0          2m47s   10.42.0.12   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-6bsx4   1/1     Running       0          2m47s   10.42.0.13   k3s-m1-mark3   <none>           <none>
hello-world-7884c6997d-njgdd   1/1     Running       0          2m47s   10.42.2.8    k3s-s2-mark3   <none>           <none>
hello-world-7884c6997d-9w8vh   1/1     Running       0          2m47s   10.42.2.7    k3s-s2-mark3   <none>           <none>
walter@k3s-m1-mark3:~$ 
dfoxg commented 2 years ago

With these options I was able to reduce the 8 minutes you mentioned to ~20 seconds:

--kubelet-arg "node-status-update-frequency=4s" \
--kube-controller-manager-arg "node-monitor-period=4s" \
--kube-controller-manager-arg "node-monitor-grace-period=16s" \
--kube-controller-manager-arg "pod-eviction-timeout=20s" \
--kube-apiserver-arg "default-not-ready-toleration-seconds=20" \
--kube-apiserver-arg "default-unreachable-toleration-seconds=20" \
janvanveldhuizen commented 2 years ago

With these options I was able to reduce the 8 minutes you mentioned to ~20 seconds:

And where did you put these parameters? On the master node(s)? Or on the workers as well?

helletheone commented 1 year ago

Same problem here:

k3s-agent-large-ilg          Ready                               17m    v1.23.8+k3s2
k3s-agent-large-kmf          Ready                               6d8h   v1.23.8+k3s2
k3s-agent-small-uui          Ready                               32m    v1.23.8+k3s2
k3s-control-plane-fsn1-dke   Ready   control-plane,etcd,master   6d8h   v1.23.8+k3s2

Jeffote commented 1 year ago

With these options I was able to reduce the 8 minutes you mentioned to ~20 seconds:

--kubelet-arg "node-status-update-frequency=4s" \
--kube-controller-manager-arg "node-monitor-period=4s" \
--kube-controller-manager-arg "node-monitor-grace-period=16s" \
--kube-controller-manager-arg "pod-eviction-timeout=20s" \
--kube-apiserver-arg "default-not-ready-toleration-seconds=20" \
--kube-apiserver-arg "default-unreachable-toleration-seconds=20" \

I had the same problem. After I added those to the systemd service, the settings are applied to every new pod, so I had to terminate the old ones by hand, and it worked like a charm on the new ones. My k3s version is v1.24.3+k3s1.

timowevel1 commented 1 year ago

--kubelet-arg

Hey, where exactly did you pass these arguments?

Jeffote commented 1 year ago

To the ExecStart line in the systemd service:

ExecStart=/usr/local/bin/k3s server --https-listen-port '7443' '--kubelet-arg' "node-status-update-frequency=4s" etc.
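
An alternative sketch, assuming a k3s version recent enough to read /etc/rancher/k3s/config.yaml (the same settings, without editing the unit file):

sudo tee /etc/rancher/k3s/config.yaml >/dev/null <<'EOF'
kubelet-arg:
- "node-status-update-frequency=4s"
kube-controller-manager-arg:
- "node-monitor-period=4s"
- "node-monitor-grace-period=16s"
- "pod-eviction-timeout=20s"
kube-apiserver-arg:
- "default-not-ready-toleration-seconds=20"
- "default-unreachable-toleration-seconds=20"
EOF
sudo systemctl restart k3s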

caroline-suse-rancher commented 1 year ago

Closing as this appears to be expected upstream behavior with a valid workaround.

framctr commented 8 months ago

With these options I was able to reduce the 8 minutes you mentioned to ~20 seconds:

--kubelet-arg "node-status-update-frequency=4s" \
--kube-controller-manager-arg "node-monitor-period=4s" \
--kube-controller-manager-arg "node-monitor-grace-period=16s" \
--kube-controller-manager-arg "pod-eviction-timeout=20s" \
--kube-apiserver-arg "default-not-ready-toleration-seconds=20" \
--kube-apiserver-arg "default-unreachable-toleration-seconds=20" \

Unfortunately, many of these parameters are removed from Kubernetes v1.27. See, for example, the node-status-update-frequency argument in the official Kubernetes docs.

brandond commented 8 months ago

They have not been removed. They've been listed as deprecated for ages, but I am not aware of any actual work to remove them and force use of a config file.

anshuman852 commented 8 months ago

I have a deployment with a PVC attached in ReadWriteOnce mode. To test this, I turned off the k3s service on one of the nodes. After waiting some time, the pods did go into the Terminating state, but now the deployment with the PVC won't start because the volume is still attached to the older pod.

Is it possible to delete or evict the pods instead of them being stuck in the Terminating state?

camaeel commented 8 months ago

@anshuman852 I think this "terminating" state means it tries to perform eviction or delete. But it is not able - either because kubelet is not responding or because there is finalizer on the pod. You can try checking pod manifests and logs of kube-controller-manager what is happening and what is the issue.