k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s on external mysql db is stuck, no scheduling works, no new pods #3286

Closed golgotow closed 3 years ago

golgotow commented 3 years ago

Environmental Info: K3s Version:

root@brixone:~# k3s -v
k3s version v1.20.5+k3s1 (355fff30)
go version go1.15.10

root@kube1-worker:~# k3s -v
k3s version v1.20.6+k3s1 (8d043282)
go version go1.15.10

Node(s) CPU architecture, OS, and Version: x86, 5x debian 10.9

root@brixone:~# kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
kube1-worker   Ready    <none>                 27m   v1.20.6+k3s1   192.168.12.1     <none>        Debian GNU/Linux 10 (buster)   4.19.0-16-amd64   containerd://1.4.4-k3s1
kube2-worker   Ready    <none>                 22m   v1.20.6+k3s1   192.168.12.2     <none>        Debian GNU/Linux 10 (buster)   4.19.0-16-amd64   containerd://1.4.4-k3s1
msi55          Ready    control-plane,master   25h   v1.20.6+k3s1   192.168.14.55    <none>        Debian GNU/Linux 10 (buster)   5.4.106-1-pve     containerd://1.4.4-k3s1
brixone        Ready    control-plane,master   28d   v1.20.5+k3s1   192.168.14.246   <none>        Debian GNU/Linux 10 (buster)   5.4.106-1-pve     containerd://1.4.4-k3s1
corosync       Ready    control-plane,master   28d   v1.20.5+k3s1   192.168.14.253   <none>        Debian GNU/Linux 10 (buster)   5.4.106-1-pve     containerd://1.4.4-k3s1

Cluster Configuration: currently 3 masters, 2 workers

Describe the bug: The cluster reports 3 masters and 2 workers as Ready, but no pod scheduling works. The problem appeared after 1 master dropped; the 2 other masters stopped working but still report as Ready.

Steps To Reproduce: Hard to say. I have k3s with 3 masters and 3 workers; after 1 master was restarted, the cluster stopped scheduling new pods. We tried to restart the whole cluster, but then all 3 workers ended up in NotReady status, so we removed them and added 2 others that now show Ready, but no pods are created on them.

Expected behavior: The cluster schedules new pods normally.

Actual behavior: No new pods are created on masters or workers.

Additional context / logs:

kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
kube1-worker   Ready    <none>                 26m   v1.20.6+k3s1
kube2-worker   Ready    <none>                 20m   v1.20.6+k3s1
msi55          Ready    control-plane,master   25h   v1.20.6+k3s1
brixone        Ready    control-plane,master   28d   v1.20.5+k3s1
corosync       Ready    control-plane,master   28d   v1.20.5+k3s1
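
For reference, a minimal sketch of how to check why pods stay unscheduled (pod and namespace names below are placeholders):

# List pods stuck in Pending and look at recent cluster events
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20
# Describe one of the pending pods to see the scheduler's message
kubectl describe pod <pending-pod-name> -n <namespace>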

brandond commented 3 years ago

Can you attach k3s service logs from one or all of the K3s servers? Kinda hard to tell what's going on from just a list of nodes. Performance statistics from the database server might also be useful - CPU/memory utilization, disk latency, etc.
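
As a sketch, on a systemd-based install the service logs can usually be collected with journalctl, and a quick first look at the database side can come from the host itself (service and file names may differ per install):

# On each k3s server node (the unit is normally "k3s"; agents use "k3s-agent")
journalctl -u k3s --since "1 hour ago" > k3s.log

# On the external MySQL host: rough CPU/memory picture plus server activity
top -b -n 1 | head -n 20
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%'; SHOW PROCESSLIST;"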

golgotow commented 3 years ago

Hello,

yes sure,

k3s.log

brandond commented 3 years ago

There appear to be a couple things going on. First of all, I see occasional slow datastore requests. IO to the datastore should generally complete in no more than 500ms; some operations are taking a second or longer. Here's an example:

May  7 00:07:31 corosync k3s[2308520]: I0507 00:07:31.667567 2308520 trace.go:205] Trace[2082060259]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (07-May-2021 00:07:30.656) (total time: 1011ms):
May  7 00:07:31 corosync k3s[2308520]: Trace[2082060259]: ---"Transaction committed" 1008ms (00:07:00.666)
May  7 00:07:31 corosync k3s[2308520]: Trace[2082060259]: [1.011374462s] [1.011374462s] END
May  7 00:07:31 corosync k3s[2308520]: I0507 00:07:31.667992 2308520 trace.go:205] Trace[1230276317]: "Update" url:/apis/longhorn.io/v1beta1/namespaces/longhorn-system/nodes/prox2/status,user-agent:longhorn-manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:192.168.13.179 (07-May-2021 00:07:30.655) (total time: 1012ms):
May  7 00:07:31 corosync k3s[2308520]: Trace[1230276317]: [1.012650846s] [1.012650846s] END
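
One way to confirm where that extra second is going (assuming the external datastore is MySQL, per the issue title) is to temporarily log slow queries on the database server; these settings revert on restart unless persisted in my.cnf:

# Log any query slower than 500ms and find out where the log is written
mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 0.5;"
mysql -e "SELECT @@slow_query_log_file;"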

Second, I see a node that is NOT listed in your kubectl get nodes output continuously bootstrapping and then disconnecting:

May  7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.305174101+02:00" level=info msg="Cluster-Http-Server 2021/05/07 00:08:11 http: TLS handshake error from 192.168.13.178:32874: remote error: tls: bad certificate"
May  7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.336177773+02:00" level=info msg="Cluster-Http-Server 2021/05/07 00:08:11 http: TLS handshake error from 192.168.13.178:32886: remote error: tls: bad certificate"
May  7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.647552647+02:00" level=info msg="certificate CN=prox1 signed by CN=k3s-server-ca@1617871439: notBefore=2021-04-08 08:43:59 +0000 UTC notAfter=2022-05-06 22:08:11 +0000 UTC"
May  7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.911239246+02:00" level=info msg="certificate CN=system:node:prox1,O=system:nodes signed by CN=k3s-client-ca@1617871439: notBefore=2021-04-08 08:43:59 +0000 UTC notAfter=2022-05-06 22:08:11 +0000 UTC"
May  7 00:08:13 corosync k3s[2308520]: time="2021-05-07T00:08:13.018347539+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:14 corosync k3s[2308520]: time="2021-05-07T00:08:14.164300174+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May  7 00:08:21 corosync k3s[2308520]: time="2021-05-07T00:08:21.499062862+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:22 corosync k3s[2308520]: time="2021-05-07T00:08:22.615602638+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May  7 00:08:29 corosync k3s[2308520]: time="2021-05-07T00:08:29.814353309+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:30 corosync k3s[2308520]: time="2021-05-07T00:08:30.941233461+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May  7 00:08:38 corosync k3s[2308520]: time="2021-05-07T00:08:38.433169046+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:39 corosync k3s[2308520]: time="2021-05-07T00:08:39.583403396+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May  7 00:08:47 corosync k3s[2308520]: time="2021-05-07T00:08:47.137879819+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:48 corosync k3s[2308520]: time="2021-05-07T00:08:48.283513507+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May  7 00:08:55 corosync k3s[2308520]: time="2021-05-07T00:08:55.572060753+02:00" level=info msg="Handling backend connection request [prox1]"
May  7 00:08:56 corosync k3s[2308520]: time="2021-05-07T00:08:56.683521123+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"

Do you have a proxy or something between the nodes in your cluster?

golgotow commented 3 years ago

Hello,

No, there is no proxy. The issue was strange: for a couple of days the cluster worked in some lazy mode, but yesterday it started to respond pretty much OK again, and we found no reason why. The prox1/prox2/prox3 nodes were removed, but k3s still tried to do some actions on them.
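
If the prox1/prox2/prox3 machines are still running k3s, they will keep dialing the servers even after their node objects are deleted. A minimal cleanup sketch, assuming they were installed with the standard install script (service and script names depend on whether they ran as agents or servers):

# Remove any leftover node objects (names taken from the logs above)
kubectl delete node prox1 prox2 prox3 --ignore-not-found

# On each removed machine, stop and uninstall k3s so it stops reconnecting
systemctl stop k3s-agent                  # or "k3s" if it ran as a server
/usr/local/bin/k3s-agent-uninstall.sh     # or k3s-uninstall.sh on a server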