Can you attach k3s service logs from one or all of the K3s servers? Kinda hard to tell what's going on from just a list of nodes. Performance statistics from the database server might also be useful - CPU/memory utilization, disk latency, etc.
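For reference, on a systemd-based install the service logs and some basic statistics from the database host can usually be collected with something like the following (the unit names assume the standard install script; adjust as needed):

journalctl -u k3s --since "1 hour ago" > k3s-server.log       # on each server node
journalctl -u k3s-agent --since "1 hour ago" > k3s-agent.log  # on agent nodes
vmstat 5 5                                                    # CPU and memory pressure on the database host
iostat -x 5 5                                                 # per-device disk latency (await) and utilization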
There appear to be a couple things going on. First of all, I see occasional slow datastore requests. IO to the datastore should generally complete in no more than 500ms; some operations are taking a second or longer. Here's an example:
May 7 00:07:31 corosync k3s[2308520]: I0507 00:07:31.667567 2308520 trace.go:205] Trace[2082060259]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (07-May-2021 00:07:30.656) (total time: 1011ms):
May 7 00:07:31 corosync k3s[2308520]: Trace[2082060259]: ---"Transaction committed" 1008ms (00:07:00.666)
May 7 00:07:31 corosync k3s[2308520]: Trace[2082060259]: [1.011374462s] [1.011374462s] END
May 7 00:07:31 corosync k3s[2308520]: I0507 00:07:31.667992 2308520 trace.go:205] Trace[1230276317]: "Update" url:/apis/longhorn.io/v1beta1/namespaces/longhorn-system/nodes/prox2/status,user-agent:longhorn-manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:192.168.13.179 (07-May-2021 00:07:30.655) (total time: 1012ms):
May 7 00:07:31 corosync k3s[2308520]: Trace[1230276317]: [1.012650846s] [1.012650846s] END
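If the datastore runs on a separate database server, it may be worth measuring the fsync latency of its disk directly. The sketch below is adapted from the usual etcd-style disk benchmark; the directory and size values are only illustrative:

fio --name=fsync-test --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/fio-test --size=22m --bs=2300
# watch the fdatasync latency percentiles in the output; if the 99th percentile
# is well above a few milliseconds, the datastore disk is likely the bottleneck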
Second, I see a node that is NOT listed in your kubectl get nodes output, continuously bootstrapping and then disconnecting:
May 7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.305174101+02:00" level=info msg="Cluster-Http-Server 2021/05/07 00:08:11 http: TLS handshake error from 192.168.13.178:32874: remote error: tls: bad certificate"
May 7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.336177773+02:00" level=info msg="Cluster-Http-Server 2021/05/07 00:08:11 http: TLS handshake error from 192.168.13.178:32886: remote error: tls: bad certificate"
May 7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.647552647+02:00" level=info msg="certificate CN=prox1 signed by CN=k3s-server-ca@1617871439: notBefore=2021-04-08 08:43:59 +0000 UTC notAfter=2022-05-06 22:08:11 +0000 UTC"
May 7 00:08:11 corosync k3s[2308520]: time="2021-05-07T00:08:11.911239246+02:00" level=info msg="certificate CN=system:node:prox1,O=system:nodes signed by CN=k3s-client-ca@1617871439: notBefore=2021-04-08 08:43:59 +0000 UTC notAfter=2022-05-06 22:08:11 +0000 UTC"
May 7 00:08:13 corosync k3s[2308520]: time="2021-05-07T00:08:13.018347539+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:14 corosync k3s[2308520]: time="2021-05-07T00:08:14.164300174+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May 7 00:08:21 corosync k3s[2308520]: time="2021-05-07T00:08:21.499062862+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:22 corosync k3s[2308520]: time="2021-05-07T00:08:22.615602638+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May 7 00:08:29 corosync k3s[2308520]: time="2021-05-07T00:08:29.814353309+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:30 corosync k3s[2308520]: time="2021-05-07T00:08:30.941233461+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May 7 00:08:38 corosync k3s[2308520]: time="2021-05-07T00:08:38.433169046+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:39 corosync k3s[2308520]: time="2021-05-07T00:08:39.583403396+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May 7 00:08:47 corosync k3s[2308520]: time="2021-05-07T00:08:47.137879819+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:48 corosync k3s[2308520]: time="2021-05-07T00:08:48.283513507+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
May 7 00:08:55 corosync k3s[2308520]: time="2021-05-07T00:08:55.572060753+02:00" level=info msg="Handling backend connection request [prox1]"
May 7 00:08:56 corosync k3s[2308520]: time="2021-05-07T00:08:56.683521123+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
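The bad-certificate errors followed by the tight reconnect loop usually mean something at 192.168.13.178 is still running a k3s agent or server that no longer matches the cluster's certificates. A quick way to confirm from that host (assuming a systemd-based install) might be:

systemctl status k3s k3s-agent                     # check whether either unit is still active on 192.168.13.178
journalctl -u k3s-agent --since "10 minutes ago"   # that host's side of the failed handshakes, if the agent unit exists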
Do you have a proxy or something between the nodes in your cluster?
Hello,
No, there is no proxy. The issue was strange: for a couple of days the cluster worked in some lazy mode, but yesterday it started to respond pretty much OK again, and we found no reason why. The prox1/prox2/prox3 nodes were removed, but k3s still tried to perform some actions on them.
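If the prox hosts were deleted from the cluster but k3s is still installed on them, they will keep dialing the servers indefinitely. A possible cleanup, assuming the standard install script was used on those hosts, could look like this:

kubectl delete node prox1 prox2 prox3     # remove any leftover Node objects from the cluster
# then, on each removed host:
/usr/local/bin/k3s-agent-uninstall.sh     # if it was an agent
/usr/local/bin/k3s-uninstall.sh           # if it was a server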
Environmental Info:
K3s Version:
root@brixone:~# k3s -v
k3s version v1.20.5+k3s1 (355fff30)
go version go1.15.10
root@kube1-worker:~# k3s -v
k3s version v1.20.6+k3s1 (8d043282)
go version go1.15.10
Node(s) CPU architecture, OS, and Version: x86, 5x Debian 10.9
Cluster Configuration: currently 3 masters, 2 workers
Describe the bug: The cluster reports 3 masters and 2 workers as Ready, but no pod scheduling works. The problem appeared after 1 master dropped; the other 2 simply stopped working, yet still report as Ready.
Steps To Reproduce: Hard to say. I had k3s with 3 masters and 3 workers; after 1 master was restarted, the cluster stopped scheduling new pods. We tried to restart the whole cluster, but then all 3 workers ended up in NotReady status, so we removed them and added 2 others that are now Ready, but no pods are created on them.
Expected behavior: The cluster works and schedules new pods.
Actual behavior: No new pods are created on the masters or workers.
Additional context / logs:
kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
kube1-worker   Ready    <none>                 26m   v1.20.6+k3s1
kube2-worker   Ready    <none>                 20m   v1.20.6+k3s1
msi55          Ready    control-plane,master   25h   v1.20.6+k3s1
brixone        Ready    control-plane,master   28d   v1.20.5+k3s1
corosync       Ready    control-plane,master   28d   v1.20.5+k3s1
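For the scheduling problem itself, a few generic checks can help narrow down where new pods get stuck (the pod name and namespace below are placeholders):

kubectl get pods -A -o wide | grep -v Running                              # list pods that are Pending or otherwise stuck
kubectl describe pod <pending-pod> -n <namespace>                          # the Events section usually states why scheduling fails
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50   # recent cluster-wide events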