k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s.service: Main process exited, code=exited, status=255/EXCEPTION #2853

Closed: BoBiene closed this issue 2 years ago

BoBiene commented 3 years ago

I have tested this in several setups with k3s v1.19.7+k3s1 and v1.20.0+k3s2, but the nodes keep failing after a few hours of running.

Environmental Info:
K3s Version: v1.20.0+k3s2 (2ea6b163), go version go1.15.5

Node(s) CPU architecture, OS, and Version:
Linux admin-ln03 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 20.04.1 LTS (Focal Fossa), hosted on Hetzner Cloud

Cluster Configuration: 3 server (leader) nodes, 1 worker node, and a load balancer for the API.

Describe the bug: k3s crashes after a few hours of operation with k3s.service: Main process exited, code=exited, status=255/EXCEPTION

Steps To Reproduce:

Expected behavior: k3s should not crash

Actual behavior: k3s crashes after a few hours of operation with k3s.service: Main process exited, code=exited, status=255/EXCEPTION

Additional context / logs:

Uptimes

*****************************
on admin-ln01
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Sun 2021-01-24 23:06:21 CET; 8h ago

*****************************
on admin-ln02
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Mon 2021-01-25 03:47:09 CET; 3h 36min ago

*****************************
on admin-ln03
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Mon 2021-01-25 03:47:13 CET; 3h 36min ago

Logfiles

journalctl --since 03:46 --until 03:48 (all IP addresses replaced with <IP-Blocked>)

admin-ln01.log admin-ln02.log admin-ln03.log

brandond commented 3 years ago

At first glance, I see UFW blocking a lot of traffic, including on the flannel interface. Have you confirmed that all the appropriate ports and interfaces are open in the ufw configuration?

BoBiene commented 3 years ago

@brandond The UFW configuration is:

ufw allow from 10.43.0.0/16
ufw allow from 10.42.0.0/16
ufw allow from 10.0.0.0/16 
ufw allow from 10.244.0.0/16 

ufw -f default deny incoming
ufw -f default allow outgoing
ufw -f default allow forward
brandond commented 3 years ago

Well, I can't really tell what's going on because you've redacted all the IPs with the same value... but ufw is blocking traffic on the CNI interface, and both etcd and the k3s tunnel proxy are logging connection errors. Maybe start by disabling ufw, see if everything works, and if it does, figure out what else you need to open up?
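For reference, a minimal sketch of the additional ufw rules that the K3s documented inbound ports would need on this kind of HA setup (assuming the default flannel VXLAN backend and embedded etcd; tighten the allowed source ranges to your node subnet as appropriate):

ufw allow 6443/tcp        # Kubernetes API server
ufw allow 8472/udp        # flannel VXLAN traffic between nodes
ufw allow 10250/tcp       # kubelet metrics
ufw allow 2379:2380/tcp   # embedded etcd client/peer traffic (server nodes only)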

BoBiene commented 3 years ago

Here is the log from a different cluster:

k3s version v1.19.7+k3s1 (5a00e38d)

*****************************
on dev-ln01
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Mon 2021-01-25 06:43:22 CET; 2h 26min ago

*****************************
on dev-ln02
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Sun 2021-01-24 18:10:36 CET; 14h ago

*****************************
on dev-ln03
executing: systemctl status k3s --no-block -n 0 | grep Active:
*****************************
     Active: active (running) since Sun 2021-01-24 21:06:27 CET; 12h ago

Logs: journalctl --since 06:42 --until 06:45 (dev-ln01.log)

BoBiene commented 3 years ago

@brandond On this cluster, ufw is only blocking traffic on eth0.

brandond commented 3 years ago

On this cluster, your disk is too slow:


Jan 25 06:42:29 dev-ln01 k3s[2189071]: {"level":"warn","ts":"2021-01-25T06:42:29.227+0100","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"680.988822ms","expected-duration":"100ms","prefix":"","request":"header:<ID:8912191783742552676 username:\"etcd-client\" auth_revision:1 > lease_grant:<ttl:15-second id:7bae773560213a63>","response":"size:40"}
Jan 25 06:42:36 dev-ln01 k3s[2189071]: {"level":"warn","ts":"2021-01-25T06:42:36.000+0100","caller":"wal/wal.go:806","msg":"slow fdatasync","took":"7.454439372s","expected-duration":"1s"}
Jan 25 06:42:41 dev-ln01 k3s[2189071]: E0125 06:42:39.880363 2189071 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://<IP-Blocked>:6444/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 25 06:42:41 dev-ln01 k3s[2189071]: E0125 06:42:39.880390 2189071 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
Jan 25 06:42:41 dev-ln01 k3s[2189071]: {"level":"warn","ts":"2021-01-25T06:42:39.880+0100","caller":"etcdserver/util.go:163","msg":"apply request took too long","took":"9.997793255s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" ","response":"","error":"context canceled"}
Jan 25 06:42:41 dev-ln01 k3s[2189071]: {"level":"info","ts":"2021-01-25T06:42:39.880+0100","caller":"traceutil/trace.go:145","msg":"trace[301670872] range","detail":"{range_begin:/registry/services/endpoints/kube-system/kube-controller-manager; range_end:; }","duration":"9.9979363s","start":"2021-01-25T06:42:29.882+0100","end":"2021-01-25T06:42:39.880+0100","steps":["trace[301670872] 'agreement among raft nodes before linearized reading'  (duration: 9.997851102s)"]}
Jan 25 06:42:41 dev-ln01 k3s[2189071]: I0125 06:42:39.880693 2189071 trace.go:205] Trace[211670244]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38/leader-election,client:<IP-Blocked> (25-Jan-2021 06:42:29.881) (total time: 9998ms):
Jan 25 06:42:41 dev-ln01 k3s[2189071]: Trace[211670244]: [9.998741769s] [9.998741769s] END
BoBiene commented 3 years ago

Is this causing the crash?

brandond commented 3 years ago

Yes. Etcd disk IO is so slow that the controllers cannot renew their leases and everything crashes. The log messages from etcd even tell you what the latencies should be. Have you looked through these yourself yet?
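As a quick check on any node showing this behavior, the k3s journal can be grepped for the etcd and leader-election messages quoted above (assuming the default k3s systemd unit name):

journalctl -u k3s --since "2 hours ago" \
    | grep -E "apply request took too long|slow fdatasync|leaderelection"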

BoBiene commented 3 years ago

Thank you @brandond, I have reported this to the cloud provider and hope they look into it quickly.

brandond commented 3 years ago

Are you using rotational storage or SSD? Are you running IO-intensive workload pods on the same disk? Etcd is quite sensitive to disk latency, as every write to the datastore causes an fsync, which in turn flushes all pending/cached filesystem writes to disk.
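To quantify the fsync latency etcd sees, one option is the fio fdatasync benchmark that the etcd documentation suggests (a sketch assuming fio is installed; the test directory below is arbitrary and can be removed afterwards):

mkdir -p /root/etcd-disk-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/root/etcd-disk-test --size=22m --bs=2300 --name=fsync-test
# etcd's guideline: the 99th percentile of the reported fdatasync durations should stay under roughly 10ms
rm -rf /root/etcd-disk-test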

BoBiene commented 3 years ago

The leader nodes are Hetzner CPX21 instances (https://www.hetzner.com/cloud).

Longhorn is running on the leader nodes and uses the same storage.

brandond commented 3 years ago

Putting Longhorn and etcd on the same disk might be part of the problem. Does it run OK without Longhorn? They're both pretty hard on the disk.
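One way to confirm the contention while Longhorn is still sharing the disk (assuming the sysstat package is available) is to watch device utilization and write latency during a slow period:

iostat -x 1   # watch %util and w_await for the device backing /var/lib/rancher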

BoBiene commented 3 years ago

I have moved Longhorn to separate storage and verified the ufw rules on all cluster nodes. Until now I have had no more restarts since the config change (> 20h).

Thanks @brandond for the quick help 👏👍 I will monitor this for the next seven days.

BoBiene commented 3 years ago

@brandond Today I had one crash on one leader node: dev-ln02.log

Jan 27 09:18:24 dev-ln02 k3s[2041650]: E0127 09:18:24.353457 2041650 leaderelection.go:325] error retrieving resource lock kube-system/cloud-controller-manager: etcdserver: request timed out
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.353736 2041650 trace.go:205] Trace[1630274928]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38/leader-election,client:<IP-Blocked> (27-Jan-2021 09:18:21.555) (total time: 2798ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[1630274928]: ---"About to write a response" 2798ms (09:18:00.353)
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[1630274928]: [2.798420512s] [2.798420512s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.353973 2041650 trace.go:205] Trace[1043530145]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38/leader-election,client:<IP-Blocked> (27-Jan-2021 09:18:20.126) (total time: 4227ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[1043530145]: ---"About to write a response" 4227ms (09:18:00.353)
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[1043530145]: [4.227204684s] [4.227204684s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.354467 2041650 trace.go:205] Trace[115965718]: "Get" url:/api/v1/namespaces/kube-system,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38,client:<IP-Blocked> (27-Jan-2021 09:18:18.045) (total time: 6309ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[115965718]: ---"About to write a response" 6309ms (09:18:00.354)
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[115965718]: [6.309118885s] [6.309118885s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.359666 2041650 trace.go:205] Trace[1917965253]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/longhorn-system/leases/external-attacher-leader-driver-longhorn-io,user-agent:csi-attacher/v0.0.0 (linux/amd64) kubernetes/$Format,client:<IP-Blocked> (27-Jan-2021 09:18:02.461) (total time: 21898ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[1917965253]: [21.898114512s] [21.898114512s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.361205 2041650 trace.go:205] Trace[879600863]: "Patch" url:/api/v1/namespaces/longhorn-system/events/csi-attacher-5dcdcd5984-nv74n.165c840da6697f9d,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38,client:<IP-Blocked> (27-Jan-2021 09:18:07.442) (total time: 16919ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[879600863]: [16.919064848s] [16.919064848s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.362060 2041650 trace.go:205] Trace[370162849]: "Get" url:/api/v1/namespaces/kube-system/configmaps/k3s,user-agent:k3s/v1.19.7+k3s1 (linux/amd64) kubernetes/5a00e38,client:<IP-Blocked> (27-Jan-2021 09:17:54.925) (total time: 29436ms):
Jan 27 09:18:24 dev-ln02 k3s[2041650]: Trace[370162849]: [29.436226271s] [29.436226271s] END
Jan 27 09:18:24 dev-ln02 k3s[2041650]: E0127 09:18:24.363355 2041650 leaderelection.go:325] error retrieving resource lock kube-system/k3s: etcdserver: request timed out
Jan 27 09:18:24 dev-ln02 k3s[2041650]: W0127 09:18:24.428372 2041650 lease.go:233] Resetting endpoints for master service "kubernetes" to [<IP-Blocked> <IP-Blocked> <IP-Blocked>]
Jan 27 09:18:24 dev-ln02 k3s[2041650]: time="2021-01-27T09:18:24.438784443+01:00" level=info msg="Connecting to proxy" url="wss://<IP-Blocked>:6443/v1-k3s/connect"
Jan 27 09:18:24 dev-ln02 k3s[2041650]: time="2021-01-27T09:18:24.445602878+01:00" level=info msg="Handling backend connection request [dev-ln01]"
Jan 27 09:18:24 dev-ln02 k3s[2041650]: time="2021-01-27T09:18:24.448191633+01:00" level=info msg="Handling backend connection request [dev-ln03]"
Jan 27 09:18:24 dev-ln02 k3s[2041650]: time="2021-01-27T09:18:24.451446204+01:00" level=info msg="Handling backend connection request [dev-ln02]"
Jan 27 09:18:24 dev-ln02 k3s[2041650]: I0127 09:18:24.924949 2041650 leaderelection.go:278] failed to renew lease kube-system/k3s: timed out waiting for the condition
Jan 27 09:18:24 dev-ln02 k3s[2041650]: E0127 09:18:24.925536 2041650 leaderelection.go:301] Failed to release lock: resource name may not be empty
Jan 27 09:18:24 dev-ln02 k3s[2041650]: time="2021-01-27T09:18:24.925996568+01:00" level=fatal msg="leaderelection lost for k3s"
Jan 27 09:18:25 dev-ln02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jan 27 09:18:25 dev-ln02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Jan 27 09:18:30 dev-ln02 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 5.
Jan 27 09:18:30 dev-ln02 systemd[1]: Stopped Lightweight Kubernetes.
BoBiene commented 3 years ago

@brandond So currently it seems like if I deploy our application as a Rancher app / Helm chart, the cluster goes down :(

Active: active (running) since Wed 2021-01-27 17:47:39 CET; 6min ago


Active: activating (auto-restart) (Result: exit-code) since Wed 2021-01-27 17:56:30 CET; 1s ago

Any suggestions? Should/can I move the embedded etcd to separate storage?
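For context, the embedded etcd data lives under the default K3s data dir, so in principle a dedicated volume can be mounted over that path. A rough sketch, assuming a new volume shows up as /dev/sdb and a short k3s outage on the node is acceptable (device name and mount point are placeholders):

systemctl stop k3s
mkfs.ext4 /dev/sdb
mkdir -p /mnt/etcd-disk
mount /dev/sdb /mnt/etcd-disk
rsync -a /var/lib/rancher/k3s/server/db/ /mnt/etcd-disk/    # copy existing embedded etcd data
mount --bind /mnt/etcd-disk /var/lib/rancher/k3s/server/db  # add both mounts to /etc/fstab to persist
systemctl start k3s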

brandond commented 3 years ago

It looks like your application is generating more traffic or IO than the node can handle, and it is causing the datastore latency to spike. Are you running workload pods on your K3s server nodes?

BoBiene commented 3 years ago

@brandond Indeed, I had workloads on the K3s server nodes - I have fixed this with a taint on those nodes. But it still does not seem to run stably.
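For reference, the taint K3s documents for keeping workloads off server nodes can be applied to an existing node with kubectl, or passed at install time (the node name below is just one from this thread, used as an example):

kubectl taint nodes dev-ln01 CriticalAddonsOnly=true:NoExecute
# or, at install time, add it to the server install command used above (shown here for the first server):
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.19 sh -s - server \
    --cluster-init --token $CLUSTER_SECRET --disable traefik \
    --node-taint CriticalAddonsOnly=true:NoExecute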

I'm wondering if I did the setup correctly. To create a cluster, I do the following:

  1. Create three servers.
  2. Create a load balancer pointing to all three servers, handling port 6443.
  3. On the first server, install k3s with curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.19 sh -s - server --cluster-init --token $CLUSTER_SECRET --disable traefik
  4. On the other servers, install with curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.19 sh -s - server --server "https://$LOADBALANCER_IP:6443" --token $CLUSTER_SECRET --disable traefik
  5. Join the worker nodes with curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.19 sh -s agent --server "https://$LOADBALANCER_IP:6443" --token $CLUSTER_SECRET

So my "Fixed Registration Address" is the Loadbalancer and all three servers are part of the Loadbalancer (if they are online and serving from 6443)

Could this be the root issue?

brandond commented 3 years ago

That should be fine. The fixed registration address is only required to support agent registration; once a node has been added to the cluster, it communicates directly with all servers. Either a load balancer or a DNS alias is fine.

What sort of instability are you seeing now that you have removed workloads from the control plane nodes?

stale[bot] commented 3 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

theAkito commented 3 years ago

Still relevant.

At my place I have a similar issue with k3s, but I couldn't yet confirm whether it's 100% the same issue. It definitely has the same symptoms, though: a crash after some time with exit code 255.

brandond commented 3 years ago

@theAkito If you are seeing this error, please attach logs. Unless you are in the exact same situation as the original reporter (same host, same OS, same errors in the logs, etc.), you might be best off opening a new issue instead of just bumping this one.

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.