k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Raspberry Pi (TuringPi) with Compute Module 3+ crashing v1.19.1+k3s1 #2353

Closed: alexellis closed this issue 3 years ago

alexellis commented 4 years ago

Environmental Info:

K3s Version: v1.19.1+k3s1

Node(s) CPU architecture, OS, and Version:

uname -a

Linux t-101 5.4.51-v7+ #1333 SMP Mon Aug 10 16:45:19 BST 2020 armv7l GNU/Linux

pi@t-101:~ $ sudo cat /boot/cmdline.txt 
console=serial0,115200 console=tty1 root=PARTUUID=89a57ea3-02 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory
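
A quick way to confirm those cgroup flags have taken effect after a reboot (a generic check, assuming a stock Raspberry Pi OS image) is to look at /proc/cgroups:

# The memory and cpuset lines should both show enabled=1 once the cmdline.txt
# flags above are active; k3s typically refuses to start without the memory cgroup.
grep -E 'memory|cpuset' /proc/cgroups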

Cluster Configuration:

7x nodes - 1x master

Describe the bug:

[Screenshots: three captures of the k3s service logs from 2020-10-06 at 10:32 showing the crash]

Steps To Reproduce:

Installed k3s, inlets-operator and openfaas.

Expected behavior:

Stable runtime

Actual behavior:

Everything worked as expected to begin with, but after a few hours I noticed the k3s service on the master crashing over and over again with the logs shown in the screenshots above.

Additional context / logs:

Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.066137   26504 crd_finalizer.go:266] Starting CRDFinalizer
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.066259   26504 dynamic_cafile_content.go:167] Starting client-ca-bundle::/var/lib/rancher/k3s/server/tls/client-ca.crt
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.066351   26504 dynamic_cafile_content.go:167] Starting request-header::/var/lib/rancher/k3s/server/tls/request-header-ca.crt
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.514977   26504 trace.go:205] Trace[1696873226]: "Patch" url:/api/v1/nodes/t-103/status,user-agent:k3s/v1.19.1+k3s1 (linux/arm) kubernetes/b66760f,client:192.168.0.103 (06-Oct-2020 10:19:23.510) (total t
Oct 06 10:19:33 t-101 k3s[26504]: Trace[1696873226]: [10.00427016s] [10.00427016s] END
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.524475   26504 trace.go:205] Trace[621052234]: "GuaranteedUpdate etcd3" type:*core.Node (06-Oct-2020 10:19:23.510) (total time: 10013ms):
Oct 06 10:19:33 t-101 k3s[26504]: Trace[621052234]: [10.013355979s] [10.013355979s] END
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.827680   26504 controller.go:132] OpenAPI AggregationController: action for item : Nothing (removed from the queue).
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.827778   26504 controller.go:132] OpenAPI AggregationController: action for item k8s_internal_local_delegation_chain_0000000000: Nothing (removed from the queue).
Oct 06 10:19:34 t-101 k3s[26504]: time="2020-10-06T10:19:34.506692219+01:00" level=fatal msg="networkpolicies.networking.k8s.io is forbidden: User \"system:k3s-controller\" cannot list resource \"networkpolicies\" in API group \"networking.k8s.io\" at
Oct 06 10:19:35 t-101 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Oct 06 10:19:35 t-101 systemd[1]: k3s.service: Failed with result 'exit-code'.
Oct 06 10:19:35 t-101 systemd[1]: Failed to start Lightweight Kubernetes.
Oct 06 10:19:41 t-101 systemd[1]: k3s.service: Service RestartSec=5s expired, scheduling restart.
Oct 06 10:19:41 t-101 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 5884.
Oct 06 10:19:41 t-101 systemd[1]: Stopped Lightweight Kubernetes.
Oct 06 10:19:41 t-101 systemd[1]: k3s.service: Found left-over process 12067 (containerd-shim) in control group while starting unit. Ignoring.
Oct 06 10:19:41 t-101 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 06 10:19:41 t-101 systemd[1]: k3s.service: Found left-over process 12259 (containerd-shim) in control group while starting unit. Ignoring.
Oct 06 10:19:41 t-101 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
brandond commented 4 years ago

It appears that datastore operations are timing out after 10 seconds:

Oct 06 10:19:33 t-101 k3s[26504]: Trace[1696873226]: [10.00427016s] [10.00427016s] END
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.524475   26504 trace.go:205] Trace[621052234]: "GuaranteedUpdate etcd3" type:*core.Node (06-Oct-2020 10:19:23.510) (total time: 10013ms):
Oct 06 10:19:33 t-101 k3s[26504]: Trace[621052234]: [10.013355979s] [10.013355979s] END

While this can be caused by insufficient CPU resources, a more frequent cause is insufficient disk IO bandwidth. Low-end SD cards frequently cannot keep up with the demands of hosting the datastore, workload, and image pulls at the same time. If you're already using USB storage, you might try putting the datastore (/var/lib/rancher/k3s/server/db) on a different device, or using an external SQL back-end.
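
As a rough sketch of both suggestions (the mount point, database host, and credentials below are placeholders, not taken from this thread):

# Option 1: move the embedded datastore onto faster storage, e.g. a USB SSD
sudo systemctl stop k3s
sudo mv /var/lib/rancher/k3s/server/db /mnt/ssd/k3s-db
sudo ln -s /mnt/ssd/k3s-db /var/lib/rancher/k3s/server/db
sudo systemctl start k3s

# Option 2: run the server against an external SQL backend via --datastore-endpoint
# (MySQL DSN format per the k3s datastore docs; values are placeholders)
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint='mysql://user:password@tcp(db-host:3306)/k3s'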

erikwilson commented 4 years ago

This looks like it is hitting the 30-second timeout for the network policy controller (d713683614fad5aa8cf6d3af467d81d4b6bdfda2). Edit: that timeout should probably be increased.

brandond commented 4 years ago

@erikwilson yes that's where it's dying, but I'm generally suspicious of any datastore operation taking 10 seconds to complete. If it wasn't the NPC timing out, it'd be something else.

alexellis commented 4 years ago

The Compute Module 3+ has a 32GB eMMC rather than an SD card, but k3s has worked reasonably well up to this point with 1.17 on other, similar RPi3s.

brandond commented 4 years ago

Are you running it with the default kine sqlite datastore?

You might leave iostat -xd 10 running for a while. Between the per-device throughput (r/s, w/s), await, and %util statistics, you should be able to tell pretty easily whether the storage is overloaded.
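
For reference, a minimal way to run that on Raspberry Pi OS (assuming the sysstat package, which provides iostat, is not installed yet):

sudo apt-get install -y sysstat
# Extended per-device stats every 10 seconds; sustained high await (or r_await/w_await)
# and %util close to 100 on the eMMC device indicate saturated storage.
iostat -xd 10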

alexellis commented 4 years ago

I've set it up again using DigitalOcean MySQL as the datastore, with two servers and 5 agents. It's still running, but some of the etcd writes are taking 3-6s+.

I'll have a look at iostat when converting back to SQLite.

stale[bot] commented 3 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.