Closed: alexellis closed this issue 3 years ago
It appears that datastore operations are timing out after 10 seconds:
Oct 06 10:19:33 t-101 k3s[26504]: Trace[1696873226]: [10.00427016s] [10.00427016s] END
Oct 06 10:19:33 t-101 k3s[26504]: I1006 10:19:33.524475 26504 trace.go:205] Trace[621052234]: "GuaranteedUpdate etcd3" type:*core.Node (06-Oct-2020 10:19:23.510) (total time: 10013ms):
Oct 06 10:19:33 t-101 k3s[26504]: Trace[621052234]: [10.013355979s] [10.013355979s] END
While this can be caused by insufficient CPU resources, a more frequent cause is insufficient disk IO bandwidth. Low-end SD cards frequently cannot keep up with the demands of hosting the datastore, workload, and image pulls at the same time. If you're already using USB storage, you might try putting the datastore (/var/lib/rancher/k3s/server/db) on a different device, or using an external SQL back-end.
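For anyone trying the first option, a rough sketch of relocating the embedded datastore to a faster device; the /mnt/usb-ssd mount point is just a placeholder for whatever storage you have:

```sh
# Stop k3s so the SQLite files aren't being written to while they're moved
sudo systemctl stop k3s

# Copy the datastore onto the faster device (path is an example)
sudo mkdir -p /mnt/usb-ssd/k3s-db
sudo rsync -a /var/lib/rancher/k3s/server/db/ /mnt/usb-ssd/k3s-db/

# Bind-mount it back into place so k3s still finds it at the usual path
sudo mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db.old
sudo mkdir /var/lib/rancher/k3s/server/db
sudo mount --bind /mnt/usb-ssd/k3s-db /var/lib/rancher/k3s/server/db
# (add a matching entry to /etc/fstab so the bind mount survives a reboot)

sudo systemctl start k3s
```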
This looks like it is hitting the 30-second timeout for the network policy controller (d713683614fad5aa8cf6d3af467d81d4b6bdfda2). Edit: it should probably be increased.
@erikwilson yes that's where it's dying, but I'm generally suspicious of any datastore operation taking 10 seconds to complete. If it wasn't the NPC timing out, it'd be something else.
The Compute Module 3+ has a 32GB eMMC rather than an SD card, but k3s has worked reasonably well up to this point with 1.17 on other similar RPi3s.
Are you running it with the default kine sqlite datastore?
You might leave iostat -xd 10 running for a while. Between the tps, await, and %util statistics, you should be able to tell pretty easily whether the storage is overloaded.
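For reference, something along these lines works; the device name below is just an example, check lsblk for the actual boot device:

```sh
# Install sysstat if iostat isn't present (Debian/Raspbian)
sudo apt-get install -y sysstat

# Extended per-device stats, refreshed every 10 seconds
iostat -xd 10

# Or restrict output to the boot device (name is an example)
iostat -xd mmcblk0 10
```

Sustained %util near 100% together with large await values is a good sign the storage is saturated.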
I've set it up again using DigitalOcean MySQL as the datastore, with two servers and 5 agents. It's still running, but some of the etcd writes are taking 3-6s+.
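For anyone reproducing this, an external MySQL datastore is configured via --datastore-endpoint; the credentials, hostname, and database name below are placeholders:

```sh
# k3s server pointed at an external MySQL-compatible datastore instead of the embedded SQLite
k3s server \
  --datastore-endpoint="mysql://username:password@tcp(hostname:3306)/k3s" \
  --datastore-cafile=/path/to/ca.pem   # DigitalOcean managed MySQL typically requires TLS
```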
I'll have a look at iostat when converting back to SQLite
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Environmental Info:
K3s Version: v1.19.1+k3s1
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
7x nodes - 1x master
Describe the bug:
Steps To Reproduce:
Installed k3s, inlets-operator and openfaas.
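Roughly the following, assuming the standard install script and arkade were used; the exact commands and agent join details aren't captured in the report, so treat this as a sketch:

```sh
# On the master
curl -sfL https://get.k3s.io | sh -

# On each agent (URL and token are placeholders)
curl -sfL https://get.k3s.io | K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<node-token> sh -

# Apps on top (provider/credential flags for inlets-operator omitted)
arkade install inlets-operator
arkade install openfaas
```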
Expected behavior:
Stable runtime
Actual behavior:
To begin with, everything worked as expected; then, after a few hours, I noticed the k3s service on the master crashing over and over again with the logs above.
Additional context / logs: