It looks like the agent is getting disconnected from the server. Is it on a wireless network or something?
nope, Gigabit on the same switch as all my other nodes. BUT it is a USB ethernet interface. And now that I dig for it, in dmesg I get usb 2-1-port1: disabled by hub (EMI?), re-enabling... - but rarely (once since July 28), and not coinciding with these disconnects.
What makes it look like the agent is getting disconnected? Is node leasing really so fragile that one disconnect every several days will break it invisibly?
It's really hard to diagnose a systemic issue from just a few log snippets. The websocket closure messages from the server node make me believe it's getting disconnected for long enough that TCP sessions are dropping - which would be pretty long. Have you tried leaving a ping going between the two nodes, to see if you see any periods of sustained packet loss?
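If you want hard data on that, something like the following left running on the agent will timestamp every reply so any gaps stand out (a rough sketch assuming iputils ping; 192.168.1.41 is just the master's LAN address from your logs):
$ ping -D -i 1 192.168.1.41 | tee /tmp/ping-master.log
# -D prefixes each reply with a unix timestamp; afterwards, grep the log for missing icmp_seq numbers or large timestamp jumps.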
Understood that it's hard to diagnose complex systems like this on the scraps of info I can give you. I appreciate your help.
I already have monitoring (netdata) that yells at me when a node is disconnected, or even when TCP buffers are filling up... so I don't suspect a real disconnect. The disabled by hub (EMI?), re-enabling... message apparently maps to general flakiness in the USB driver.
Let me know if there are other logs I can get you!
Replacing the NIC device didn't help. On a whim I tried moving the database on the Master rpi4 to an external USB3 HDD, and... hey presto I've gone 48 hours without this problem!
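For anyone else wanting to try the same thing, the rough shape of it (assuming the default /var/lib/rancher/k3s data dir and the HDD mounted at /mnt/usb-hdd - adjust paths to your setup) is:
$ sudo systemctl stop k3s
$ sudo rsync -a /var/lib/rancher/k3s/ /mnt/usb-hdd/k3s/
$ sudo mv /var/lib/rancher/k3s /var/lib/rancher/k3s.bak
$ sudo ln -s /mnt/usb-hdd/k3s /var/lib/rancher/k3s    # or start the server with --data-dir /mnt/usb-hdd/k3s instead of symlinking
$ sudo systemctl start k3s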
Since running on RPI clusters is a significant use case for k3s, and I see in the issue queue that I'm not the only person who's run into storage speed problems... I suggest either some kind of check at install time for minimum DB storage speed, or perhaps some extra leeway with DB writes as a normal part of the configuration. Happy to help test whatever the maintainers decide with this.
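As a strawman for what such a check could look like: etcd's guidance is to measure fsync latency on the datastore directory with fio, and the same kind of number should matter for the embedded sqlite store. A sketch (requires the fio package; the size/bs values are the ones etcd's docs use, and the directory is just the default data dir):
$ sudo fio --name=dbcheck --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s --size=22m --bs=2300
# check the fsync/fdatasync latency percentiles in the output; etcd's rule of thumb is a p99 under roughly 10ms.
A decent SSD passes that easily; a cheap SD card typically doesn't, which presumably would have flagged my master node up front.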
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Environmental Info:
K3s Version: v1.21.3+k3s1 (1d1f220f), go version go1.16.6
Node(s) CPU architecture, OS, and Version: Linux airbernetes 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux (an old MacBook Air)
Cluster Configuration: 4 nodes, mixed ARM/x64, one master on a Raspberry Pi 4 (not this node)
Describe the bug: After a few hours, pods on this node can no longer reach kube DNS. Restarting the k3s-agent service resolves it.
journalctl -xe --unit=k3s-agent has these errors (192.168.1.186 is the problem node, 192.168.1.41 is the master node):
In the master node's log, I get this at the time of the Remotedialer proxy error (note airbernetes is on UTC and the master is on CET):
The lease update error on airbernetes also has corresponding errors on cluster1:
Steps To Reproduce: I can only reproduce on this one node! All other nodes, including another Ubuntu 20.04 node, are working fine. The failing node's agent was installed with
curl -sfL https://get.k3s.io | K3S_URL=https://cluster.vert:6443 K3S_TOKEN=foobar::server:foobar sh -
On this one node, even uninstalling k3s and reinstalling doesn't stop the recurring problem.
Expected behavior: Pods on this node can keep reaching kube DNS indefinitely, without restarting the k3s-agent service.
Actual behavior: See the description. The node continues to receive pod assignments, which fail because nothing can reach the internal nameserver.
Additional context / logs:
#2615 suggests a similar problem is caused by low disk space or disk speed. The failing node's root device seems fine in these regards:
$ sudo hdparm -tT /dev/sda2
/dev/sda2:
 Timing cached reads:   10960 MB in 1.99 seconds = 5495.38 MB/sec
 Timing buffered disk reads: 364 MB in 3.02 seconds = 120.65 MB/sec
$ dd if=/dev/zero of=/tmp/test.bin bs=8k count=128k status=progress
1070653440 bytes (1.1 GB, 1021 MiB) copied, 4 s, 268 MB/s
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.05224 s, 265 MB/s
$ df -h | grep /dev/root
/dev/root        59G   37G   20G  66% /
The same tests on /dev/mmcblk0 (the SD card on the rpi4 master), for comparison:
$ sudo hdparm -Tt /dev/mmcblk0
/dev/mmcblk0:
 Timing cached reads:   1526 MB in 2.00 seconds = 762.96 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Invalid argument
 Timing buffered disk reads: 130 MB in 3.01 seconds = 43.15 MB/sec
$ dd if=/dev/zero of=/tmp/test.bin bs=8k count=128k status=progress
1068015616 bytes (1.1 GB, 1019 MiB) copied, 181 s, 5.9 MB/s
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 181.717 s, 5.9 MB/s
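One caveat on those dd figures: without forcing a sync, dd can report page-cache speed rather than disk speed, and for the datastore it's the synced writes that hurt. A variant that at least flushes before reporting (numbers will differ from the runs above):
$ dd if=/dev/zero of=/tmp/test.bin bs=8k count=128k conv=fdatasync status=progress
# conv=fdatasync makes dd call fdatasync before printing its rate, so the MB/s isn't inflated by RAM caching;
# oflag=dsync (syncing every 8k write, closer to how a database behaves) would be harsher still on an SD card.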