k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Agent fails to update lease, 127.0.0.1:6444 timeout and "object has been modified" #3726

Closed: ohthehugemanatee closed this issue 2 years ago

ohthehugemanatee commented 3 years ago

Environmental Info:
K3s Version: v1.21.3+k3s1 (1d1f220f), go version go1.16.6

Node(s) CPU architecture, OS, and Version: Linux airbernetes 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux (an old macbook air)

Cluster Configuration: 4 nodes, mixed ARM/x64, one master on a Raspberry Pi 4 (not this node)

Describe the bug: After a few hours, pods on this node can no longer reach kube DNS. Restarting the k3s-agent service resolves it.
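One quick way to confirm the symptom (illustrative only; the node name and busybox image are just what I would reach for) is to pin a throwaway pod to the affected node and try an in-cluster lookup:

$ kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"airbernetes"}}' \
    -- nslookup kubernetes.default
# While the node is in the bad state the lookup times out; the same command
# scheduled on a healthy node resolves the cluster service IP immediately.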

journalctl -xe --unit=k3s-agent has these errors:

grep k3s syslog |grep error
Jul 29 00:18:34 airbernetes k3s[140351]: E0729 00:18:34.982922  140351 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jul 29 00:45:38 airbernetes k3s[140351]: E0729 00:45:38.003907  140351 server.go:269] "Unable to authenticate the request due to an error" err="context canceled"
Jul 29 01:17:05 airbernetes k3s[140351]: time="2021-07-29T01:17:05.973637530Z" level=error msg="Remotedialer proxy error" error="read tcp 192.168.1.186:37814->192.168.1.41:6443: i/o timeout"
Jul 29 01:53:19 airbernetes k3s[140351]: E0729 01:53:19.374094  140351 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jul 29 01:53:19 airbernetes k3s[140351]: E0729 01:53:19.791063  140351 controller.go:187] failed to update lease, error: Operation cannot be fulfilled on leases.coordination.k8s.io "airbernetes": the object has been modified; please apply your changes to the latest version and try again
Jul 29 02:03:45 airbernetes k3s[140351]: E0729 02:03:45.451252  140351 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jul 29 02:51:34 airbernetes k3s[140351]: E0729 02:51:34.209306  140351 remote_runtime.go:144] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"f281332ab954a6b071752be56b79288563e8a26c96a185372606ec094f1f5b3a\": failed to Statfs \"/var/run/netns/cni-598e4228-480a-1963-9c56-61e666f33827\": no such file or directory" podSandboxID="f281332ab954a6b071752be56b79288563e8a26c96a185372606ec094f1f5b3a"
Jul 29 02:51:34 airbernetes k3s[140351]: E0729 02:51:34.209409  140351 kubelet.go:1594] error killing pod: failed to "KillPodSandbox" for "d7a0ff62-0324-4a4f-aa50-bae7d2490ccd" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"f281332ab954a6b071752be56b79288563e8a26c96a185372606ec094f1f5b3a\": failed to Statfs \"/var/run/netns/cni-598e4228-480a-1963-9c56-61e666f33827\": no such file or directory"
Jul 29 02:51:34 airbernetes k3s[140351]: E0729 02:51:34.209445  140351 pod_workers.go:190] "Error syncing pod, skipping" err="error killing pod: failed to \"KillPodSandbox\" for \"d7a0ff62-0324-4a4f-aa50-bae7d2490ccd\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"f281332ab954a6b071752be56b79288563e8a26c96a185372606ec094f1f5b3a\\\": failed to Statfs \\\"/var/run/netns/cni-598e4228-480a-1963-9c56-61e666f33827\\\": no such file or directory\"" pod="default/photoprism-index-27125400-sp9jf" podUID=d7a0ff62-0324-4a4f-aa50-bae7d2490ccd
Jul 29 03:03:44 airbernetes k3s[140351]: E0729 03:03:44.344404  140351 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes?timeout=10s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jul 29 06:23:02 airbernetes k3s[140351]: E0729 06:23:02.256910  140351 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6444/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

(192.168.1.186 is this problem node, 192.168.1.41 is the master node)
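For completeness, the lease object being fought over can be inspected directly with plain kubectl (the roughly 10 s renew interval noted in the comment is the kubelet default, not something k3s-specific):

$ kubectl get lease airbernetes -n kube-node-lease -o yaml
# spec.renewTime normally advances about every 10 seconds; when the agent hits
# the 127.0.0.1:6444 timeouts above it falls behind, and the "object has been
# modified" error is most likely just an optimistic-concurrency conflict on the
# retry (the timed-out Put actually landed server-side).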

In the master node's log, I get this at the time of the Remotedialer proxy error (note that airbernetes logs in UTC and the master in local time, UTC+2):

Jul 29 03:17:06 cluster1 k3s[19749]: time="2021-07-29T03:17:06.196207023+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Jul 29 03:17:06 cluster1 k3s[19749]: time="2021-07-29T03:17:06.211926159+02:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"

The lease update error on airbernetes also has corresponding errors on cluster1:

Jul 29 02:18:34 cluster1 k3s[19749]: E0729 02:18:34.989713   19749 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
Jul 29 02:18:34 cluster1 k3s[19749]: E0729 02:18:34.989778   19749 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: ht
Jul 29 02:18:35 cluster1 k3s[19749]: E0729 02:18:35.030187   19749 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
Jul 29 02:18:35 cluster1 k3s[19749]: I0729 02:18:35.032453   19749 trace.go:205] Trace[763916375]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes,us
Jul 29 02:18:35 cluster1 k3s[19749]: Trace[763916375]: [9.978522359s] [9.978522359s] END
Jul 29 02:18:37 cluster1 k3s[19749]: I0729 02:18:37.381195   19749 trace.go:205] Trace[1136670744]: "GuaranteedUpdate etcd3" type:*coordination.Lease (29-Jul-2021 02:18:35.122) (total time:
Jul 29 02:18:37 cluster1 k3s[19749]: Trace[1136670744]: ---"Transaction committed" 2254ms (02:18:00.380)
Jul 29 02:18:37 cluster1 k3s[19749]: Trace[1136670744]: [2.258046799s] [2.258046799s] END
Jul 29 02:18:37 cluster1 k3s[19749]: I0729 02:18:37.381799   19749 trace.go:205] Trace[1976164949]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/airbernetes,u
Jul 29 02:18:37 cluster1 k3s[19749]: Trace[1976164949]: ---"Object stored in database" 2261ms (02:18:00.381)
Jul 29 02:18:37 cluster1 k3s[19749]: Trace[1976164949]: [2.261794771s] [2.261794771s] END

Steps To Reproduce: I can only reproduce this on one node! All other nodes, including another Ubuntu 20.04 node, are working fine. The failing node's agent was installed with curl -sfL https://get.k3s.io | K3S_URL=https://cluster.vert:6443 K3S_TOKEN=foobar::server:foobar sh -.

On this one node, even uninstalling k3s and reinstalling doesn't stop the recurring problem.
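For reference, the uninstall/reinstall cycle here is the standard one for a get.k3s.io install (the uninstall script path below is the installer's default on agent nodes):

$ sudo /usr/local/bin/k3s-agent-uninstall.sh   # removes the agent service and its data under /var/lib/rancher/k3s
$ curl -sfL https://get.k3s.io | K3S_URL=https://cluster.vert:6443 K3S_TOKEN=foobar::server:foobar sh -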

Expected behavior: The agent keeps its node lease renewed and pods on the node can reach kube DNS without needing a restart of k3s-agent.

Actual behavior: See the description. The node continues to receive pod assignments, which fail because nothing can reach the internal nameserver.

Additional context / logs:

$ sudo hdparm -tT /dev/sda2

/dev/sda2:
 Timing cached reads:   10960 MB in  1.99 seconds = 5495.38 MB/sec
 Timing buffered disk reads: 364 MB in  3.02 seconds = 120.65 MB/sec

$ dd if=/dev/zero of=/tmp/test.bin bs=8k count=128k status=progress
1070653440 bytes (1.1 GB, 1021 MiB) copied, 4 s, 268 MB/s
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.05224 s, 265 MB/s


The server node also seems OK for a Raspberry Pi MMC card:

$ df -h | grep /dev/root
/dev/root        59G   37G   20G  66% /

$ sudo hdparm -Tt /dev/mmcblk0

/dev/mmcblk0:
 Timing cached reads:   1526 MB in  2.00 seconds = 762.96 MB/sec
 HDIO_DRIVE_CMD(identify) failed: Invalid argument
 Timing buffered disk reads: 130 MB in  3.01 seconds = 43.15 MB/sec

$ dd if=/dev/zero of=/tmp/test.bin bs=8k count=128k status=progress
1068015616 bytes (1.1 GB, 1019 MiB) copied, 181 s, 5.9 MB/s
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 181.717 s, 5.9 MB/s



**Backporting**

Unsure until the root problem is resolved. Likely not.

- [ ] Needs backporting to older releases
brandond commented 3 years ago

It looks like the agent is getting disconnected from the server. Is it on a wireless network or something?

ohthehugemanatee commented 3 years ago

Nope, gigabit Ethernet on the same switch as all my other nodes. BUT it is a USB Ethernet interface. And now that I dig for it, dmesg shows usb 2-1-port1: disabled by hub (EMI?), re-enabling.., but only rarely (once since July 28), and not coinciding with these disconnects.

What makes it look like the agent is getting disconnected? Is node leasing really so fragile that one disconnect every several days will break it invisibly?

brandond commented 3 years ago

It's really hard to diagnose a systemic issue from just a few log snippets. The websocket closure messages from the server node make me believe it's getting disconnected for long enough that TCP sessions are dropping - which would be pretty long. Have you tried leaving a ping going between the two nodes, to see if you see any periods of sustained packet loss?
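Something along these lines would do it (a sketch; 192.168.1.41 is the master from the logs above, and the log path is arbitrary):

$ ping -D -O -i 1 192.168.1.41 | tee /tmp/ping-master.log
# -D prefixes every line with a UNIX timestamp and -O reports missed replies
# ("no answer yet"), so sustained gaps can later be lined up against the
# lease errors in journalctl.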

ohthehugemanatee commented 3 years ago

Understood that it's hard to diagnose complex systems like this from the scraps of info I can give you. I appreciate your help.

I already have monitoring (netdata) that yells at me when a node disconnects, or even when TCP buffers get too full, so I don't suspect a real disconnect. The disabled by hub (EMI?), re-enabling... message apparently maps to general flakiness in the USB driver.

Let me know if there are other logs I can get you!

ohthehugemanatee commented 3 years ago

Replacing the NIC device didn't help. On a whim I tried moving the database on the master RPi 4 to an external USB 3 HDD, and... hey presto, I've gone 48 hours without this problem!
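For anyone who wants to try the same move, a minimal sketch of what that looks like (the /mnt/usbhdd mount point is just an example; the datastore path is the k3s server default):

$ sudo systemctl stop k3s
$ sudo rsync -a /var/lib/rancher/k3s/server/db/ /mnt/usbhdd/k3s-db/
$ echo '/mnt/usbhdd/k3s-db /var/lib/rancher/k3s/server/db none bind 0 0' | sudo tee -a /etc/fstab
$ sudo mount /var/lib/rancher/k3s/server/db    # bind-mount the faster disk over the default path
$ sudo systemctl start k3s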

Since running on Raspberry Pi clusters is a significant use case for k3s, and I can see in the issue queue that I'm not the only person who has run into storage speed problems, I suggest either some kind of install-time check for minimum DB storage speed, or some extra leeway on DB writes as part of the normal configuration; the kind of check I have in mind is sketched below. Happy to help test whatever the maintainers decide here.
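A concrete example of that check: a sync-latency probe against the default server datastore path, using dd's dsync flag (the ~10 ms figure is etcd's general disk guidance, not a k3s-specific number):

$ sudo dd if=/dev/zero of=/var/lib/rancher/k3s/server/db/fsync-test.bin bs=512 count=1000 oflag=dsync
# Every 512 B write is flushed to disk before the next one starts, so the result
# reflects sync latency (what the datastore cares about) rather than sequential
# bandwidth; etcd's rule of thumb is staying under roughly 10 ms per sync, and a
# slow SD card can be an order of magnitude worse, which lines up with the
# multi-second Update traces on cluster1 above.
$ sudo rm /var/lib/rancher/k3s/server/db/fsync-test.bin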

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.