kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

2 out of 8 nodes are not being rebooted (randomly) #950

Open asafbennatan opened 5 days ago

asafbennatan commented 5 days ago

The issue: kured is not rebooting two of the eight nodes.

I am running a k3s cluster based on https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner

Automatic updates are turned on. The current state:

customer-workload-agent-0-orj Ready,SchedulingDisabled

Annotations:    deprecated.daemonset.template.generation: 1
                weave.works/kured-node-lock:
                  {"nodeID":"customer-workload-agent-0-orj","metadata":{"unschedulable":false},"created":"2024-06-28T00:56:27.119697649Z","TTL":0}

The logs for the kured pod running on the node:


time="2024-06-26T02:00:24Z" level=info msg="Reboot not required"
time="2024-06-26T02:05:24Z" level=info msg="Reboot required"
time="2024-06-26T02:05:24Z" level=warning msg="Lock already held: customer-workload-agent-0-orj"

Additionally, checking the logs on the customer-workload-agent-0-orj node:

Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.839223    1239 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap>
Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.840649    1239 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir>
Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.840882    1239 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap>
Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.937724    1239 reconciler_common.go:300] "Volume detached for volume \"health\" (UniqueName: \"kubernetes.io/co>
Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.937803    1239 reconciler_common.go:300] "Volume detached for volume \"config\" (UniqueName: \"kubernetes.io/co>
Jun 25 02:02:06 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:06.937836    1239 reconciler_common.go:300] "Volume detached for volume \"data\" (UniqueName: \"kubernetes.io/empt>
Jun 25 02:02:08 customer-workload-agent-0-orj k3s[1239]: I0625 02:02:08.802147    1239 kubelet_volumes.go:163] "Cleaned up orphaned pod volumes dir" podUID="3cfc18c5-6a64-42c4-8e10-37>
Jun 28 00:54:20 customer-workload-agent-0-orj k3s[1239]: time="2024-06-28T00:54:20Z" level=error msg="Remotedialer proxy error; reconecting..." error="websocket: close 1006 (abnormal >
Jun 28 00:54:21 customer-workload-agent-0-orj k3s[1239]: time="2024-06-28T00:54:21Z" level=info msg="Connecting to proxy" url="wss://10.255.0.102:6443/v1-k3s/connect"

There seems to be a weird gap in the k3s-agent logs, but I am not sure whether this is related. Looking at the general system log with journalctl -xb, I don't see anything abnormal (the node was up during that time).
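
To double-check that gap, the journal for just that window can be pulled with something like the following (the unit name is an assumption; depending on how k3s was installed on the agent it may be k3s-agent or k3s):

journalctl -u k3s-agent --since "2024-06-25" --until "2024-06-28"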

In any case, while I could fix this manually, I would rather resolve the underlying issue.

A note: the kured log shows "Reboot required" on the 26th, but the lock annotation is from the 28th. Additionally, the k3s-agent logs are missing between the 25th and the 28th. Another note: the current setup does not set a TTL for the lock.
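
On the TTL point: if I understand the kured flags correctly, --lock-ttl (default 0, i.e. the lock never expires) would at least let a stale lock expire on its own. A minimal sketch of the DaemonSet container args with it set (flag name and manifest layout assumed from the upstream kured manifest):

    command:
      - /usr/bin/kured
      - --lock-ttl=30m

That would only mask the problem, though, so I'd still like to understand why the lock is stuck.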

I would appreciate any help in solving this, or suggestions on how to gather more information.