kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

V1.15.1 does not seem to do any rebooting. #944

Open llyons opened 2 weeks ago

llyons commented 2 weeks ago

We have pulled down v1.15.1 of kured and installed it on k3s v1.29.2. The cluster is based on Alma Linux 9 machines.

We have it installed on the control planes and the worker. This is an unusual cluster in that there are 3 control planes in HA mode and 1 worker: machines 1, 2, and 3 are control planes and 4 is the worker.

The configuration of the kured command is:

command:
            - /usr/bin/kured
            - --reboot-sentinel=/sentinel/reboot-required
#            - --force-reboot=false
#            - --drain-grace-period=-1
#            - --skip-wait-for-delete-timeout=0
#            - --drain-delay=0
#            - --drain-timeout=0
#            - --drain-pod-selector=""
#            - --period=1h
#            - --ds-namespace=kube-system
#            - --ds-name=kured
#            - --lock-annotation=weave.works/kured-node-lock
#            - --lock-ttl=0
#            - --prometheus-url=http://prometheus.monitoring.svc.cluster.local/
#            - --alert-filter-regexp=^RebootRequired$
#            - --alert-filter-match-only=false
#            - --alert-firing-only=false
#            - --prefer-no-schedule-taint=""
#            - --reboot-sentinel-command=""
#            - --reboot-method=command
#            - --reboot-signal=39
#            - --slack-hook-url=https://hooks.slack.com/...
#            - --slack-username=prod
#            - --slack-channel=alerting
#            - --notify-url="" # See also shoutrrr url format
            - --message-template-drain=Draining node %s
            - --message-template-reboot=Rebooting node %s
            - --message-template-uncordon=Node %s rebooted & uncordoned successfully!
#            - --blocking-pod-selector=runtime=long,cost=expensive
#            - --blocking-pod-selector=name=temperamental
#            - --blocking-pod-selector=...
            - --reboot-days=sun,mon,tue,wed,thu,fri,sat
            - --reboot-delay=90s
            - --start-time=10pm
            - --end-time=1am
            - --time-zone=America/Chicago
#            - --annotate-nodes=false
#            - --lock-release-delay=30m
            - --log-format=text
#            - --metrics-host=""
#            - --metrics-port=8080
#            - --concurrency=1

I did the recommended test with sudo touch /var/run/reboot-required on a control plane node and the worker.
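
For reference, the test was roughly the following (the name=kured label selector is an assumption about how the DaemonSet pods are labelled; the host path corresponds to /sentinel/reboot-required inside the pod per the config above):

# On the node under test, create the sentinel file kured watches
sudo touch /var/run/reboot-required

# Then follow the kured logs and wait for the next hourly check
kubectl -n kube-system logs -l name=kured --prefix -f | grep "Reboot required"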

The pods are all running:

kube-system      kured-8ms5k                                 1/1     Running   0               15h
kube-system      kured-rsf2n                                 1/1     Running   0               15h
kube-system      kured-vlntd                                 1/1     Running   0               15h
kube-system      kured-xrswc                                 1/1     Running   0               15h

The logs from the 2 machines where we created the reboot-required sentinel show this.

abal-kuber03

time="2024-06-20T21:43:42Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-06-20T21:43:42Z" level=info msg="Kubernetes Reboot Daemon: 1.15.1"
time="2024-06-20T21:43:42Z" level=info msg="Node ID: abal-kuber03.olh.local"
time="2024-06-20T21:43:42Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2024-06-20T21:43:42Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-06-20T21:43:42Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-06-20T21:43:42Z" level=info msg="PreferNoSchedule taint: "
time="2024-06-20T21:43:42Z" level=info msg="Blocking Pod Selectors: []"
time="2024-06-20T21:43:42Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 22:00 and 01:00 America/Chicago"
time="2024-06-20T21:43:42Z" level=info msg="Reboot check command: [test -f /sentinel/reboot-required] every 1h0m0s"
time="2024-06-20T21:43:42Z" level=info msg="Concurrency: 1"
time="2024-06-20T21:43:42Z" level=info msg="Reboot method: command"
time="2024-06-20T21:43:42Z" level=info msg="Reboot signal: 39"
time="2024-06-21T03:41:03Z" level=info msg="Reboot required"
time="2024-06-21T03:41:03Z" level=warning msg="Lock already held: abal-kuber04.olh.local"
time="2024-06-21T04:41:03Z" level=info msg="Reboot required"
time="2024-06-21T04:41:03Z" level=warning msg="Lock already held: abal-kuber04.olh.local"
time="2024-06-21T05:41:03Z" level=info msg="Reboot required"
time="2024-06-21T05:41:03Z" level=warning msg="Lock already held: abal-kuber04.olh.local"

abal-kuber04

.
.
.
evicting pod kong/kong-gateway-679bd4564c-kb7zk
evicting pod kong/kong-gateway-679bd4564c-xwnxk
error when evicting pods/"kong-gateway-679bd4564c-xwnxk" -n "kong" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"kong-gateway-679bd4564c-kb7zk" -n "kong" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

It looks like it did disable scheduling on the lone worker node, kuber04, and that was the state I found it in this morning.

What are we missing?

llyons commented 2 weeks ago

I went ahead and disabled the PDB in this case and will test again.

When it determined that it could not drain the kuber04 worker node because of the PDB, I would have thought it would eventually time out, uncordon kuber04, and then move on to the kuber03 control plane.

Am I able to configure this kind of behavior? It would be preferable to leaving the worker node with scheduling disabled.
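
For example, would uncommenting the drain flags from the config above give that behavior? Something like the following is my guess (values picked arbitrarily, and I have not verified whether kured uncordons the node and releases the lock when a drain is aborted):

            - --drain-timeout=10m
#            - --force-reboot=true   # alternatively: reboot even if the drain fails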