kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Reboot delay between two nodes #874

Closed by joysaha1994 6 months ago

joysaha1994 commented 9 months ago

Hi Team,

I have deployed kured in an AKS cluster that is configured across multiple availability zones. The user node pool in the cluster has four nodes. My Java-based application runs 4 replicas, one on each node, enforced with pod anti-affinity. I need a delay between node reboots when kured restarts the nodes, because the Kubernetes health probes for my application generally take about 4 minutes to mark a pod healthy, and I want zero application downtime during patching.

Can anyone please help me configure a reboot delay between two nodes in kured, so that after restarting one node it waits 5 minutes before restarting the next? In the meantime my application pod would become ready to accept traffic again.

ckotzbauer commented 9 months ago

The --drain-delay option is what you are looking for: https://kured.dev/docs/configuration/

joysaha1994 commented 9 months ago

Thank you for this info. Will it also control the reboot delay between two nodes? As I mentioned, my application has 4 replicas running on 4 individual nodes, and the kured pods run on those nodes as a DaemonSet. My requirement is that after rebooting one node, kured waits 5 minutes before rebooting the next node. I have configured --concurrency=1 in the kured daemonset.yaml.

ckotzbauer commented 9 months ago

This option is not specifically a delay between reboots, but a delay between acquiring the node lock (which happens when kured has detected that a reboot is needed and no other node is rebooting at that time) and the actual draining of that node. So the flow is the following:

Reboot-Sentinel detected for Node 1
Node-Lock is acquired for Node 1
Wait for drain-delay (e.g. 5 mins)
Drain Node 1
Reboot Node 1
Node 1 comes back and your pod starts working again

Reboot-Sentinel detected for Node 2
Node-Lock is acquired for Node 2
Wait for drain-delay (e.g. 5 mins)
Drain Node 2
Reboot Node 2
Node 2 comes back and your pod starts working again
...

The drain (that is, the actual termination of one of your pods) is delayed by the given time, which gives your workload time to recover from the previous reboot.
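
To make the timing in the flow above concrete, here is a rough Python sketch that lays the steps out on a timeline and measures how long each pod has to recover before the next drain begins. It is only an illustration, not kured code: the function names, the node names, and the 1-minute drain and 3-minute reboot durations are made up, and it assumes the next node acquires the lock the moment the previous node is back (in a real cluster the handover would likely take a bit longer).

```python
from datetime import timedelta


def reboot_timeline(nodes, drain_delay, drain_time, reboot_time):
    """Return (node, step, offset) tuples for strictly sequential reboots.

    Assumption: the next node gets the lock as soon as the previous
    node is back, so the offsets are a lower bound on real timings.
    """
    events = []
    clock = timedelta(0)
    for node in nodes:
        events.append((node, "lock acquired", clock))
        clock += drain_delay  # wait configured via --drain-delay
        events.append((node, "drain starts", clock))
        clock += drain_time  # hypothetical time to evict the pods
        events.append((node, "reboot starts", clock))
        clock += reboot_time  # hypothetical time until the node is back
        events.append((node, "node back", clock))
    return events


def recovery_windows(events):
    """Gap between one node coming back and the next node's drain start."""
    back = [t for _, step, t in events if step == "node back"]
    drains = [t for _, step, t in events if step == "drain starts"]
    return [d - b for b, d in zip(back, drains[1:])]


nodes = ["node-1", "node-2", "node-3", "node-4"]
events = reboot_timeline(
    nodes,
    drain_delay=timedelta(minutes=5),
    drain_time=timedelta(minutes=1),
    reboot_time=timedelta(minutes=3),
)

pod_ready_time = timedelta(minutes=4)  # probe time mentioned in the question
for node, step, offset in events:
    print(f"{offset}  {node}: {step}")
for window in recovery_windows(events):
    print("recovery window:", window, "enough:", window >= pod_ready_time)
```

Under these assumptions the recovery window between two nodes is equal to the drain delay, so a 5-minute --drain-delay would cover a pod that needs about 4 minutes to become ready.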

github-actions[bot] commented 7 months ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).