Retry reboot when the node is NotReady and still has SchedulingDisabled after the first reboot

wilmardo commented 6 months ago

Sometimes we see this behavior with kured: a node gets rebooted, but for some reason kubelet doesn't come up cleanly and the node stays NotReady and SchedulingDisabled. Most of the time we just reboot the node ourselves, after which it becomes Ready and kured continues with the other nodes.

Since the node is NotReady and still SchedulingDisabled, it wouldn't hurt to simply retry the reboot and check whether the node then becomes Ready. There should be a configurable amount of time before a NotReady node is considered stuck, the number of reboot retries should be configurable as well, and the whole feature should be disabled by default.

So in pseudocode something like this:

if nodeNotReady && nodeSchedulingDisabled
  if nodeRebootRetry.enabled && nodeNotReady for thresholdSeconds && retry < retryThreshold
    reboot()
    retry++

So when a node doesn't become Ready within the configured threshold after a reboot, just reboot the node again.

Would this be something to consider, and is it possible to implement in kured? I get that it is somewhat of a hack to just reboot instead of finding the root cause, but this scene exists for a reason ;)

wilmardo commented 6 months ago

If this is considered, I might take a stab at implementing it, though I am unfamiliar with the codebase at the moment. So if this is deemed worthwhile, I would love some pointers on where to put which logic. That would (hopefully) be enough to get me started on a PR.

ckotzbauer commented 5 months ago

Hi @wilmardo, I understand your use case, and of course it would be possible to implement this in general. However, I would prefer not to implement this logic, as it makes more sense to me to find the root cause instead.

github-actions[bot] commented 3 months ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

kubereboot / kured

Retry reboot when the node is NotReady and still has SchedulingDisabled after the first reboot #886