Closed wilmardo closed 2 months ago
If this is considered worthwhile, I might take a stab at implementing it. I am unfamiliar with the codebase at the moment, so I would love some pointers on where to put what logic. That would (hopefully) be enough to get me started on a PR.
Hi @wilmardo, I understand your use case, and of course it would be possible to implement this in general. However, I would prefer not to implement this logic, as it would make more sense to me to find the root cause instead.
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
Sometimes we see this behavior with kured: a node gets rebooted, but for some reason kubelet doesn't come up nicely and the node stays in `NotReady` and `SchedulingDisabled`. Most of the time we just reboot the node ourselves, the node becomes `Ready`, and kured continues on with the other nodes.

Since the node is `NotReady` and still in `SchedulingDisabled`, it wouldn't hurt to simply retry the reboot and check whether the node then becomes `Ready`. There should be a configurable time before `NotReady` is considered stuck, and the number of reboot retries should also be configurable and disabled by default. So, in pseudocode, something like this:
So when a node doesn't become `Ready` within the configured threshold after its reboot, just reboot it again.

Would this be something to consider, and is it possible to implement in kured? I get that it is somewhat hacky to just reboot instead of finding the root cause, but this scenario exists for a reason ;)