kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0
2.17k stars 202 forks source link

Kured should verify if node is still drained when delaying reboot #929

Closed sebastiangaiser closed 1 month ago

sebastiangaiser commented 4 months ago

Hey,

we had a problem with two controllers draining/uncordon nodes. To make the reboot more graceful, we configured rebootDelay: 120s. During these 2 mins the other controller uncordened the node. Kured rebooted the not drained node afterwards which is not optimal :D So Kured should verify after the delay and right before the reboot is issued, if the node is still drained or not.

jackfrancis commented 4 months ago

I'm not sure there is a perfect solution for this. To a certain extent the "rebootDelay" functionality is unscientific: it makes no guarantees about outcomes after the delay. It seems this is generally well understood and folks use this feature as a "best-effort gesture" to better predict drain outcomes that may trail the immediate end of the drain process (though again, not guaranteed).

If there is another controller running on the system that observes cordoned nodes and uncordons them in response, then adding a follow-up drain gets us back to the original situation, where we may succeed in re-draining, but the workloads that are drained need a delay after in order to reboot gracefully; and now the other controller uncordons during that wait period; and so on and so on...

For this particular use case I think you want to disable that controller's power over kured-drained nodes (perhaps it can observe kured node annotations and skip uncordoning such nodes?), or consider not using this sleep option if there are multiple actors competing to cordon and uncordon nodes.

Are either of those options possible in your environment?

sebastiangaiser commented 4 months ago

The problem is definitely the other controller, not Kured. But despite the problem, how do you think about having this functionality as feature flag?

evrardjp commented 4 months ago

I agree to not have something specific in Kured, here the other controller is at fault. I don' t think kured should handle such special cases, this is code to maintain for nothing.

However, I think @sebastiangaiser has some kind of point: this behaviour is suboptimal.

Let's admit the drain/cordon is triggered (let's ignore the delay for the sake of the conversation). Does it make sense for kured to trigger the reboot of the node if (just before calling the reboot command) said node is not cordonned?

For me, there is something "weird" happening, and I would hope that kured would not reboot an uncordonned node. That should trigger a big error log somewhere.

What this means, in other words, is that kured is not responsible for what's outside his control, but expects a certain state before reboot. If you do mess it up (with an operator or a manual operation), that's your decision, but it should not allow kured to continue in that case.

The follow up question is whether this makes sense for everyone or gated behind a feature flag, which I would have a tendency to say feature flag, and do not change default behaviour.

evrardjp commented 4 months ago

@sebastiangaiser what do you think of the above?

sebastiangaiser commented 4 months ago

I'm also for the feature flag which can only be enabled when rebootDelay > 0

github-actions[bot] commented 2 months ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).