kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Maintenance sentinel #314

Closed m4r1u2 closed 3 years ago

m4r1u2 commented 3 years ago

I would love to have an additional sentinel where kured cordons & drains a node (and uncordons it afterwards).

The additional sentinel file could be: /var/run/kured-maintenance ?

Use case: when running attended upgrades or performing other tasks, it is recommended to drain the node before starting. The file must be created manually on the node (or by a script or program) and cleaned up at the end. The cleanup will then trigger an uncordon of the node by kured (a rough script sketch follows the numbered steps below).

  1. touch /var/run/kured-maintenance
  2. kured triggers a cordon + drain of the node
  3. do the attended maintenance
  4. rm /var/run/kured-maintenance
  5. kured triggers an uncordon of the node
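
A rough sketch of how a host-side script might drive that flow is below (the wait-for-cordon check and $NODE_NAME are assumptions for illustration; kured does not support such a sentinel today):

#!/bin/bash
# Hypothetical flow for the proposed maintenance sentinel.
# Assumes kured would cordon + drain this node while the file exists.

touch /var/run/kured-maintenance

# Wait until the node is actually cordoned before starting maintenance.
# $NODE_NAME and kubectl access from the host are assumptions here.
until kubectl get node "$NODE_NAME" -o jsonpath='{.spec.unschedulable}' | grep -q true; do
    sleep 5
done

# ... do the attended maintenance here ...

# Removing the sentinel would then trigger kured to uncordon the node.
rm /var/run/kured-maintenance
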
jackfrancis commented 3 years ago

Hi @m4r1u2, I'm curious what kured maintenance tasks you think are appropriate for #3 ("do the attended maintenance"). Is it something other than a reboot?

m4r1u2 commented 3 years ago

This is not meant for maintenance of kured itself, but for tasks done on the k8s node. One example could be upgrading docker to a new version, debugging kernel/network issues on the host without any load, or just making it easy to drain a node from the host itself (without using kubectl).

jackfrancis commented 3 years ago

Thanks for clarifying, that makes sense.

My 2¢: this adds functionality and responsibility beyond what kured has historically done. The title of the project is "kured - The Kubernetes Reboot Daemon", so having kured do convenient cordon + drain is sort of orthogonal to that mission.

@dholbach should we solicit some additional opinions and thoughts here?

Another problem I see with this approach is that the OS file system is not really the best deterministic vector to initiate a node's cordon + drain. By marking a sentinel file and then relying upon a continuous process running as a privileged daemonset, you would also have to wait for that daemonset's container process to successfully cordon + drain before continuing with maintenance (cordon + drain isn't guaranteed to be successful and/or graceful). So you'd probably need yet another sentinel file, like /var/run/cordoned-drained or something, which indicates to the underlying OS that indeed it is not an active node participating in the cluster.

I think the cleanest solution to this problem would be a generic "cordon + drain" daemonset (so, another tool, not kured) that is responsible for doing cordon + drain reconciliation, and then marking the OS so that maintenance tasks can be safely performed without impact to the cluster. Imagine a tool that does the following:

  1. watches the node for a sentinel file (e.g., /var/run/cordon-drain-required)
  2. cordons + drains the node when that file appears
  3. marks the OS (e.g., /var/run/k8s-node-maintenance) once the drain has succeeded
  4. uncordons the node and cleans up the markers when asked to (e.g., via /var/run/uncordon-required)

With a tool that exhibits the behaviors above, you could then write maintenance scripts like the one below:

#!/bin/bash
# this script assumes we are running as root

touch /var/run/cordon-drain-required

while true; do
    # if I'm in maintenance mode then that means I'm not actively in the cluster
    if [ -f /var/run/k8s-node-maintenance ]; then
        # maintenance tasks here
        break
    else
        sleep 5
    fi
done

# if we get here then maintenance was successful
# mark the node as no longer in maintenance mode
rm -f /var/run/k8s-node-maintenance
# request the node re-join the cluster if the maintenance doesn't require a reboot
if [ ! -f /var/run/reboot-required ]; then
    touch /var/run/uncordon-required
fi
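
For contrast, the daemonset side of such a tool could look roughly like the loop below. All of the sentinel paths and the reconciliation logic are assumptions about the hypothetical tool, not existing kured behavior; it would run privileged with the host's /var/run mounted and $NODE_NAME injected:

#!/bin/bash
# Hypothetical reconciliation loop for the generic cordon + drain daemonset.

while true; do
    if [ -f /var/run/cordon-drain-required ] && [ ! -f /var/run/k8s-node-maintenance ]; then
        kubectl cordon "$NODE_NAME"
        if kubectl drain "$NODE_NAME" --ignore-daemonsets; then
            # signal the host that it is safe to start maintenance
            touch /var/run/k8s-node-maintenance
            rm -f /var/run/cordon-drain-required
        fi
    fi
    if [ -f /var/run/uncordon-required ]; then
        kubectl uncordon "$NODE_NAME"
        rm -f /var/run/uncordon-required
    fi
    sleep 10
done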

@Michael-Sinz in case you have any thoughts on doing something like the above

@devigned @CecileRobertMichon just curious, is there a canonical cluster-api-ish way of doing the above already?

Michael-Sinz commented 3 years ago

The whole question of who "owns" a node's cordon state (and drain behavior - they are separate, but drain happens under cordon) is one that has plagued kubernetes administration for a while. We have seen other pods/services that take it upon themselves to cordon and uncordon nodes due to other signals (for example Azure infrastructure maintenance notifications), and that then interacts poorly with some other need to cordon a node, as one pod undoes what another pod is intending.

Having some universal solution to this such that we can address the multi-master problem of different entities thinking it is time to cordon/drain or uncordon nodes would be great. I don't know if writing to the host filesystem is the right place to do that.

I would expect that we really want this to be an intention recorded on the node object in kubernetes, and that it may get set by various things.
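
One hedged sketch of recording that intent on the node object itself rather than on the host filesystem (the annotation and taint keys below are purely illustrative):

# Record the maintenance intent on the node object (illustrative key).
kubectl annotate node "$NODE_NAME" maintenance.example.com/requested="$(date -u +%FT%TZ)"

# A single agent that owns cordon/drain could reconcile on that annotation;
# a taint could keep new workloads off while the intent is pending.
kubectl taint node "$NODE_NAME" maintenance.example.com/pending=true:NoSchedule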

The work on the node itself (below kubernetes) would then, as Jack showed, want some standard way of knowing when it can or cannot move forward. I tend to prefer systems with active locks - that is, you have to actively prevent the work from happening, so that the failure mode is that maintenance will run (after all, the maintenance may be what is required to get back to operational readiness).

As Jack also correctly points out, drain may fail if pods on the node are in a critical state (outside of their pod disruption budget). In those conditions, you can only choose to loop back later and try again, or to force-evict the critical pods, which may have service impacts. The second option is useful if some pod has been preventing a drain from completing for a while and you really must complete the reboot/maintenance process (security patch, etc.), but it is very much a path of last resort.
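
In kubectl terms, that escalation might look something like the sketch below. The retry budget is arbitrary, $NODE_NAME is an assumption, and --disable-eviction bypasses PodDisruptionBudgets, which is exactly why it belongs at the very end:

# Try a graceful, PDB-respecting drain a few times before escalating.
drained=false
for attempt in 1 2 3; do
    if kubectl drain "$NODE_NAME" --ignore-daemonsets --timeout=10m; then
        drained=true
        break
    fi
    sleep 60
done

if [ "$drained" = "false" ]; then
    # Last resort: bypass the eviction API (and therefore PDBs) because the
    # reboot/maintenance must complete. This can have service impact.
    kubectl drain "$NODE_NAME" --ignore-daemonsets --disable-eviction --force
fi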

We (my team) have worked hard to minimize the number of ways maintenance is done at the node level, so that we get fewer multi-master problems, and we generally push towards making all real maintenance part of the reboot cycle, as that addresses the concern of multiple different maintenance needs pending at the same time (which does happen).

This has reduced our places for cordon/drain to a few points: the auto-scaler scaling down nodes, the rebooter doing node reboots for maintenance reasons, and the zombie-killer, which tries to resolve nodes that have lost forward progress. (Usually that is just a cordon/drain to let kubernetes know that the node is unhealthy in a way it may not have noticed yet.)

Even so, that still leaves us with three, and I don't know how to lower that number. (The zombie-killer could be seen as a way to take an unhealthy node out, so one could call that "like" a scale-down from the auto-scaler, but it is still its own agent.)

m4r1u2 commented 3 years ago

Nice comments and thanks for the feedback.

I totally understand your concerns about adding this feature to kured, but I had to try :) I haven't found any good alternatives or best practices on the topic.

Michael-Sinz commented 3 years ago

It is a "hard" topic as to who owns the "platform" upon which kubernetes runs. It is sort of abstract but the infrastructure is important as it provides the basis for kubernetes to do its work. Changes below kubernetes really are about updating that infrastructure. How to then make sure any assumptions/invariants are restored in the kubernetes space is the next tricky part.

An "easy" pattern is to assume that you have to reboot/restart most anything as you don't know the transitive closure of the dependencies that may have been impacted by the maintenance that was done. If you do know that exactly you could avoid the reboot by just restarting those elements involved.

We went down the route that changes to the invariants below kubernetes are best done via a graceful reboot. We place our servicing work either in the boot process or the shutdown process (usually in the boot process) and just signal the system to do the reboot when able.

This simplifies the process and keeps things more stable overall (plus, reboots happen for other reasons anyway, so that handling path needs to be solid regardless, and leveraging it is good).
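
A minimal sketch of that pattern, assuming a hypothetical /usr/local/sbin/node-maintenance.sh that holds the actual servicing work, and using kured's default /var/run/reboot-required sentinel to request the graceful reboot:

#!/bin/bash
# Stage the servicing work to run early in the next boot, then ask kured for
# a graceful reboot (cordon, drain, reboot, uncordon) via its default sentinel.

cat <<'EOF' > /etc/systemd/system/node-maintenance.service
[Unit]
Description=One-shot node maintenance at boot
Before=kubelet.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/node-maintenance.sh

[Install]
WantedBy=multi-user.target
EOF

systemctl enable node-maintenance.service
touch /var/run/reboot-required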

jackfrancis commented 3 years ago

@m4r1u2, last week I bootstrapped a prototype solution (initially heavily re-using kured concepts + code) here:

https://github.com/jackfrancis/kustodian

I wanted to do so to get a feel for whether or not incorporating it into kured (like, maybe making kured more generic and not reboot-specific) would be preferable.

My conclusions now are much the same: I really like that kured follows the unix principle of doing one thing well.

cc @bboreham who expressed interest in node maintenance leasing in another thread

I would really like to find that something like this is already solved, community-backed, and has production users. It seems like something this fundamental should already have a generic solution. But in case it doesn't, the above is sort of my current best effort at how we might generically solve this problem as a k8s community.

m4r1u2 commented 3 years ago

Very cool @jackfrancis. I will do some testing this week

github-actions[bot] commented 3 years ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).