kubereboot / kured

Kubernetes Reboot Daemon
https://kured.dev
Apache License 2.0

Redesign idea: "Central nervous system for Kured" #359

Open evrardjp opened 3 years ago

evrardjp commented 3 years ago

Hello everyone.

I guess this was supposed to be a blog post, but I will make it an issue instead. We can discuss it better here.

Intro

Our lovely kured works well for many. It's simple, and I would like to keep it that way.

Yet, we recently had more and more special cases, which require building a bit more complexity around our code. Slowly, we'll increase our complexity and reduce our readability.

On top of this, we have seen PRs and issues in the past mentioning shortcomings of our current architecture, for flexibility or security reasons.

I would like us to think about what "another kured" (for more complex cases) could be. Here are my thoughts...

Separate the kured daemonset into components

Instead of having a single daemonset doing everything, I would split it into multiple components:

The Brain

The brain would be based on the current code from rebootAsRequired. Instead of locking on the kured daemonset, it would lock on a separate object, like a config map. The rest would work like the current model: checking for any reboot blockers, draining the node (which requires patch-node permissions), and triggering a reboot of one or multiple nodes (depending on the "Nodes Maintenance Window"). Because the reboot wouldn't be local anymore, the brain would create a job, scheduled on the relevant node, to trigger the reboot; that job could be described in the "Reboot Procedure".
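As a hedged sketch, here is what such a reboot job could look like. The namespace, image, and reboot command are assumptions for illustration; the real command would come from the user-defined "Reboot Procedure":

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: reboot-worker-1        # hypothetical name, one job per target node
  namespace: kube-system
spec:
  template:
    spec:
      nodeName: worker-1       # pin the job to the node that must reboot
      hostPID: true            # the reboot step itself still needs host access
      restartPolicy: Never
      containers:
      - name: reboot
        image: alpine:3.19     # placeholder image
        securityContext:
          privileged: true
        # enter PID 1's mount namespace and ask systemd to reboot the host
        command: ["nsenter", "-m", "-t", "1", "--", "systemctl", "reboot"]
```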

Configuration of the behaviour of the brain would be done by editing custom resources and/or config maps.
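For instance, a shared config map could carry the behaviour settings; all keys below are hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kured-config                     # hypothetical; could double as the lock object
  namespace: kube-system
data:
  period: "1h"                           # check period, read by both brain and detector
  rebootSentinel: "/kured/reboot-required"
  blockingPodSelector: "app=critical"    # reboot blockers, as today
```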

The "Node reboot detector"

This daemonset will detect whether the node needs rebooting, report it back to the "brain", and optionally expose that state as a metric to Prometheus. The latter is similar to our maintainRebootRequiredMetric. We can keep a consistent check period between the brain and the detector by making sure both of them read the period from the same config map/CR. As alternatives to reporting back to the brain directly, the brain could fetch the node information exposed with the rebootRequiredMetric, or the reboot detector could PATCH its node with the relevant information (maybe by relying on the Node admission controller and mounting the kubelet's credentials). TBD.
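If the node PATCH route were chosen, the payload could be as small as an annotation. A sketch of such a strategic-merge patch body, with a hypothetical annotation key:

```yaml
# Hypothetical patch body the detector could send to its own Node object
# (e.g. via PATCH /api/v1/nodes/<node-name>):
metadata:
  annotations:
    kured.dev/reboot-required: "true"
```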

The daemonset is likely to be a single container, which will check for the presence of a file at an expected location, /kured/reboot-required. This only requires read access to that volume. No need for privileged mode or hostPID.

For the basic Ubuntu case, this is the same as having our daemonset with a volume mapping /var/run to /kured/ in read-only mode.
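A minimal sketch of such a detector daemonset under these assumptions (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kured-detector             # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kured-detector
  template:
    metadata:
      labels:
        app: kured-detector
    spec:
      containers:
      - name: detector
        image: kured/detector:latest   # placeholder image
        volumeMounts:
        - name: sentinel
          mountPath: /kured
          readOnly: true               # read access only; no privileged, no hostPID
      volumes:
      - name: sentinel
        hostPath:
          path: /var/run               # where Ubuntu drops reboot-required
          type: Directory
```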

For custom workloads, a user could define their own daemonset containers in addition to our "standard" one. The standard container would still check for the presence of the file `/kured/reboot-required`. However, in the custom case, that file could be mounted from a volume that's reachable from another container in the daemonset. That other container would implement its own logic to detect whether a reboot is required.
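In that custom case, the sentinel volume could for instance be an emptyDir shared between the standard container and a user-provided sidecar. A sketch, as a pod template fragment; all names and the check command are hypothetical:

```yaml
# Pod template fragment: the custom sidecar writes the sentinel file
# into a shared emptyDir that the standard container watches.
spec:
  containers:
  - name: detector                      # the standard kured container
    image: kured/detector:latest        # placeholder
    volumeMounts:
    - name: sentinel
      mountPath: /kured
      readOnly: true
  - name: custom-check                  # user-provided detection logic
    image: example.com/my-check:1.0     # placeholder
    command:
    - /bin/sh
    - -c
    # my-check is a hypothetical command deciding whether a reboot is needed
    - "while true; do my-check && touch /kured/reboot-required; sleep 3600; done"
    volumeMounts:
    - name: sentinel
      mountPath: /kured
  volumes:
  - name: sentinel
    emptyDir: {}
```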

What do we gain?

Flexibility

The node detector structure allows users to implement their own detection systems without any code change in kured. The brain architecture allows users to implement their own reboot procedures without any code change in kured, and would allow more flexible maintenance windows.

Simplicity

The detector is now simpler, as it just watches a single file on a read-only file system, with no extra privileges. The brain simplifies some code, as coordination features like throttling will be easier to implement. However, we will have a more complex overall architecture, with an operator.

Security

We can isolate components better, so we can tighten the access controls.

The detector daemonset would only need access to:

- the read-only volume holding the sentinel file (/kured/reboot-required);
- whichever reporting channel is chosen (a metrics endpoint, or a PATCH on its own node).

The brain would need access to:

- the lock object (config map or custom resource);
- nodes, to read their state, cordon, and drain them;
- pods and evictions, for the drain itself;
- jobs, to schedule the reboot procedure on the target node;
- its configuration (custom resources and/or config maps).
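Expressed as RBAC, the brain's permissions could look roughly like this; the role name and the exact verb set are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kured-brain                            # hypothetical
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "patch"]     # cordon/drain requires patching nodes
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]                       # find the pods to evict during drain
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]                            # evictions during drain
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "update"]                     # the lock and the configuration
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "delete"]           # the per-node reboot jobs
```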

Note: With the reboot procedure outside the kured code, we can think of less privileged access: using an API to reboot a node, or using different reboot methods. This would also allow using seccomp profiles in multiple places!

Scaling operations

By adding a NodesMaintenanceWindow CRD, a user can define multiple maintenances, including exceptional ones, by adding a new "Maintenance Window". This will be read by the brain when necessary. Creating a custom resource will allow us to add new fields, like "Serialization", which would define, for example, how many nodes may be rebooted at the same time.
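Such a resource could look roughly like this; the API group, version, and all field names are made up for illustration:

```yaml
apiVersion: kured.dev/v1alpha1        # hypothetical group/version
kind: NodesMaintenanceWindow
metadata:
  name: weekly-maintenance
spec:
  schedule: "Sat 02:00-06:00"         # when reboots are allowed
  nodeSelector:
    kubernetes.io/os: linux           # which nodes this window applies to
  serialization: 2                    # how many nodes may reboot at the same time
```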

evrardjp commented 3 years ago

WDYT of this?

evrardjp commented 3 years ago

@kfox1111 ^

kfox1111 commented 3 years ago

Sounds good to me.

I ended up writing part of something similar to this here: https://github.com/pnnl-miscscripts/miscscripts/tree/master/charts/charts/kubeupdater

Where it is just the node side of things: it labels the node when it needs an upgrade, and accepts a label to upgrade/reboot the node after it has been drained. It unlabels the node when it's upgraded/rebooted ok. It's very particular to our node image though.

I intended to write something like the brain eventually but haven't gotten around to it. We're just doing that part manually for now.

So, yeah. I really like the idea of separating the components out like you describe.

evrardjp commented 3 years ago

I didn't determine the interface for announcing the reboot to the brain. The easiest is to annotate a node, if you already know where the kubeconfig is. I didn't like writing to etcd for such a small thing, so I wanted an alternative to writing status to a cm/CR/node annotation. I realised I could RPC to the brain, but then it gets flooded regularly at large scale. I thought of exposing a metric in the daemonset pod and scraping it with a prom client, but this is not easily done with a service on a ds. So the best alternative I found is for the brain to know the ds, fetch all its pods, and query those one by one. Which led me to "why don't we ask the user to provide the kubeconfig", it's so much easier... (or guess it from an init) 🙂

evrardjp commented 3 years ago

For some edge cases, we might even think that the node reboot procedure should have a flag allowing the brain to NOT drain the node, for emergencies.

evrardjp commented 3 years ago

Current problem with the brain in this form: we now need the kubelet to always be available in order to schedule pods onto the node for a reboot. Kured can't be used for emergency reboots anymore.

bboreham commented 3 years ago

Broadly I like this. I don't like to increase the number of moving parts, but some features, e.g. prioritising nodes or rebooting N in parallel, are too hard to implement with all nodes autonomous.

One thing though: hasn't someone done this already? I'm sure I have discussed Kured in comparison to some other tool that had a central operator to manage reboots.

evrardjp commented 3 years ago

https://github.com/jamiehannaford/coreos-reboot-operator is archived, so is the tool superseding it ... Maybe there are others?

I like the idea of reuse. Any ideas?

bboreham commented 3 years ago

That does look very like the one I was thinking of, thanks. Looks like a more recently updated fork is at https://github.com/pantheon-systems/cos-update-operator

Did anyone ask in the Kubernetes Slack channels? Might get some more prior art there.

bboreham commented 3 years ago

I asked: https://kubernetes.slack.com/archives/C0BP8PW9G/p1618835934220900

evrardjp commented 3 years ago

Awesome, thanks @bboreham

github-actions[bot] commented 3 years ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

evrardjp commented 3 years ago

We have no news from the SIG node front. I didn't ask for more details there.

rptaylor commented 3 years ago

In this proposal the brain would launch jobs to reboot nodes. Along the lines of #416, if there is a significant redesign, to me (as a security-conscious cluster operator) it makes sense to separate the rebooting logic (brain/detector) from the reboot executor (which actually does the reboot) as much as possible. The logic entails code (which carries the risk of bugs, supply chain attacks, and a generally larger attack surface) but should ideally require zero privileges to run. The reboot executor, on the other hand, entails elevated privileges, but should be as brainless as possible, asking only "should I reboot? yes/no". Ideally it would have just enough privilege to do one and only one thing, gracefully reboot the node with systemd, but that is challenging to enforce in k8s.

One idea for this would be to have a simple drop-in cron job on the nodes which just queries a kured component (brain/detector), possibly by curling a port, to see if the node should be rebooted. I realize some people may like an all-in-one Helm deployment, but from a cluster admin's perspective it is very appealing to have a very simple cron job that reboots the node if/when kured says it should, because then all elevated privileges would be contained in a very simple bash one-liner that is fully under my control, and kured could be (possibly aside from reboot detection) unprivileged. Update: a better iteration of this idea is https://github.com/kubereboot/kured/issues/416#issuecomment-1825047502

evrardjp commented 3 years ago

I don't agree with the verbiage above.

Practically, it makes sense to separate what runs the detection, what triggers a reboot, and what effectively reboots.

Your "executor" should not ask whether a reboot is required. This is because this can mostly be done without privileges, or with a different set of privileges than the actual rebooting .

What you are proposing with your cron job is not too far away from what I am proposing, with the disadvantage that you have to go through all nodes to adapt your cron job. You're right, we are paying the price of having no configuration management, agreed upon by the whole community, that works efficiently both outside and inside Kubernetes ;)

As a reminder, I am proposing the following:

- a "node reboot detector" daemonset, which only detects that a reboot is needed;
- a "brain", which coordinates locking, maintenance windows, and draining;
- a reboot procedure, run as a job, which users can override.

This would give far more flexibility to kured.

That would still work for your case, where you want to run everything with no privileges from a k8s perspective: the maintenance job would simply drop a script onto the host by mounting a certain folder, and that script (chmod-ed 555?) could then be executed by your cron job.

evrardjp commented 3 years ago

This would make the "next gen" basically a node maintenance tool, hence the conversation with the Kubernetes node SIG. It wouldn't be too far away from the tools above or from https://github.com/rancher/system-upgrade-controller

github-actions[bot] commented 2 years ago

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

sftim commented 11 months ago

Potentially also relevant: https://github.com/kubernetes/enhancements/issues/4212