Cluster checks and action hooks

Ulexus commented 2 years ago

Preface

Before reconstructing our upgrade operator, we need to build in better tooling to ensure that upgrades will not occur when user workloads are in an inconsistent state. Even for manual upgrades, this is a frequently needed task which has hitherto been done by hand.

Cluster health

Various operations we perform while maintaining a Talos cluster should rely on automated checks of arbitrary user-supplied workloads, in order to assure the end user that an upgrade will not break those workloads. This will allow us to provide more reliable automated systems (upgrades, reboots, resets, etc.) plugged into both the CLI and Sidero.

Health checks

CAPI's MachineHealthChecks already exist, and we should absolutely utilise them. However, they do not go far enough, nor are they flexible enough for the types of checks and operations we need to perform from within the cluster.

For Talos, we should expand this to also have ClusterHealthChecks. These checks (along with MachineHealthChecks) would be run before any cluster-affecting action is performed.

Such actions include:

reboots
resets
shutdowns
upgrades
machine config changes

NOTE Obviously, the user should have the option to ignore the checks and force the commands, since those same commands are likely to be required to restore cluster health.

Action Hooks

In addition to health checks, there should also be action hooks.

An easy use case reference for these hooks is Rook/Ceph management:

Pre-upgrade action hook: disable OSD migration
Pre-reboot action hook: disable OSD migration (different upgrade/reboot hooks for user use-case and preference)
Pre-removal action hook: out-and-then-delete OSDs on a given node
Node-ready action hook: re-enable OSD migration

The associated actions will be blocked until any linked Job completes successfully.

NOTE An administrator should be able to bypass action hooks when performing a given action, as well.

`PreUpgradeJob.talos.dev`

Defines a Job which will be executed before a node upgrade is performed.

`PreRebootJob.talos.dev`

Defines a Job which will be executed before a node reboot is performed.

`PreShutdownJob.talos.dev`

Defines a Job which will be executed before a node shutdown is performed.

`NodePrereadyJob.talos.dev`

Defines a Job which will be executed before a node reports itself as Ready.

`NodeReadyJob.talos.dev`

Defines a Job which will be executed after a node reports itself as Ready.
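
To make the relationship between node events and the five resources above explicit, here is a small Go lookup table. The resource names come from the list above; the `Event` and `Phase` types and the `HookKind` function are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// Event is a node lifecycle event (hypothetical type).
type Event string

// Phase says whether a hook runs before or after the event.
type Phase string

const (
	EventUpgrade  Event = "upgrade"
	EventReboot   Event = "reboot"
	EventShutdown Event = "shutdown"
	EventReady    Event = "ready"

	Before Phase = "before"
	After  Phase = "after"
)

type hookKey struct {
	Event Event
	Phase Phase
}

// hookKinds maps each (event, phase) pair to the resource proposed above.
var hookKinds = map[hookKey]string{
	{EventUpgrade, Before}:  "PreUpgradeJob.talos.dev",
	{EventReboot, Before}:   "PreRebootJob.talos.dev",
	{EventShutdown, Before}: "PreShutdownJob.talos.dev",
	{EventReady, Before}:    "NodePrereadyJob.talos.dev",
	{EventReady, After}:     "NodeReadyJob.talos.dev",
}

// HookKind returns the resource name for an event and phase, or an error
// if the proposal defines no hook for that combination.
func HookKind(e Event, p Phase) (string, error) {
	kind, ok := hookKinds[hookKey{e, p}]
	if !ok {
		return "", fmt.Errorf("no hook defined %s %s", p, e)
	}

	return kind, nil
}
```

Note that only the node-ready event has both a before and an after hook in this proposal; the other events have pre-hooks only.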

Ulexus commented 2 years ago

An obvious question is: why do we need both checks and action hooks?

A check is non-blocking and stroboscopic, while a hook is blocking and potentially long-lived.

Checks are intended to be entirely impact-free, so they can be served via regular HTTP probes without much concern about authentication, access control, and the like.

Hooks are intended to be consequential. They should ideally be idempotent, but they are executed at the time of specific events, and they could potentially perform any number of operations internally. As such, they need more restriction and nuance in their invocation.

Lastly, health checks are more general purpose. They are not bound to any particular event, which makes them more flexible and usable for other things.

smira commented 2 years ago

I wonder if we could make it more extensible, so that the core hooks are very simple and called by Talos before the action is taken, while the actual implementation (a no-op or some real action) is handled by a controller running in the cluster.

Just as an example: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20200602-machine-deletion-phase-hooks.md

btw, I believe MachineHealthChecks live in the management cluster, not in the workload cluster. And we probably need something in the workload cluster to keep it generic (independent of CAPI).

certainly :+1: on the idea

Ulexus commented 2 years ago

I think we should align the hook names to that proposal; it seems sound. I'm not clear on what you mean in terms of simplification, though. Certainly, too, the Kubernetes-side resource monitoring and the implementation of checks should be done within a Kubernetes-side controller, with which Talos would communicate using a relatively simple protocol.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been stalled for 7 days with no activity.
