siderolabs / cluster-api-bootstrap-provider-talos

A cluster-api bootstrap provider for deploying Talos clusters.
https://www.talos-systems.com
Mozilla Public License 2.0

Talos reset lifecycle hook #163

Closed · Preisschild closed this 1 year ago

Preisschild commented 1 year ago

Fixes: https://github.com/siderolabs/cluster-api-bootstrap-provider-talos/issues/159

This feature makes use of the CAPI pre-terminate hook, which is implemented upstream in CAPI.

The hook simply waits until all annotations prefixed with `pre-terminate.delete.hook.machine.cluster.x-k8s.io` are removed before it allows the Machine's infrastructure to be deleted by the infrastructure provider (i.e., before the VM is removed from the cloud provider).
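For illustration only (this is not the PR's actual code): a provider could set such an annotation before deletion starts and clear it once the Talos reset finishes. A minimal controller-runtime sketch, where the `talos-reset` hook name and both helper functions are hypothetical:

```go
package main

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical hook name; CAPI only requires the annotation key to carry
// the pre-terminate.delete.hook.machine.cluster.x-k8s.io prefix.
const resetHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/talos-reset"

// addResetHook marks the Machine so that CAPI pauses deletion until the
// hook annotation is removed again.
func addResetHook(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[resetHook] = ""
	return c.Patch(ctx, m, patch)
}

// clearResetHook removes the annotation once the Talos reset has completed,
// letting CAPI hand the Machine over to the infrastructure provider for deletion.
func clearResetHook(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	delete(m.Annotations, resetHook)
	return c.Patch(ctx, m, patch)
}
```

CAPI treats any annotation with that prefix as a pending hook, so the Machine controller holds off infrastructure deletion until something like `clearResetHook` runs.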

This PR does the following:

smira commented 1 year ago

We discussed this change internally, and even though it definitely resolves a real issue, we feel it might not be the right way:

- This whole issue is cluster-wide orchestration, which can be done outside of the CAPI scope in a separate controller.
- One could watch Cluster and Machine resources, react to changes, and perform the following reconciliation:

Preisschild commented 1 year ago

> CACPPT controls the etcd leave process, ensuring that the cp machine leaves etcd; `talosctl reset` running concurrently also does etcd leave, and might lead to some surprises

Fortunately, this isn't an issue. CACPPT removes the node from etcd before a deletionTimestamp is set, and thus before the reset request is sent. I've been using this for a few months now and haven't had any issues yet.

But yeah, I understand the rest. Maybe CAPI will provide a standard way to handle bootstrap-provider-specific cleanup tasks in the future.

smira commented 1 year ago

> Fortunately, this isn't an issue. CACPPT removes the node from etcd before a deletionTimestamp is set, and thus before the reset request is sent. I've been using this for a few months now and haven't had any issues yet.

This was actually the wrong order, and we fixed it :) the fix is coming in the next release.

Another issue is the node being down or inaccessible during termination... Should we wait? Should we not? Should we block machine deletion?

An external controller can do cleanup independently of the machine state.
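As a rough sketch of what such an external controller could look like (assuming controller-runtime; `machineCleanupReconciler` and `resetNode` are hypothetical names, and the real cleanup would call the Talos API):

```go
package main

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineCleanupReconciler watches Machine resources and runs Talos-specific
// cleanup outside of the CAPI bootstrap provider.
type machineCleanupReconciler struct {
	client.Client
}

func (r *machineCleanupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var machine clusterv1.Machine
	if err := r.Get(ctx, req.NamespacedName, &machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only machines that are being deleted need cleanup; a controller like
	// this can act regardless of whether the node is still reachable.
	if machine.DeletionTimestamp.IsZero() {
		return ctrl.Result{}, nil
	}

	// Hypothetical placeholder: a real implementation would call the Talos
	// machine API to reset the node backing this Machine.
	if err := resetNode(ctx, &machine); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

func (r *machineCleanupReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		Complete(r)
}

// resetNode is a stub standing in for the actual Talos reset call.
func resetNode(ctx context.Context, machine *clusterv1.Machine) error {
	_, _ = ctx, machine
	return nil
}
```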