siderolabs / cluster-api-bootstrap-provider-talos

A cluster-api bootstrap provider for deploying Talos clusters.
https://www.talos-systems.com
Mozilla Public License 2.0

Talos reset lifecycle hook #163

Closed · Preisschild closed this 1 year ago

Preisschild commented 1 year ago

Fixes: https://github.com/siderolabs/cluster-api-bootstrap-provider-talos/issues/159

This feature makes use of the CAPI pre-terminate hook, which is implemented upstream in CAPI.

The hook simply waits until all annotations prefixed with `pre-terminate.delete.hook.machine.cluster.x-k8s.io` are removed before it allows the Machine's infrastructure to be deleted by the infrastructure provider (i.e., before the VM is removed from the cloud provider).
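For illustration only (this is not the PR's actual code): a provider could set such an annotation before deletion starts and clear it once the Talos reset finishes. A minimal controller-runtime sketch, where the `talos-reset` hook name and both helper functions are hypothetical:

```go
package main

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical hook name; CAPI only requires the annotation key to carry
// the pre-terminate.delete.hook.machine.cluster.x-k8s.io prefix.
const resetHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/talos-reset"

// addResetHook marks the Machine so that CAPI pauses deletion until the
// hook annotation is removed again.
func addResetHook(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[resetHook] = ""
	return c.Patch(ctx, m, patch)
}

// clearResetHook removes the annotation once the Talos reset has completed,
// letting CAPI hand the Machine over to the infrastructure provider for deletion.
func clearResetHook(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	delete(m.Annotations, resetHook)
	return c.Patch(ctx, m, patch)
}
```

CAPI treats any annotation with that prefix as a pending hook, so the Machine controller holds off infrastructure deletion until something like `clearResetHook` runs.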

This PR does the following:

smira commented 1 year ago

We discussed this change internally, and even though it definitely resolves a real issue, we feel it might not be the right way:

- This whole issue is cluster-wide orchestration, which can be done outside of the CAPI scope in a separate controller.
- One could watch Cluster and Machine resources, react to changes, and perform the following reconciliation:

Preisschild commented 1 year ago

> CACPPT controls the etcd leave process, ensuring that the cp machine leaves etcd; `talosctl reset` running concurrently also does etcd leave, and might lead to some surprises

Fortunately, this isn't an issue. CACPPT removes the node from etcd before a deletionTimestamp is set, and thus before the reset request is sent. I've been using this for a few months now and haven't had any issues yet.

But yeah, I understand the rest. Maybe CAPI will provide a standard way to handle bootstrap-provider-specific cleanup tasks in the future.

smira commented 1 year ago

> Fortunately, this isn't an issue. CACPPT removes the node from etcd before a deletionTimestamp is set, and thus before the reset request is sent. I've been using this for a few months now and haven't had any issues yet.

This was actually the wrong order, and we fixed it :) the fix is coming in the next release.

Another issue is the node being down or inaccessible during termination... Should we wait? Should we not? Should we block machine deletion?

An external controller can do cleanup independently of the machine state.
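As a rough sketch of what such an external controller could look like (assuming controller-runtime; `machineCleanupReconciler` and `resetNode` are hypothetical names, and the real cleanup would call the Talos API):

```go
package main

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineCleanupReconciler watches Machine resources and runs Talos-specific
// cleanup outside of the CAPI bootstrap provider.
type machineCleanupReconciler struct {
	client.Client
}

func (r *machineCleanupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var machine clusterv1.Machine
	if err := r.Get(ctx, req.NamespacedName, &machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only machines that are being deleted need cleanup; a controller like
	// this can act regardless of whether the node is still reachable.
	if machine.DeletionTimestamp.IsZero() {
		return ctrl.Result{}, nil
	}

	// Hypothetical placeholder: a real implementation would call the Talos
	// machine API to reset the node backing this Machine.
	if err := resetNode(ctx, &machine); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

func (r *machineCleanupReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		Complete(r)
}

// resetNode is a stub standing in for the actual Talos reset call.
func resetNode(ctx context.Context, machine *clusterv1.Machine) error {
	_, _ = ctx, machine
	return nil
}
```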