
siderolabs / cluster-api-bootstrap-provider-talos

A cluster-api bootstrap provider for deploying Talos clusters.

https://www.talos-systems.com

Mozilla Public License 2.0


controller doesn't remove deleted nodes from discovery service #159

Closed: Preisschild closed this issue 7 months ago.

Preisschild commented 1 year ago

When nodes are removed (for example, during a rolling upgrade), they remain in the `members` and `kubespanpeerspecs` resources, so KubeSpan keeps trying to connect to them.

This PR doesn't seem to fix this behavior in CAPI.

smira commented 1 year ago

Due to the architecture of CAPI, there's no way for a Talos node to know it is going to be removed (this might work for controlplane nodes, but not for worker nodes). Dead members are cleaned up after 30 minutes. Dead KubeSpan peers should not cause issues as long as their IPs don't overlap with live nodes.

Preisschild commented 1 year ago

Ah, that's not great. Unfortunately, Hetzner Cloud often reassigns the same IP to a new node, which means those nodes often take ~30 minutes to become ready.

Not a major issue, but sure is annoying.

smira commented 1 year ago

One workaround (not a great one) is to call reset on the node being removed. I think CAPI provides a set of webhooks that could be used for that, but it's not an easy fix.

Preisschild commented 1 year ago

This might be possible using the pre-terminate hook.