omniproc opened this issue 8 months ago
We don't recommend hosting CAPI components in the cluster managed by the same CAPI setup. It is going to cause various issues.
@smira Thanks for the reply. Is that specifically mentioned somewhere in the docs? It's a setup supported by CAPI in general so what issues did you observe with it?
If your management cluster goes down for whatever reason, there's no easy way to recover. You can try this setup, but I would never recommend it.
Well, sure. But that's a general design flaw of CAPI. It's even worse than that, because https://github.com/kubernetes-sigs/cluster-api/issues/7061 exists and it doesn't seem like there will be a fix for it anytime soon.
You could still use `talosctl` to get the management cluster up and running again, couldn't you? Besides that: having an etcd backup and restore process is another, unrelated requirement for production systems, I'd argue.
I think the issue could be fixed by deleting the machine prior to `gracefulEtcdLeave`. The machine could be annotated with a CAPI `pre-terminate` lifecycle hook to block infraMachine deletion until `gracefulEtcdLeave()` is finished.
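For illustration, a minimal sketch of what such an annotation could look like. The `pre-terminate.delete.hook.machine.cluster.x-k8s.io/` prefix is the CAPI lifecycle-hook contract; the key suffix (`talos-etcd-leave`), the owner value, and the resource names are hypothetical:

```yaml
# Hypothetical sketch: a Machine carrying a CAPI pre-terminate hook.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-controlplane-0
  annotations:
    # While this annotation is present, CAPI pauses Machine deletion before
    # removing the InfraMachine; the owning controller would remove the
    # annotation once gracefulEtcdLeave() has completed.
    pre-terminate.delete.hook.machine.cluster.x-k8s.io/talos-etcd-leave: talos-controlplane-provider
spec:
  clusterName: my-cluster
```

With this in place, the node could leave etcd gracefully before the infrastructure provider tears it down.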
I can confirm that the issue seems to be exactly that: the controller is waiting for etcd to become healthy on 2 nodes (a single-control-plane scenario in this case), which is only the case for a very short time. If the controller happens to reconcile during exactly that window, the upgrade process continues. Otherwise it gets stuck waiting for two nodes to become healthy while the old one is already being shut down:
```
controllers.TalosControlPlane verifying etcd health on all nodes {"node": "old", "node": "new"}
controllers.TalosControlPlane rolling out control plane machines {"namespace": "default", "talosControlPlane": "xxx", "needRollout": ["new"]}
controllers.TalosControlPlane waiting for etcd to become healthy before scaling down
```
https://github.com/kubernetes-sigs/cluster-api/issues/2651
It seems that the Kubeadm control plane provider had the same issue, but they fixed it (by, as far as I understand, treating control-plane nodes where etcd was intentionally stopped as healthy, so that when the loop is triggered again, the machine gets deleted).
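The fix described above could be sketched roughly as follows. The types and function names here are made up for illustration; the real providers work against the Talos and etcd APIs:

```go
package main

import "fmt"

// member is a minimal stand-in for an etcd member as seen by the
// control-plane controller (hypothetical type).
type member struct {
	node         string
	healthy      bool
	beingDeleted bool // true when the owning Machine has a deletionTimestamp
}

// etcdHealthyForScaleDown sketches the idea behind the kubeadm provider's
// fix: members whose Machine is already being deleted are skipped instead
// of blocking the health check forever.
func etcdHealthyForScaleDown(members []member) bool {
	for _, m := range members {
		if m.beingDeleted {
			continue // an old node leaving etcd must not block the rollout
		}
		if !m.healthy {
			return false
		}
	}
	return true
}

func main() {
	// The stuck case from the logs above: the old node is already shutting
	// down (and thus unhealthy), the new node is healthy.
	ms := []member{
		{node: "old", healthy: false, beingDeleted: true},
		{node: "new", healthy: true},
	}
	fmt.Println(etcdHealthyForScaleDown(ms)) // prints "true"
}
```

With a check like this, the scale-down would proceed even when the controller reconciles after the old member has already become unhealthy.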
I noticed today that this problem occurs whenever the `capi-controller-manager` deployment in `capi-system` is restarted while a control plane rollout is in progress. It doesn't matter which workload cluster is being rolled out.
When doing a rolling update, under certain conditions the update will never finish.

Steps to reproduce:

- Trigger a rolling update of a `TalosControlPlane` resource

What happens:

- `TalosControlPlane` starts a rolling update by creating a new `Machine`
- The `Machine` is created by whatever infrastructure provider is used
- The `TalosControlPlane` resource is unable to scale down to 1 and never deletes the old `Machine` of the old control-plane node

How to solve the problem:

- Manually delete the `Machine` of the old control-plane node. The infrastructure provider in use will then handle the deletion of the node, and the `TalosControlPlane` resource will scale down to 1 and become ready again.

What should happen:

- The `TalosControlPlane` should delete the `Machine` resource of the old control-plane node.

Note: this issue only happens if two conditions are met: