siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos
Mozilla Public License 2.0

Control Plane node isn't returning to servers pool after being replaced during a k8s upgrade #150

Closed nathandotleeathpe closed 1 year ago

nathandotleeathpe commented 1 year ago

I have 3 control plane nodes provisioned with Sidero Metal using Talos v1.2.7. The cluster is configured to use a VIP for the k8s API server. I am upgrading from k8s v1.25.0 to v1.25.6.
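For context, the upgrade is triggered by bumping the version field on the TalosControlPlane resource. A minimal sketch of the relevant part (the resource name matches the controller logs below; the other values are placeholders, not my exact manifest):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: leena-cp
  namespace: default
spec:
  replicas: 3            # placeholder; three control plane nodes
  version: v1.25.6       # bumped from v1.25.0 to start the rolling upgrade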

During the upgrade, CAPI stops etcd and the kube-apiserver on node talos-w2i-mhs. A new node running the updated version then joins the cluster and the etcd quorum.

$ kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
talos-mc8-ms6   Ready    control-plane   19h   v1.25.6
talos-tze-tdf   Ready    control-plane   20h   v1.25.6
talos-w2i-mhs   Ready    control-plane   21h   v1.25.0
talos-wdw-4wu   Ready    control-plane   20h   v1.25.6

At this point I verified that talos-tze-tdf, talos-mc8-ms6, and talos-wdw-4wu are listed as the etcd members.
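(Etcd membership can be listed directly from a Talos node, e.g.:

$ talosctl -n <control-plane-ip> etcd members

with <control-plane-ip> being any reachable control plane node; talos-w2i-mhs was no longer in the member list at this point.)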

The CACPPT controller log shows this error:

1.6747530128447304e+09  ERROR   Reconciler error        {"controller": "taloscontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "TalosControlPlane", "TalosControlPlane": {"name":"leena-cp","namespace":"default"}, "namespace": "default", "name": "leena-cp", "reconcileID": "424d4068-cf3a-444f-a356-1273d578f871", "error": ": expected to have 4 members, got 3", "errorCauses": [{"error": ": expected to have 4 members, got 3", "errorCauses": [{"error": ": expected to have 4 members, got 3"}]}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /.cache/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /.cache/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /.cache/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234
nathandotleeathpe commented 1 year ago

Another, possibly coincidental and unrelated, piece of information: talos-w2i-mhs was also hosting the VIP for the control plane right before etcd was stopped. Here is a snippet of the dmesg log from the node ("...153" is the node, "...246" is the VIP):

*.*.*.153: user: warning: [2023-01-25T20:46:09.092628048Z]: [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth0", "ip": "*.*.*.246"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.286423048Z]: [talos] removed address *.*.*.246/32 from "eth0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.625620048Z]: [talos] service[etcd](Finished): Service finished successfully
smira commented 1 year ago

Note: this looks like a failure to order tasks properly: the node is not fully removed (hence the extra etcd member), and that in turn blocks the remaining actions (?)

nathandotleeathpe commented 1 year ago

Yes. The node with the old version isn't removed. CAPI is configured to scale the cluster up by one node (from 3 to 4) before removing the old node, and the old node must be removed (scaling back down from 4 to 3) before CAPI can repeat the process for the next node.
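A quick way to watch this at the Machine level (a sketch using the standard CAPI control plane label; namespace may differ):

$ kubectl get machines -l cluster.x-k8s.io/control-plane -w

In the stuck state this keeps showing four control plane Machines, with the old one never finishing deletion.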

inf0rmatiker commented 1 year ago

The same is happening to me: 3 control plane nodes with a VIP, upgrading via CAPI by changing the Kubernetes version in the TalosControlPlane resource from 1.26.0 to 1.26.1. I end up with 4 etcd members, and one node is stuck terminating, most likely because the VIP moved to a different node/NIC and a service was temporarily unreachable.

Rebooting the "stuck" node works around the problem, but this should be a transient error that the controller recovers from on its own.
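(The reboot workaround is just the standard Talos command pointed at the stuck node, e.g.:

$ talosctl -n <stuck-node-ip> reboot

where <stuck-node-ip> is a placeholder for the affected node's address.)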

smira commented 1 year ago

Should be fixed in v0.5.0-alpha.2