Closed nathandotleeathpe closed 1 year ago
Another, possibly coincidental/unrelated, bit of information was that talos-w2i-mhs also hosted the VIP for the control plane right before etcd was stopped. This is the snippet of the dmesg log from the node ("...153" is the node, "...246" is the VIP):
*.*.*.153: user: warning: [2023-01-25T20:46:09.092628048Z]: [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth0", "ip": "*.*.*.246"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.286423048Z]: [talos] removed address *.*.*.246/32 from "eth0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.625620048Z]: [talos] service[etcd](Finished): Service finished successfully
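To make the ordering in that snippet easier to see, here is a small Python sketch that pulls the timestamps out of the three lines above and measures the gap between the VIP removal and etcd finishing. The regex and the helper names (`to_ns`, `parse`, `gap_ns`) are my own and only assume the line layout shown in this snippet, not any documented Talos log format.

```python
import re
from datetime import datetime, timezone

# The three dmesg lines quoted above, unchanged.
LOG = '''
*.*.*.153: user: warning: [2023-01-25T20:46:09.092628048Z]: [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth0", "ip": "*.*.*.246"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.286423048Z]: [talos] removed address *.*.*.246/32 from "eth0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
*.*.*.153: user: warning: [2023-01-25T20:46:09.625620048Z]: [talos] service[etcd](Finished): Service finished successfully
'''

# Assumed layout: "<host>: <facility>: <level>: [<timestamp>]: [talos] <message>"
LINE_RE = re.compile(r"^\S+: \w+: \w+: \[(?P<ts>[^\]]+)\]: \[talos\] (?P<msg>.*)$")


def to_ns(ts):
    """Convert a UTC timestamp with nanosecond precision to integer nanoseconds."""
    base, frac = ts.rstrip("Z").split(".")
    dt = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def parse(log):
    """Return (nanoseconds, message) pairs for every line that matches LINE_RE."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((to_ns(m.group("ts")), m.group("msg")))
    return events


def gap_ns(events, earlier, later):
    """Nanoseconds between the first message containing `earlier` and the first containing `later`."""
    start = next(ns for ns, msg in events if earlier in msg)
    end = next(ns for ns, msg in events if later in msg)
    return end - start


events = parse(LOG)
gap = gap_ns(events, "removing shared IP", "service[etcd](Finished)")
print(f"VIP removal started {gap / 1e6:.3f} ms before etcd finished")
```

On these three lines the gap comes out to roughly 0.53 s, so the VIP was dropped from eth0 just before etcd reported that it had finished. That only shows the order of events on this node; it does not by itself show that one caused the other.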
Note: this looks like a failure to properly order tasks. The node is not fully removed (hence the extra etcd member), but that blocks other actions (?)
Yes. The node with the old version isn't removed. The way CAPI is configured, the cluster is scaled up by one node (from 3 to 4) before the old node is removed. The old node needs to be removed (scaled down from 4 to 3) before CAPI can repeat the process.
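As a toy model of that ordering (this is not CAPI's actual code; the node names and the `removal_succeeds` callback are made up for illustration), the scale-up-then-scale-down loop can be sketched like this:

```python
from typing import Callable, List, Tuple


def rolling_upgrade(
    old_nodes: List[str],
    removal_succeeds: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Replace each old node by scaling up by one, then scaling down by one.

    Returns the resulting member list and whether the rollout completed.
    """
    members = list(old_nodes)
    for i, old in enumerate(old_nodes):
        members.append(f"new-{i}")     # scale up, e.g. 3 -> 4
        if not removal_succeeds(old):  # the old node cannot be removed
            return members, False      # stuck with the extra member
        members.remove(old)            # scale down, e.g. 4 -> 3
    return members, True


nodes = ["cp-a", "cp-b", "cp-c"]  # hypothetical control plane nodes

# Every removal works: the rollout finishes with three new nodes.
print(rolling_upgrade(nodes, lambda n: True))

# Removing the first old node hangs: the rollout stops with four members.
print(rolling_upgrade(nodes, lambda n: n != "cp-a"))
```

In the second call the loop never reaches the next node, which matches the symptom in this issue: four members, one of them the old node that was never removed, and no further progress until it is gone.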
The same is happening to me. I'm using 3 control plane nodes with a VIP and trying to upgrade via CAPI by changing the k8s version in the TalosControlPlane resource from 1.26.0 to 1.26.1. I end up with 4 etcd members, and one is stuck in the Terminating stage, most likely due to the VIP being moved to a different node/NIC and a service being temporarily unreachable.
Rebooting the "stuck" node solves the problem, but then again, this should be an ephemeral error.
Should be fixed in v0.5.0-alpha.2
I have 3 control plane nodes provisioned with Sidero Metal using Talos v1.2.7. The cluster is configured to use a VIP for the k8s API server. I am upgrading from k8s v1.25.0 to v1.25.6.
During the upgrade, CAPI stops etcd and the kube API server on node talos-w2i-mhs. Then a new, updated node joins the cluster and joins the etcd quorum. At this point I verified that talos-tze-tdf, talos-mc8-ms6, and talos-wdw-4wu are listed as the etcd members. The CACPPT controller log shows this error: