Closed: evanrich closed this issue 6 months ago.
FWIW, resetting the node and upgrading it a second time worked. But I thought I'd post this here since an out-of-the-box 1.4.5 -> 1.4.6 upgrade had an issue.
This issue doesn't have enough information to proceed. If the member wasn't removed, the node should come back healthy after a reboot. The root cause is not clear; it might be that some other machine was not healthy/ready at that point.
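For reference, before starting the next node's upgrade you can confirm the previous one came back and the cluster is ready; something like the following (the node IP is a placeholder):
talosctl -n <controlplane-ip> health
talosctl -n <controlplane-ip> etcd members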
I encountered this exact same symptom. The only real difference is I had already done an upgrade to Kubernetes 1.27.3.
talosctl upgrade -n 192.168.1.82 --image ghcr.io/siderolabs/installer:v1.4.6
◱ watching nodes: [192.168.1.82]
* 192.168.1.82: 1 error(s) occurred:
sequence error: sequence failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 1 error(s) occurred:
failed to remove member 13637399532505415876: etcdserver: server stopped
This was from a 3-node Talos cluster on 1.4.5. There were the expected 3 etcd members at the start of this upgrade. Once the error stopped the upgrade, there were only 2 etcd members. I had already completed one node upgrade in the sequence, and everything had returned to healthy before continuing.
@cehoffman can you please provide the full log if that happens again (the --debug flag should be enough), or talosctl dmesg before the reboot, so that we can figure out what goes wrong?
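Something along these lines (the node IP is a placeholder; the image tag is just the one from your report):
talosctl upgrade -n <node-ip> --image ghcr.io/siderolabs/installer:v1.4.6 --debug
talosctl -n <node-ip> dmesg > dmesg-before-reboot.txt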
I've never seen this error so far in the automated tests, so this might be something new. There is nothing specific to etcd in 1.4.6, and the error happens on the node before the upgrade starts.
I'll keep that in mind. The next opportunity will be the next GA release.
I certainly hope it doesn't happen again. I'm now 1/4 for upgrade failures (this being the 3rd upgrade). I actually care somewhat about the nodes' ephemeral disks, since Ceph also runs on top of these master nodes, so wiping is slightly painful. On the face of it, though, the error appears to be a race condition between receiving confirmation of the etcd member removal and detecting that etcd has stopped. I say this because the etcd membership is removed and the etcd service is stopped, but Talos for whatever reason isn't prepared for etcd to be stopped yet.
As for wiping, you can upgrade with the --preserve flag, which doesn't wipe the /var partition.
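For example (the image tag is shown only for illustration):
talosctl upgrade -n <node-ip> --image ghcr.io/siderolabs/installer:v1.4.6 --preserve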
Oh absolutely, regarding preserve; the failure mode here is the issue. The failed upgrade requires (to my knowledge) a wipe to restore the node to a state it can proceed from. Getting to maintenance mode to reapply the machine config, rejoin, and allow further node upgrades in the cluster is the only solution I'm aware of.
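Roughly, the recovery path I mean looks like this (a sketch; the node IP and config file name are placeholders, and the reset flags may need adjusting for a given setup):
# wipe the node back to maintenance mode (this is what loses /var, hence the Ceph pain)
talosctl -n <node-ip> reset --graceful=false --reboot
# once the node boots into maintenance mode, reapply the machine config so it can rejoin
talosctl apply-config -n <node-ip> --insecure -f controlplane.yaml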
I'm still a bit lost on the sequence of events.
With --preserve, there's no etcd member removal process. Without --preserve, Talos should remove the member from etcd state and then clean up the local member. If this happens, but the upgrade fails after that, an automatic reboot will bring things back in order.
I did the upgrade here without --preserve, as I was trying to get a feel for what would be required and what would happen to Ceph in that situation. As for recovery from the failure, that did not happen. I waited several minutes, completing a separate task in the meantime, for connectivity or service to be restored. Unfortunately, by the time I got to a position where I could use my out-of-band management, a hard reboot had already happened, and things did not progress from there without a wipe. The node under upgrade was in what I can best describe as an orphaned state. It was very unexpected, considering the built-in recovery procedures I had been told about.
As a UX note: the CLI quit and did not wait for recovery. It might be nice to wait there for the restoration, or to remind the user of that fallback. I simply waited, since the node was headless and I really didn't want to embark on the physical steps necessary to restore out-of-band visibility. I had switched physical network ports, only to learn at that moment that Intel AMT wasn't available on the port I switched to.
I just tried to recreate this in various ways: upgrading an etcd non-leader first, then the leader, then upgrading a node when the leader had been upgraded but another etcd member had not, and also aborting the upgrade... and could not reproduce it. So more logs would be helpful, thanks!
I found this issue in the CI:
[ 108.395742] [talos] machine is running and ready {"component": "controller-runtime", "controller": "runtime.MachineStatusController"}
[ 292.044006] [talos] upgrade request received: preserve false, staged false, force false, reboot mode DEFAULT
[ 292.045343] [talos] validating "registry.dev.talos-systems.io/siderolabs/installer:v1.5.0-alpha.1-28-gce8495c10"
[ 297.083754] [talos] upgrade sequence: 15 phase(s)
[ 297.084519] [talos] phase drain (1/15): 1 tasks(s)
[ 297.085192] [talos] task cordonAndDrainNode (1/1): starting
[ 297.086735] [talos] task cordonAndDrainNode (1/1): waiting for node to be cordoned
[ 297.118860] [talos] skipping mirror pod kube-system/kube-scheduler-master-1
[ 297.119800] [talos] skipping mirror pod kube-system/kube-apiserver-master-1
[ 297.124023] [talos] skipping mirror pod kube-system/kube-controller-manager-master-1
[ 297.125310] [talos] skipping DaemonSet pod kube-system/kube-flannel-qjc5m
[ 297.126186] [talos] skipping DaemonSet pod kube-system/kube-proxy-kk7km
[ 297.127061] [talos] task cordonAndDrainNode (1/1): done, 41.871107ms
[ 297.128051] [talos] phase drain (1/15): done, 43.5338ms
[ 297.128720] [talos] phase cleanup (2/15): 1 tasks(s)
[ 297.129361] [talos] task removeAllPods (1/1): starting
[ 297.129993] [talos] task removeAllPods (1/1): waiting for kubelet lifecycle finalizers
[ 297.131196] [talos] task removeAllPods (1/1): shutting down kubelet gracefully
[ 297.311624] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "2 error(s) occurred:\n\tOperation cannot be fulfilled on nodes \"master-1\": the object has been modified; please apply your changes to the latest version and try again\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 297.690638] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 298.716217] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 300.360600] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 302.809501] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 304.274481] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 306.190312] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 307.151898] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2195, container kubelet)
[ 307.269251] [talos] service[kubelet](Finished): Service finished successfully
[ 307.289413] [talos] removing pod kube-system/kube-controller-manager-master-1 with network mode "NODE"
[ 307.291568] [talos] removing pod kube-system/kube-proxy-kk7km with network mode "NODE"
[ 307.292578] [talos] removing container kube-system/kube-controller-manager-master-1:kube-controller-manager
[ 307.293831] [talos] removing pod kube-system/kube-scheduler-master-1 with network mode "NODE"
[ 307.296636] [talos] removing pod kube-system/kube-flannel-qjc5m with network mode "NODE"
[ 307.297734] [talos] removing container kube-system/kube-scheduler-master-1:kube-scheduler
[ 307.298754] [talos] removing container kube-system/kube-proxy-kk7km:kube-proxy
[ 307.299848] [talos] removing pod kube-system/kube-apiserver-master-1 with network mode "NODE"
[ 307.301578] [talos] removing container kube-system/kube-flannel-qjc5m:install-cni
[ 307.302565] [talos] stopping container kube-system/kube-apiserver-master-1:kube-apiserver
[ 307.305168] [talos] removing container kube-system/kube-flannel-qjc5m:install-config
[ 307.310707] [talos] stopping container kube-system/kube-flannel-qjc5m:kube-flannel
[ 307.311812] [talos] removed container kube-system/kube-scheduler-master-1:kube-scheduler
[ 307.312826] [talos] removed container kube-system/kube-controller-manager-master-1:kube-controller-manager
[ 307.321500] [talos] removed container kube-system/kube-proxy-kk7km:kube-proxy
[ 307.322457] [talos] removed container kube-system/kube-flannel-qjc5m:install-cni
[ 307.325028] [talos] removed pod kube-system/kube-controller-manager-master-1
[ 307.326874] [talos] removed container kube-system/kube-flannel-qjc5m:install-config
[ 307.328756] [talos] removed pod kube-system/kube-proxy-kk7km
[ 307.329567] [talos] removed pod kube-system/kube-scheduler-master-1
[ 310.487979] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 317.412723] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 328.494210] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[ 337.455373] [talos] removing container kube-system/kube-flannel-qjc5m:kube-flannel
[ 337.456903] [talos] removing container kube-system/kube-apiserver-master-1:kube-apiserver
[ 337.460543] [talos] removed container kube-system/kube-flannel-qjc5m:kube-flannel
[ 337.465667] [talos] removed container kube-system/kube-apiserver-master-1:kube-apiserver
[ 337.528545] [talos] removed pod kube-system/kube-flannel-qjc5m
[ 337.551604] [talos] removed pod kube-system/kube-apiserver-master-1
[ 337.552832] [talos] task removeAllPods (1/1): done, 40.42539578s
[ 337.553881] [talos] phase cleanup (2/15): done, 40.427090925s
[ 337.554687] [talos] phase dbus (3/15): 1 tasks(s)
[ 337.555489] [talos] task stopDBus (1/1): starting
[ 337.556544] [talos] task stopDBus (1/1): done, 1.055451ms
[ 337.557284] [talos] phase dbus (3/15): done, 2.598738ms
[ 337.558404] [talos] phase leave (4/15): 1 tasks(s)
[ 337.559236] [talos] task leaveEtcd (1/1): starting
[ 337.969628] [talos] task leaveEtcd (1/1): failed: failed to leave cluster: 1 error(s) occurred:
[ 337.971171] failed to remove member 506074471964452443: etcdserver: server stopped
[ 337.972425] [talos] phase leave (4/15): failed
[ 337.973426] [talos] upgrade sequence: failed
[ 337.974472] [talos] upgrade failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 1 error(s) occurred:
[ 337.976818] failed to remove member 506074471964452443: etcdserver: server stopped
[ 337.979266] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1655, container apid)
[ 337.980924] [talos] service[dashboard](Stopping): Sending SIGTERM to Process(["/sbin/dashboard"])
[ 337.982673] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2090, container trustd)
[ 337.984387] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2149, container etcd)
So the core issue seems to be that the cluster experienced a controlplane node reboot while an upgrade was being performed on another node. The error is real: some members failed to process the RemoveMember request because they were being stopped.
Not sure what happened to your clusters @evanrich and @cehoffman, but upgrades should be strictly sequential on the controlplane nodes: only when all nodes are up and running should a single node be upgraded at a time. Talos does some safety checks, but it can't prevent all kinds of problems.
What is more interesting is that in this case, despite the RemoveMember call failing, etcd still half-removed the member: it was removed on the other nodes, which prevents it from starting back up.
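If anyone hits this state again, comparing the member list as seen from a surviving node with the state on the failed node should confirm it; something like (node IPs are placeholders):
talosctl -n <surviving-node-ip> etcd members
talosctl -n <failed-node-ip> service etcd
talosctl -n <failed-node-ip> logs etcd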
Hmm. In my case I know positively there were no simultaneous upgrades. At that point I had done one upgrade just prior, but I did wait for the machine to reboot, come back, and restore functionality; I needed to wait for Ceph to return to OK from WARN. It was only after that point that I initiated the next upgrade, from the same terminal session as the previous talosctl invocation.
The cluster in question for me is a 3 node setup, so I have other things I'm balancing to keep availability during upgrades. I have done several upgrades though since for custom kernel builds and have not experienced it again. All for 1.4.6->1.4.6.
Yep, it feels like potentially an issue with etcd itself. I still don't have full info, but I believe in the integration tests this got reproduced because the tests incorrectly proceeded to perform an upgrade while one of the other nodes was rebooting.
I encountered this again, but this time while I was capturing the serial output. I've attached it here if it is of any help. failure to remove member.txt
Is there anything specific about the way this machine is configured? I see a custom installer, and lots of workloads running on a controlplane node as well. I will take a look at what we could do better here, but it's not clear what else might be going on: I see many pods failing to be stopped/removed, so there seem to be a lot of moving parts.
The custom installer was for getting the not-yet-released UAS driver for USB drives (https://github.com/siderolabs/pkgs/issues/751). The node is running Ceph to serve cluster storage. I have noticed that the pod eviction process hangs/times out if the pods being evicted don't have any viable node to move to. For example, Vault's internal raft storage is a 3-member cluster just like etcd. There are only 3 machines in this cluster, so there is nowhere for evicted pods from any 3-replica anti-affinity deployment to be placed.
This is also happening when upgrading from 1.4.8 to 1.5.3, however I've narrowed it down: it only happens when the node being upgraded currently has the cluster's configured VIP allocated to it. If the node is rebooted, the VIP moves to another node and the node can then be upgraded successfully.
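So before upgrading a node I now check whether it currently holds the VIP; something like this (these are my addresses, and I'm assuming the addresses resource is exposed through talosctl get on this version):
talosctl -n 192.168.250.1 get addresses | grep 192.168.250.128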
@davidpanic interesting... do you have the etcd members output from before the upgrade? Does it by chance have the VIP as any of the addresses? But that is good to know.
Hah, funny that you say that, I did actually have watch -n 1 'talosctl -n 192.168.250.128 etcd members' running during the upgrade. The output seemed to stay like it is below, other than the node that was being upgraded getting removed. 192.168.250.128 is my VIP, which I know I'm not really supposed to use for talosctl commands, but for queries like this it allows me to keep it running in the background while the nodes are upgrading and still get output.
NODE ID HOSTNAME PEER URLS CLIENT URLS LEARNER
192.168.250.128 b0b856f97d24ed10 talos-test-2 https://192.168.250.2:2380 https://192.168.250.2:2379 false
192.168.250.128 d1a4d7081d9284c0 talos-test-1 https://192.168.250.1:2380 https://192.168.250.1:2379 false
192.168.250.128 d683e4ade2e0b3d6 talos-test-3 https://192.168.250.3:2380 https://192.168.250.3:2379 false
Thanks, so I wonder if that's just somehow related to the fact that Talos used the VIP address to connect to etcd. I will try to reproduce the issue.
I am using Cilium, which might have some impact on the behavior. These are my manifests:
Installation commands:
talosctl apply-config -n [node 1 DHCP IP] --insecure -f test/nodes/1.yaml
talosctl bootstrap -n 192.168.250.1
# Wait for bootstrap
helm install cilium cilium/cilium --namespace=kube-system -f cilium/values.yaml
# Wait for cilium pods to be running
talosctl apply-config -n [node 2 DHCP IP] --insecure -f test/nodes/2.yaml
talosctl apply-config -n [node 3 DHCP IP] --insecure -f test/nodes/3.yaml
The test/nodes/1.yaml, test/nodes/2.yaml and test/nodes/3.yaml files only differ by machine.network.hostname and machine.network.interfaces.addresses. They are just set so that I can quickly apply them one after the other for testing.
I'm running the following upgrade command:
talosctl upgrade -n 192.168.250.1 --image ghcr.io/siderolabs/installer:v1.5.3 --wait --debug
Thanks for providing so many details; I still couldn't reproduce this with the main branch.
It looks like service discovery is disabled on the Talos side in your config; this is not recommended, but still works.
I tested with Cilium in strict mode and service discovery disabled. The VIP is removed before any actual upgrade actions start, and the VIP is not used by etcd, so it shouldn't really affect etcd operations in any way.
If you have any additional details (e.g. Talos logs?), I could try to dig more, but without a way to reproduce it's hard to move on.
Bug Report
Description
Have a 3-node cluster: 192.168.xxx.{10,11,12}, Talos 1.4.5, K8s 1.27.2
Ran the following command to update first node:
Logs
Environment
etcd just starts/stops/starts/stops on the node, and I cannot re-run the upgrade command on it either.
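For reference, the flapping etcd service on the affected node can be inspected with, for example (node IP as in the description above):
talosctl -n 192.168.xxx.10 service etcd
talosctl -n 192.168.xxx.10 logs etcd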