siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

upgrade to 1.4.6 breaks etcd #7423

Closed: evanrich closed this 6 months ago

evanrich commented 1 year ago

Bug Report

Description

Have a 3-node cluster (192.168.xxx.{10,11,12}) running Talos 1.4.5 and Kubernetes 1.27.2.

Ran the following command to upgrade the first node:

talosctl upgrade -n 192.168.xxx.10 --image ghcr.io/siderolabs/installer:v1.4.6 

Logs

◳ watching nodes: [192.168.xxx.10]
    * 192.168.xxx.10: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 1 error(s) occurred:
    failed to remove member 5218892699632570834: etcdserver: server stopped

Environment

etcd just starts and stops over and over on the node, and I cannot re-run the upgrade command on it either.
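For reference, the etcd service state and its logs on that node can be inspected with something like the following (a sketch using the node IP from this report; both commands are read-only):

talosctl -n 192.168.xxx.10 service etcd   # show the etcd service state and recent events
talosctl -n 192.168.xxx.10 logs etcd      # show the etcd container logs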

evanrich commented 1 year ago

FWIW, resetting the node and upgrading it a second time worked. But I thought I'd post this here since an out-of-the-box 1.4.5 -> 1.4.6 upgrade had an issue.

smira commented 1 year ago

This issue doesn't have enough information to proceed. If the member wasn't removed, the node should come back healthy after a reboot. The root cause is not clear; it might be that some other machine was not healthy/ready at that point.

cehoffman commented 1 year ago

I encountered this exact same symptom. The only real difference is that I had already upgraded Kubernetes to 1.27.3.

talosctl upgrade -n 192.168.1.82 --image ghcr.io/siderolabs/installer:v1.4.6
◱ watching nodes: [192.168.1.82]
    * 192.168.1.82: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 1 error(s) occurred:
    failed to remove member 13637399532505415876: etcdserver: server stopped

This was from a 3-node Talos cluster on 1.4.5. There were the expected 3 etcd members at the start of this upgrade. Once the error stopped the upgrade, there were only 2 etcd members. I had already completed one node upgrade in the sequence, and everything had returned to healthy before continuing.

smira commented 1 year ago

@cehoffman can you please provide the full log if that happens again (--debug flag should be enough)?

or talosctl dmesg before the reboot, so that we can figure out what goes wrong.
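Concretely, that would look something like this (a sketch using the IP from the previous comment; the output file name is just an example):

talosctl upgrade -n 192.168.1.82 --image ghcr.io/siderolabs/installer:v1.4.6 --debug
talosctl -n 192.168.1.82 dmesg > node-dmesg.txt   # capture the node's kernel/Talos log before the reboot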

I've never seen this error in the automated tests so far, so this might be something new. There is nothing specific to etcd in 1.4.6, and the error happens on the node before the upgrade starts.

cehoffman commented 1 year ago

I'll keep that in mind. The next opportunity will be the next GA release.

I certainly hope it doesn't happen again. I'm now 1/4 for upgrade failures (this being the 3rd upgrade). I do care somewhat about the nodes' ephemeral disks, since Ceph also runs on top of these master nodes, so wiping is slightly painful. On the face of it, though, the error appears to be a race condition between detecting/receiving success of the etcd member removal and detecting that etcd has stopped. I say this since the etcd membership is removed and the etcd service is stopped, but Talos for whatever reason isn't prepared for etcd to be stopped yet.

smira commented 1 year ago

as for wiping, you can upgrade with the --preserve flag, which doesn't wipe the /var partition
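For example, the command from the original report with the flag added (a sketch; the image tag is just the one already used in this thread):

talosctl upgrade -n 192.168.xxx.10 --image ghcr.io/siderolabs/installer:v1.4.6 --preserve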

cehoffman commented 1 year ago

Oh, absolutely, regarding --preserve; it's the failure mode here that's at issue. To my knowledge, the failed upgrade requires a wipe to restore the node to a working state. Getting to maintenance mode to reapply the machine config, so the node can rejoin and further node upgrades in the cluster can proceed, is the only solution I'm aware of.
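Roughly, that recovery path looks like this, assuming the Talos API on the node still responds (a sketch; controlplane.yaml stands in for the node's actual machine config, and the IP is the one from the failed upgrade above):

talosctl -n 192.168.1.82 reset --graceful=false --reboot                 # wipe the node; it boots back into maintenance mode
talosctl apply-config --insecure -n 192.168.1.82 -f controlplane.yaml    # reapply the machine config once the node is in maintenance mode (its address may be a DHCP one there)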

smira commented 1 year ago

I'm still a bit lost on the sequence of events.

With --preserve, there's no etcd member removal process.

Without --preserve, Talos should remove the member from the etcd state, and then clean up the local member. If this happens but the upgrade fails after that, an automatic reboot will bring things back in order.
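If that automatic reboot doesn't happen for some reason, the manual equivalent would be something like the following (a sketch using the node IP from the original report):

talosctl -n 192.168.xxx.10 reboot   # manually reboot the half-upgraded node, which per the above should bring it back in order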

cehoffman commented 1 year ago

I did the upgrade here without --preserve, as I was trying to get a feel for what would be required and what would happen to Ceph in that situation. As for the restoration from failure, that did not happen. I waited several minutes, completing a separate task while waiting, for connectivity or service to be restored. Unfortunately, by the time I got to a position where I could use my OOB management, a hard reboot had happened, and things did not progress from there without a wipe. The node under upgrade was in an orphaned state, as best as I can describe it. It was very unexpected, considering I had been told about the built-in recovery procedures.

As a UX note: the CLI quit and did not wait for recovery. It might be nice for it to wait there for the restoration, or to remind the user of that fallback. I simply waited, since the node was headless and I really didn't want to embark on the physical steps necessary to restore OOB visibility. I had switched physical network ports, only to learn at this moment that Intel AMT wasn't available on the port I had switched to.

steverfrancis commented 1 year ago

I just tried to recreate this in various ways: upgrading an etcd non-leader first, then the leader, then a node when the leader was upgraded but another etcd member was not, and also aborting the upgrade ... and could not reproduce it. So more logs would be helpful, thanks!

smira commented 1 year ago

I found this issue in the CI:

[  108.395742] [talos] machine is running and ready {"component": "controller-runtime", "controller": "runtime.MachineStatusController"}
[  292.044006] [talos] upgrade request received: preserve false, staged false, force false, reboot mode DEFAULT
[  292.045343] [talos] validating "registry.dev.talos-systems.io/siderolabs/installer:v1.5.0-alpha.1-28-gce8495c10"
[  297.083754] [talos] upgrade sequence: 15 phase(s)
[  297.084519] [talos] phase drain (1/15): 1 tasks(s)
[  297.085192] [talos] task cordonAndDrainNode (1/1): starting
[  297.086735] [talos] task cordonAndDrainNode (1/1): waiting for node to be cordoned
[  297.118860] [talos] skipping mirror pod kube-system/kube-scheduler-master-1
[  297.119800] [talos] skipping mirror pod kube-system/kube-apiserver-master-1
[  297.124023] [talos] skipping mirror pod kube-system/kube-controller-manager-master-1
[  297.125310] [talos] skipping DaemonSet pod kube-system/kube-flannel-qjc5m
[  297.126186] [talos] skipping DaemonSet pod kube-system/kube-proxy-kk7km
[  297.127061] [talos] task cordonAndDrainNode (1/1): done, 41.871107ms
[  297.128051] [talos] phase drain (1/15): done, 43.5338ms
[  297.128720] [talos] phase cleanup (2/15): 1 tasks(s)
[  297.129361] [talos] task removeAllPods (1/1): starting
[  297.129993] [talos] task removeAllPods (1/1): waiting for kubelet lifecycle finalizers
[  297.131196] [talos] task removeAllPods (1/1): shutting down kubelet gracefully
[  297.311624] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "2 error(s) occurred:\n\tOperation cannot be fulfilled on nodes \"master-1\": the object has been modified; please apply your changes to the latest version and try again\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  297.690638] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  298.716217] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  300.360600] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  302.809501] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  304.274481] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  306.190312] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  307.151898] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2195, container kubelet)
[  307.269251] [talos] service[kubelet](Finished): Service finished successfully
[  307.289413] [talos] removing pod kube-system/kube-controller-manager-master-1 with network mode "NODE"
[  307.291568] [talos] removing pod kube-system/kube-proxy-kk7km with network mode "NODE"
[  307.292578] [talos] removing container kube-system/kube-controller-manager-master-1:kube-controller-manager
[  307.293831] [talos] removing pod kube-system/kube-scheduler-master-1 with network mode "NODE"
[  307.296636] [talos] removing pod kube-system/kube-flannel-qjc5m with network mode "NODE"
[  307.297734] [talos] removing container kube-system/kube-scheduler-master-1:kube-scheduler
[  307.298754] [talos] removing container kube-system/kube-proxy-kk7km:kube-proxy
[  307.299848] [talos] removing pod kube-system/kube-apiserver-master-1 with network mode "NODE"
[  307.301578] [talos] removing container kube-system/kube-flannel-qjc5m:install-cni
[  307.302565] [talos] stopping container kube-system/kube-apiserver-master-1:kube-apiserver
[  307.305168] [talos] removing container kube-system/kube-flannel-qjc5m:install-config
[  307.310707] [talos] stopping container kube-system/kube-flannel-qjc5m:kube-flannel
[  307.311812] [talos] removed container kube-system/kube-scheduler-master-1:kube-scheduler
[  307.312826] [talos] removed container kube-system/kube-controller-manager-master-1:kube-controller-manager
[  307.321500] [talos] removed container kube-system/kube-proxy-kk7km:kube-proxy
[  307.322457] [talos] removed container kube-system/kube-flannel-qjc5m:install-cni
[  307.325028] [talos] removed pod kube-system/kube-controller-manager-master-1
[  307.326874] [talos] removed container kube-system/kube-flannel-qjc5m:install-config
[  307.328756] [talos] removed pod kube-system/kube-proxy-kk7km
[  307.329567] [talos] removed pod kube-system/kube-scheduler-master-1
[  310.487979] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  317.412723] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  328.494210] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/master-1?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
[  337.455373] [talos] removing container kube-system/kube-flannel-qjc5m:kube-flannel
[  337.456903] [talos] removing container kube-system/kube-apiserver-master-1:kube-apiserver
[  337.460543] [talos] removed container kube-system/kube-flannel-qjc5m:kube-flannel
[  337.465667] [talos] removed container kube-system/kube-apiserver-master-1:kube-apiserver
[  337.528545] [talos] removed pod kube-system/kube-flannel-qjc5m
[  337.551604] [talos] removed pod kube-system/kube-apiserver-master-1
[  337.552832] [talos] task removeAllPods (1/1): done, 40.42539578s
[  337.553881] [talos] phase cleanup (2/15): done, 40.427090925s
[  337.554687] [talos] phase dbus (3/15): 1 tasks(s)
[  337.555489] [talos] task stopDBus (1/1): starting
[  337.556544] [talos] task stopDBus (1/1): done, 1.055451ms
[  337.557284] [talos] phase dbus (3/15): done, 2.598738ms
[  337.558404] [talos] phase leave (4/15): 1 tasks(s)
[  337.559236] [talos] task leaveEtcd (1/1): starting
[  337.969628] [talos] task leaveEtcd (1/1): failed: failed to leave cluster: 1 error(s) occurred:
[  337.971171]  failed to remove member 506074471964452443: etcdserver: server stopped
[  337.972425] [talos] phase leave (4/15): failed
[  337.973426] [talos] upgrade sequence: failed
[  337.974472] [talos] upgrade failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 1 error(s) occurred:
[  337.976818]  failed to remove member 506074471964452443: etcdserver: server stopped
[  337.979266] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1655, container apid)
[  337.980924] [talos] service[dashboard](Stopping): Sending SIGTERM to Process(["/sbin/dashboard"])
[  337.982673] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2090, container trustd)
[  337.984387] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2149, container etcd)

https://ci.dev.talos-systems.io/siderolabs/talos/35015/3/6

smira commented 1 year ago

So the core issue seems to be that the cluster experienced a controlplane node reboot while an upgrade was being performed on another node, so the error is real: some members failed to process the "removeMember" request because they were being stopped.

Not sure what happened to your clusters @evanrich and @cehoffman, but upgrades should be strictly sequential on the controlplane nodes: a single node should be upgraded at a time, and only when all nodes are up and running. Talos does some safety checks, but it can't prevent all kinds of problems.

What is more interesting is that in this case, despite the fact that the RemoveMember call failed, etcd still half-removed the member: it was removed on the other nodes, which prevents it from starting back up.
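One way to confirm that half-removed state is to check the member list as seen by the surviving nodes (a sketch using the other controlplane addresses from the original report):

talosctl -n 192.168.xxx.11,192.168.xxx.12 etcd members   # the failed node's member ID should no longer appear here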

cehoffman commented 1 year ago

Hmm. In my case I know positively there were no simultaneous upgrades. At that point I had done one upgrade just prior, but I did wait for the machine to reboot, come back, and restore functionality. I needed to wait for Ceph to return to OK from WARN. It was only after that point that I initiated the next upgrade, from the same terminal session as the previous talosctl invocation.

The cluster in question for me is a 3-node setup, so I have other things I'm balancing to keep availability during upgrades. I have since done several upgrades for custom kernel builds, though, and have not experienced it again. All were 1.4.6 -> 1.4.6.

smira commented 1 year ago

Yep, it feels like potentially an issue with etcd itself. I still don't have full info, but I believe it got reproduced in the integration tests because the tests incorrectly proceeded to perform an upgrade while one of the other nodes was rebooting.

cehoffman commented 1 year ago

I encountered this again, but this time I was capturing the serial output. I've attached it here in case it is of any help: failure to remove member.txt

smira commented 1 year ago

are there any specifics in the way this machine is configured? I see a custom installer, and lots of workloads running on a controlplane node as well. I will take a look at what we could do better here, but it's not clear what else might be going on there; I see many pods failing to be stopped/removed, so there seem to be a lot of moving parts.

cehoffman commented 1 year ago

The custom installer was for getting the as-yet-unreleased UAS driver for USB drives (https://github.com/siderolabs/pkgs/issues/751). The cluster is running Ceph to serve cluster storage. I have noticed that the pod eviction process hangs/times out if the pods being evicted don't have any other viable nodes to move to. For example, Vault's internal Raft storage is a 3-member cluster, just like etcd. In this cluster there are only 3 machines, so there is nowhere for evicted pods of any 3-replica anti-affinity deployment to go.

davidpanic commented 1 year ago

This is also happening when upgrading from 1.4.8 to 1.5.3; however, I've narrowed it down: it only happens when the node being upgraded currently has the cluster's configured VIP allocated to it. If the node is rebooted, the VIP moves to another node and the node can then be upgraded successfully.
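A quick way to check which node currently holds the VIP before starting an upgrade (a sketch, assuming the VIP and node addresses shown later in this thread):

talosctl -n 192.168.250.1,192.168.250.2,192.168.250.3 get addresses | grep 192.168.250.128   # only the current VIP holder lists it among its addresses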

smira commented 1 year ago

@davidpanic interesting... do you have the etcd members output from before the upgrade? Does it have the VIP as any of the addresses, by chance?

but that is good to know

davidpanic commented 1 year ago

Hah, funny that you say that: I did actually have watch -n 1 'talosctl -n 192.168.250.128 etcd members' running during the upgrade.

The output seemed to stay as shown below, other than the node being upgraded getting removed. 192.168.250.128 is my VIP, which I know I'm not really supposed to use for talosctl commands, but for queries like this it lets me keep the watch running in the background while the nodes are upgrading and still get output.

NODE              ID                 HOSTNAME       PEER URLS                     CLIENT URLS                   LEARNER
192.168.250.128   b0b856f97d24ed10   talos-test-2   https://192.168.250.2:2380   https://192.168.250.2:2379   false
192.168.250.128   d1a4d7081d9284c0   talos-test-1   https://192.168.250.1:2380   https://192.168.250.1:2379   false
192.168.250.128   d683e4ade2e0b3d6   talos-test-3   https://192.168.250.3:2380   https://192.168.250.3:2379   false

smira commented 1 year ago

thanks, so I wonder if that's just somehow related to the fact that Talos used the VIP address to connect to etcd. I will try to reproduce the issue.

davidpanic commented 1 year ago

I am using Cilium, which might have some impact on the behavior. These are my manifests:

Talos node configuration (click to expand) ```yaml version: v1alpha1 debug: false persist: true machine: type: controlplane token: "[REDACTED]" ca: crt: "[REDACTED]" key: "[REDACTED]" certSANs: [] kubelet: image: ghcr.io/siderolabs/kubelet:v1.27.1 defaultRuntimeSeccompProfileEnabled: true disableManifestsDirectory: true extraArgs: feature-gates: NewVolumeManagerReconstruction=false extraConfig: maxPods: 250 kubeReserved: cpu: "1" memory: 2Gi ephemeral-storage: 1Gi evictionHard: memory.available: 500Mi network: hostname: talos-test-1 interfaces: - interface: eth0 vip: ip: 192.168.250.128 addresses: - 192.168.250.1/24 routes: - network: 0.0.0.0/0 gateway: 192.168.250.254 nameservers: - 1.1.1.1 - 1.0.0.1 - 8.8.8.8 install: disk: /dev/vda image: ghcr.io/siderolabs/installer:v1.4.8 bootloader: true wipe: false registries: {} features: rbac: true apidCheckExtKeyUsage: true kubernetesTalosAPIAccess: enabled: true allowedRoles: - os:etcd:backup allowedKubernetesNamespaces: - kube-system time: disabled: false servers: - time.cloudflare.com sysctls: net.core.rmem_max: "2500000" net.core.wmem_max: "2500000" cluster: id: "[REDACTED]" secret: "[REDACTED]" controlPlane: endpoint: https://192.168.250.128:6443 clusterName: d-test allowSchedulingOnControlPlanes: true network: dnsDomain: cluster.local podSubnets: - 10.245.0.0/16 serviceSubnets: - 10.112.0.0/12 cni: name: none token: "[REDACTED]" secretboxEncryptionSecret: "[REDACTED]" ca: crt: "[REDACTED]" key: "[REDACTED]" aggregatorCA: crt: "[REDACTED]" key: "[REDACTED]" serviceAccount: key: "[REDACTED]" apiServer: image: registry.k8s.io/kube-apiserver:v1.27.1 certSANs: - 192.168.250.128 disablePodSecurityPolicy: true admissionControl: - name: PodSecurity configuration: apiVersion: pod-security.admission.config.k8s.io/v1alpha1 defaults: audit: restricted audit-version: latest enforce: baseline enforce-version: latest warn: restricted warn-version: latest exemptions: namespaces: - kube-system runtimeClasses: [] usernames: [] kind: PodSecurityConfiguration auditPolicy: apiVersion: audit.k8s.io/v1 kind: Policy rules: - level: Metadata controllerManager: image: registry.k8s.io/kube-controller-manager:v1.27.1 proxy: disabled: true scheduler: image: registry.k8s.io/kube-scheduler:v1.27.1 discovery: enabled: true registries: kubernetes: disabled: true service: {} etcd: ca: crt: "[REDACTED]" key: "[REDACTED]" extraManifests: [] inlineManifests: [] ```
Cilium values.yaml (click to expand) ```yaml upgradeCompatibility: '1.14' ipam: mode: kubernetes kubeProxyReplacement: true k8sServiceHost: 192.168.250.128 k8sServicePort: "6443" rollOutCiliumPods: true bandwidthManager: enabled: true bbr: true routingMode: native autoDirectNodeRoutes: true ipv4NativeRoutingCIDR: 0.0.0.0/32 securityContext: privileged: true bpf: masquerade: true bgpControlPlane: enabled: true loadBalancer: mode: dsr cgroup: autoMount: enabled: false hostRoot: /sys/fs/cgroup hubble: enabled: true relay: enabled: true ui: enabled: true prometheus: enabled: false serviceMonitor: enabled: false operator: prometheus: enabled: false serviceMonitor: enabled: false dashboards: enabled: false dashboards: enabled: false ```

Installation commands:

talosctl apply-config -n [node 1 DHCP IP] --insecure -f test/nodes/1.yaml
talosctl bootstrap -n 192.168.250.1
# Wait for bootstrap
helm install cilium cilium/cilium --namespace=kube-system -f cilium/values.yaml
# Wait for cilium pods to be running
talosctl apply-config -n [node 2 DHCP IP] --insecure -f test/nodes/2.yaml
talosctl apply-config -n [node 3 DHCP IP] --insecure -f test/nodes/3.yaml

The test/nodes/1.yaml, test/nodes/2.yaml and test/nodes/3.yaml files only differ by machine.network.hostname and machine.network.interfaces.addresses. They are just set so that I can quickly apply them one after the other for testing.

I'm running the following upgrade command:

talosctl upgrade -n 192.168.250.1 --image ghcr.io/siderolabs/installer:v1.5.3 --wait --debug

smira commented 1 year ago

thanks for providing so many details; I still couldn't reproduce this with the main branch.

it looks like service discovery is disabled on the Talos side in your config; this is not recommended, but it still works.

I tested with Cilium in strict mode and service discovery disabled. The VIP is removed before any actual upgrade actions start, and the VIP is not used by etcd, so it shouldn't really affect etcd operations in any way.

If you have any additional details (e.g. Talos logs), I could try to dig more, but without a way to reproduce it, it's hard to move on.