siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Upgrading 1.7.1->1.7.2 causes etcd issue #8759

Closed evanrich closed 3 months ago

evanrich commented 3 months ago

Bug Report

Description

Upgrading a 4-node cluster node by node from 1.7.1 to 1.7.2: the first 2 nodes succeeded, and I waited 20 minutes after each one for things to settle. When I went to do the 3rd, I ran into an etcd issue similar to https://github.com/siderolabs/talos/issues/7423

192.168.5.10 and 192.168.5.11 upgraded successfully. When I went to upgrade the 3rd control plane node, 192.168.5.12, it failed almost immediately. My process is the following:

1.) cordon the node
2.) manually drain the node
3.) once node is drained, run upgrade
4.) wait for upgrade to complete, node to boot
5.) uncordon node once new version is confirmed
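The steps above can be sketched as a small dry-run helper that only prints the commands for one node, so they can be reviewed before anything is run. This is a rough illustration, not the exact commands used in this report: the Kubernetes node name `talos-cp-3` is hypothetical, the `kubectl drain` flags are an assumption about what "manually drain" involves, and `talosctl version` stands in for "confirm the new version".

```shell
#!/bin/sh
# Dry-run sketch: prints the per-node upgrade commands instead of executing them.

print_upgrade_steps() {
    node_name="$1"   # Kubernetes node name (hypothetical; not given in the report)
    node_ip="$2"     # Talos node IP, as passed to talosctl --nodes
    image="$3"       # installer image reference

    # 1.) cordon the node
    echo "kubectl cordon ${node_name}"
    # 2.) drain the node (flags are an assumption, adjust to your workloads)
    echo "kubectl drain ${node_name} --ignore-daemonsets --delete-emptydir-data"
    # 3.) run the upgrade once the node is drained
    echo "talosctl upgrade --nodes ${node_ip} --image ${image}"
    # 4.) after the node boots, confirm the reported version
    echo "talosctl version --nodes ${node_ip}"
    # 5.) uncordon once the new version is confirmed
    echo "kubectl uncordon ${node_name}"
}

# Example: the third control plane node from this report.
print_upgrade_steps "talos-cp-3" "192.168.5.12" \
    "factory.talos.dev/installer/01f4371278b976cd3df29b123980c03c1738a63a432af3812b204cbf3b6dcefc:v1.7.2"
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be pasted into a terminal one line at a time.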

I tried rebooting via PXE, but etcd just keeps crashing over and over. I'm not sure at this point whether I should run the upgrade with -f to force it?

Logs

evan@DESKTOP-MCOE11O:~$ talosctl upgrade --nodes 192.168.5.10 --image factory.talos.dev/installer/01f4371278b976cd3df29b123980c03c1738a63a432af3812b204cbf3b6dcefc:v1.7.2
WARNING: 192.168.5.10: server version 1.7.1 is older than client version 1.7.2
watching nodes: [192.168.5.10]
    * 192.168.5.10: post check passed
evan@DESKTOP-MCOE11O:~$ talosctl upgrade --nodes 192.168.5.11 --image factory.talos.dev/installer/01f4371278b976cd3df29b123980c03c1738a63a432af3812b204cbf3b6dcefc:v1.7.2
WARNING: 192.168.5.11: server version 1.7.1 is older than client version 1.7.2
watching nodes: [192.168.5.11]
    * 192.168.5.11: post check passed
evan@DESKTOP-MCOE11O:~$ talosctl upgrade --nodes 192.168.5.12 --image factory.talos.dev/installer/01f4371278b976cd3df29b123980c03c1738a63a432af3812b204cbf3b6dcefc:v1.7.2
WARNING: 192.168.5.12: server version 1.7.1 is older than client version 1.7.2
watching nodes: [192.168.5.12]
    * 192.168.5.12: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: 2 error(s) occurred:
    failed to remove member 4288372197695352128: etcdserver: server stopped
    failed to remove member 4288372197695352128: etcdserver: member not found

Environment

evanrich commented 3 months ago

ok fwiw, doing the upgrade with --force fixed everything.
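For anyone landing here with the same error, the forced retry might look roughly like the sketch below. It only prints the commands. Checking `talosctl etcd members` against an already-upgraded node first is an added suggestion, not something done in this report, and the choice of 192.168.5.10 as the healthy node is an assumption. Note that `--force` skips the etcd health and membership checks, so it is worth looking at the member list before using it.

```shell
#!/bin/sh
# Dry-run sketch: prints a membership check followed by the forced upgrade command.

print_force_upgrade_steps() {
    healthy_ip="$1"  # a control plane node that already upgraded cleanly (assumed healthy)
    failed_ip="$2"   # the node whose upgrade failed
    image="$3"       # installer image reference

    # Inspect current etcd membership from a healthy node (suggested pre-check).
    echo "talosctl etcd members --nodes ${healthy_ip}"
    # Retry the upgrade, skipping the etcd checks that failed.
    echo "talosctl upgrade --nodes ${failed_ip} --image ${image} --force"
}

# Example using the addresses from this report.
print_force_upgrade_steps "192.168.5.10" "192.168.5.12" \
    "factory.talos.dev/installer/01f4371278b976cd3df29b123980c03c1738a63a432af3812b204cbf3b6dcefc:v1.7.2"
```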