k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

When upgrading an HA k3s cluster, etcd members get removed #7613

Closed arturp89 closed 1 year ago

arturp89 commented 1 year ago

Environmental Info:
K3s Version: upgrading from 1.21.4 to 1.22.17
Upgraded version output: k3s version v1.22.17+k3s1 (3ed243df)
go version go1.16.15

Node(s) CPU architecture, OS, and Version:
Linux node2 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux node1 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux node3 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 server nodes

Describe the bug: We are trying to upgrade an HA k3s cluster from 1.21.4 to 1.22.17. We are doing it in a "custom" way, following your guidelines for manually upgrading an airgapped environment, on a cluster with 3 master (server) nodes, as described in the steps below.

We noticed that after restarting the leader node, etcd removes the two non-leader members from the etcd cluster, and the only way to resolve the issue is to manually restart the k3s service on each of those nodes to trigger them to rejoin the cluster.
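A minimal sketch of checking the etcd member list and forcing the rejoin (assumes a separately installed etcdctl and the default k3s certificate paths under /var/lib/rancher/k3s):

```bash
# List the members the embedded etcd cluster currently knows about
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list -w table

# On each node that was dropped from the member list, restart k3s so it rejoins,
# then wait for the node to report Ready again
systemctl restart k3s
kubectl get nodes -o wide
```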

Steps To Reproduce:

The kubelet config file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 600s
shutdownGracePeriodCriticalPods: 300s


k3s systemd service file:

[Unit]
Description=Aurora Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server


- Add nodes to the cluster
   Add two more server nodes by joining them to the cluster using the token.
   These nodes have configs identical to the above, with an additional "server" field pointing to the first node.

- Update the cluster to v1.22.17+k3s1 via the procedure already described above:
      - push new airgapped images to private registry 
      - For each of the two nodes that are not the k3s/etcd leader, we do the following in sequential order:

              1. Replace the airgapped images package and the k3s binary
              2. Remove config options unsupported in the new k3s version (the ServiceTopology feature gate)
              3. Restart the k3s service
              4. Validate that all system components are ready

- After the above steps finish for each of the 2 nodes, we perform the same steps for the leader node (a rough sketch of the per-node steps is shown below).
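A rough sketch of what steps 1–4 look like on each node (file names and paths are assumptions based on the generic k3s air-gap docs; our actual tooling is driven by a CLI):

```bash
# Illustrative per-node upgrade (run on the node being upgraded)

# 1. Replace the airgapped images package and the k3s binary
cp k3s-airgap-images-amd64.tar /var/lib/rancher/k3s/agent/images/
install -m 755 k3s /usr/local/bin/k3s

# 2. Remove config options unsupported by the new version
#    (e.g. the ServiceTopology feature gate, removed in 1.22)
vi /etc/rancher/k3s/config.yaml

# 3. Restart the k3s service
systemctl restart k3s

# 4. Validate that all system components are ready
kubectl get nodes
kubectl get pods -n kube-system
```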

**Expected behavior:**
All of the nodes rejoin automatically after the restarts, and we have a working k3s v1.22.17 cluster at the end.

**Actual behavior:**
After following the above procedure, the first two nodes restart and rejoin the existing cluster successfully.
However, after restarting the final (leader) node, it removes the first two nodes from the etcd cluster, and we end up
with two nodes in the "NotReady" state that must be restarted again in order to rejoin the etcd cluster
and become "Ready".

**Additional context / logs:**
I attached the journalctl logs of all 3 nodes from around the time of the restarts.
You can see that at 16:14:09 in the logs of node1 and node3, they are removed from the etcd cluster:

May 26 16:14:09 node1 k3s[98366]: {"level":"warn","ts":"2023-05-26T16:14:09.647+0300","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"b91b8e6da856208d","error":"failed to dial b91b8e6da856208d on stream MsgApp v2 (the member has been permanently removed from the cluster)"}
May 26 16:14:09 node1 k3s[98366]: time="2023-05-26T16:14:09+03:00" level=info msg="this node has been removed from the cluster please restart k3s to rejoin the cluster"
May 26 16:14:09 node1 k3s[98366]: {"level":"warn","ts":"2023-05-26T16:14:09.647+0300","caller":"etcdserver/server.go:1150","msg":"server error","error":"the member has been permanently removed from the cluster"}
May 26 16:14:09 node1 k3s[98366]: {"level":"warn","ts":"2023-05-26T16:14:09.647+0300","caller":"etcdserver/server.go:1151","msg":"data-dir used by this member must be removed"}

May 26 16:14:09 node3 k3s[116420]: {"level":"warn","ts":"2023-05-26T16:14:09.715+0300","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"b91b8e6da856208d","error":"failed to dial b91b8e6da856208d on stream MsgApp v2 (the member has been permanently removed from the cluster)"}
May 26 16:14:09 node3 k3s[116420]: time="2023-05-26T16:14:09+03:00" level=info msg="this node has been removed from the cluster please restart k3s to rejoin the cluster"
May 26 16:14:09 node3 k3s[116420]: {"level":"warn","ts":"2023-05-26T16:14:09.715+0300","caller":"etcdserver/server.go:1150","msg":"server error","error":"the member has been permanently removed from the cluster"}
May 26 16:14:09 node3 k3s[116420]: {"level":"warn","ts":"2023-05-26T16:14:09.715+0300","caller":"etcdserver/server.go:1151","msg":"data-dir used by this member must be removed"}
May 26 16:14:09 node3 k3s[116420]: {"level":"info","ts":"2023-05-26T16:14:09.715+0300","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"db71ac7cc58ff7cf"}


This is when the leader node (node2) appears to start the etcd cluster after the upgrade:

May 26 16:14:08 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:08.995+0300","caller":"embed/serve.go:140","msg":"serving client traffic insecurely; this is strongly discouraged!","address":"127.0.0.1:2399"}
May 26 16:14:08 node2 k3s[57058]: time="2023-05-26T16:14:08+03:00" level=info msg="Defragmenting etcd database"
May 26 16:14:08 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:08.998+0300","caller":"v3rpc/maintenance.go:89","msg":"starting defragment"}
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.001+0300","caller":"backend/backend.go:497","msg":"defragmenting","path":"/var/lib/rancher/k3s/server/db/etcd-tmp/member/snap/db","current-db-size-bytes":36290560,"current-db-size":"36 MB","current-db-size-in-use-bytes":14327808,"current-db-size-in-use":"14 MB"}
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.183+0300","caller":"backend/backend.go:549","msg":"finished defragmenting directory","path":"/var/lib/rancher/k3s/server/db/etcd-tmp/member/snap/db","current-db-size-bytes-diff":-22413312,"current-db-size-bytes":13877248,"current-db-size":"14 MB","current-db-size-in-use-bytes-diff":-458752,"current-db-size-in-use-bytes":13869056,"current-db-size-in-use":"14 MB","took":"184.81028ms"}
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.183+0300","caller":"v3rpc/maintenance.go:95","msg":"finished defragment"}
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="etcd temporary data store connection OK"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="Reconciling bootstrap data between datastore and disk"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="Migrating bootstrap data to new format"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ServerCAKey at '/var/lib/rancher/k3s/server/tls/server-ca.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ServiceKey at '/var/lib/rancher/k3s/server/tls/service.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ClientCA at '/var/lib/rancher/k3s/server/tls/client-ca.crt'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling PasswdFile at '/var/lib/rancher/k3s/server/cred/passwd'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling RequestHeaderCA at '/var/lib/rancher/k3s/server/tls/request-header-ca.crt'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling RequestHeaderCAKey at '/var/lib/rancher/k3s/server/tls/request-header-ca.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ServerCA at '/var/lib/rancher/k3s/server/tls/server-ca.crt'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling IPSECKey at '/var/lib/rancher/k3s/server/cred/ipsec.psk'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ClientCAKey at '/var/lib/rancher/k3s/server/tls/client-ca.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ETCDPeerCA at '/var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ETCDPeerCAKey at '/var/lib/rancher/k3s/server/tls/etcd/peer-ca.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ETCDServerCA at '/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=debug msg="Reconciling ETCDServerCAKey at '/var/lib/rancher/k3s/server/tls/etcd/server-ca.key'"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="stopping etcd"
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.192+0300","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"node2-135ff765","data-dir":"/var/lib/rancher/k3s/server/db/etcd-tmp","advertise-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"]}
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="certificate CN=system:apiserver,O=system:masters signed by CN=k3s-client-ca@1685005852: notBefore=2023-05-25 09:10:52 +0000 UTC notAfter=2024-05-25 13:14:09 +0000 UTC"
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.194+0300","caller":"etcdserver/server.go:1453","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"b91b8e6da856208d","current-leader-member-id":"b91b8e6da856208d"}
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1685005852: notBefore=2023-05-25 09:10:52 +0000 UTC notAfter=2024-05-25 13:14:09 +0000 UTC"
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.202+0300","caller":"embed/etcd.go:563","msg":"stopping serving peer traffic","address":"127.0.0.1:2400"}
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.202+0300","caller":"embed/etcd.go:568","msg":"stopped serving peer traffic","address":"127.0.0.1:2400"}
May 26 16:14:09 node2 k3s[57058]: {"level":"info","ts":"2023-05-26T16:14:09.202+0300","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"node2-135ff765","data-dir":"/var/lib/rancher/k3s/server/db/etcd-tmp","advertise-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"]}
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1685005852: notBefore=2023-05-25 09:10:52 +0000 UTC notAfter=2024-05-25 13:14:09 +0000 UTC"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1685005852: notBefore=2023-05-25 09:10:52 +0000 UTC notAfter=2024-05-25 13:14:09 +0000 UTC"
May 26 16:14:09 node2 k3s[57058]: time="2023-05-26T16:14:09+03:00" level=info msg="Starting etcd for existing cluster member"



[node1.log](https://github.com/k3s-io/k3s/files/11576030/node1.log)
[node2.log](https://github.com/k3s-io/k3s/files/11576047/node2.log)
[node3.log](https://github.com/k3s-io/k3s/files/11576058/node3.log)
brandond commented 1 year ago

May 26 16:14:09 node1 k3s[98366]: time="2023-05-26T16:14:09+03:00" level=info msg="this node has been removed from the cluster please restart k3s to rejoin the cluster

Something is deleting the Kubernetes node object from the cluster, which triggers removal from the etcd cluster as well. This is the only thing that will cause this. Have you deployed a cloud provider integration, or anything else that might be trying to manage your nodes? I see you've disabled the default cloud controller, so you must have replaced it with something. That is the most likely culprit.
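A couple of illustrative commands for hunting down such a component (hypothetical examples, adjust to your deployment):

```bash
# Look for third-party controllers/operators that might manage Node objects
kubectl get pods -A -o wide | grep -Ei 'cloud|operator|controller'

# Check recent cluster events around the time the members disappeared
kubectl get events -A --sort-by=.lastTimestamp | grep -i node
```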

arturp89 commented 1 year ago

We have no component that manages the nodes. We ship a CLI that allows the user to perform various actions on the cluster, but there is no controller that actively monitors and manages the nodes in some way. As for the deletion of the node objects, nothing is or should be deleting them during this process as far as I can tell; furthermore, the node objects exist at all points (they just transition into the "NotReady" state once their members are deleted from the etcd cluster).

However, a possibly important detail I might have omitted is that we have a symlink to the etcd directory. Looking at the files in the directories, they appear to be out of sync, with one folder containing a "tombstone" file created at the time the node was removed from the etcd cluster. This is the case for the two nodes that were deleted from the etcd cluster, and it appears that the symlinks are removed (due to some fault in our process, I assume); the directories for the "leader" node are in sync. Note that I manually restarted the two nodes to cause them to rejoin the cluster, which probably created and updated the new files. Do you think this might be causing the issue?

[root@node1 ~]# ll -la /var/lib/rancher/k3s/server/db/etcd
total 8
drwx------ 3 root root   46 May 27 21:02 .
drwxr-x--- 6 root root  152 May 27 21:02 ..
-rw------- 1 root root 1099 May 27 21:02 config
drwx------ 4 root root   29 May 27 21:02 member
-rw------- 1 root root   14 May 27 21:02 name
[root@node1 ~]# ll -la <symlinked path>/etcd/
total 8
drwx------ 3 root root  63 May 26 16:14 .
drwxr-x--- 3 root root  18 May 25 12:40 ..
-rw------- 1 root root 965 May 26 16:12 config
drwx------ 4 root root  29 May 26 16:12 member
-rw------- 1 root root  14 May 26 15:59 name
-rw------- 1 root root   0 May 26 16:14 tombstone
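A minimal sketch for double-checking where the etcd data directory actually resolves and whether a tombstone is present (paths are the k3s defaults; the symlink target is environment-specific and redacted above):

```bash
# Show the real path behind the etcd directory (if it is a symlink)
readlink -f /var/lib/rancher/k3s/server/db/etcd

# Look for tombstone files left behind in any etcd directories
find /var/lib/rancher/k3s/server/db -maxdepth 2 -name tombstone -ls
```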
brandond commented 1 year ago

The tombstone file is created by the etcd node controller when the Kubernetes node is deleted, to prevent it from automatically rejoining the cluster.

What cloud controller have you deployed in place of the default one, which you have disabled? I understand that you don't believe it is managing nodes, but something is.

brandond commented 1 year ago

If in doubt you might enable audit logging, and see where the delete request is coming from.
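A minimal sketch of enabling that on a k3s server (the policy file, log path, and values below are illustrative assumptions; the flags are passed straight through to kube-apiserver):

```bash
# Audit policy that records delete requests against Node objects
cat > /etc/rancher/k3s/audit-policy.yaml <<'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ["delete"]
    resources:
      - group: ""
        resources: ["nodes"]
  - level: None
EOF

# Start k3s with the audit flags (these could also go into config.yaml
# as kube-apiserver-arg entries)
k3s server \
  --kube-apiserver-arg=audit-policy-file=/etc/rancher/k3s/audit-policy.yaml \
  --kube-apiserver-arg=audit-log-path=/var/lib/rancher/k3s/server/logs/audit.log
```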

arturp89 commented 1 year ago

So I see that in the 1.22.17 cluster I have now (after manually restarting the other two nodes), the issue is consistently reproduced by restarting the leader node. I see that the old etcd members get tombstoned and a new etcd folder is created on the two nodes (this also explains the "missing" symlink: it was simply on one of the tombstoned etcd folders that got renamed).

I enabled auditing, however I don't see any "delete" node requests or the like in the audit logs. Could you please point me to what else I could look for in the audits, as I'm somewhat at a loss here.
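For reference, a sketch of the kind of query that can be run against the audit log (assumes the audit log path from the example above and that jq is available):

```bash
# Each audit entry is one JSON object per line; filter for deletes on nodes
jq -c 'select(.verb == "delete" and .objectRef.resource == "nodes")
       | {user: .user.username, node: .objectRef.name, stage: .stage, time: .requestReceivedTimestamp}' \
  /var/lib/rancher/k3s/server/logs/audit.log
```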

Looking at the logs, they appear to be actively removed from the cluster after failed TLS handshakes:

May 28 17:35:40 node2 k3s[97762]: time="2023-05-28T17:35:40+03:00" level=info msg="Wrote kubeconfig /etc/rancher/k3s/k3s.yaml"
May 28 17:35:40 node2 k3s[97762]: time="2023-05-28T17:35:40+03:00" level=info msg="Run: k3s kubectl"
May 28 17:35:40 node2 k3s[97762]: time="2023-05-28T17:35:40+03:00" level=info msg="Cluster-Http-Server 2023/05/28 17:35:40 http: TLS handshake error from 127.0.0.1:37990: remote error: tls: bad certificate"
May 28 17:35:40 node2 k3s[97762]: time="2023-05-28T17:35:40+03:00" level=info msg="Cluster-Http-Server 2023/05/28 17:35:40 http: TLS handshake error from 127.0.0.1:37996: remote error: tls: bad certificate"
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.147+0300","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b91b8e6da856208d switched to configuration voters=(7771332075759455463 13338411323014652045)"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.147+0300","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"3ab7a4daaf7a2e44","local-member-id":"b91b8e6da856208d","removed-remote-peer-id":"1b2165dceef1eee2","removed-remote-peer-urls":["https://192.168.32.88:2380"]}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.147+0300","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"warn","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:286","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"b91b8e6da856208d","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"warn","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"b91b8e6da856208d","remote-peer-id":"1b2165dceef1eee2","error":"context canceled"}
May 28 17:35:40 node2 k3s[97762]: {"level":"warn","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"1b2165dceef1eee2","error":"failed to read 1b2165dceef1eee2 on stream MsgApp v2 (context canceled)"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"b91b8e6da856208d","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"warn","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"b91b8e6da856208d","remote-peer-id":"1b2165dceef1eee2","error":"context canceled"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"b91b8e6da856208d","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/transport.go:355","msg":"removed remote peer","local-member-id":"b91b8e6da856208d","removed-remote-peer-id":"1b2165dceef1eee2"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"b91b8e6da856208d switched to configuration voters=(13338411323014652045)"}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"3ab7a4daaf7a2e44","local-member-id":"b91b8e6da856208d","removed-remote-peer-id":"6bd951593f3dcce7","removed-remote-peer-urls":["https://192.168.32.170:2380"]}
May 28 17:35:40 node2 k3s[97762]: {"level":"info","ts":"2023-05-28T17:35:40.148+0300","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"6bd951593f3dcce7"}
brandond commented 1 year ago

Have you deployed a cloud provider integration, or anything else that might be trying to manage your nodes? I see you've disabled the default cloud controller, so you must have replaced it with something.

Can you answer this question?

brandond commented 1 year ago

Also, I see this in the logs:

May 26 16:13:44 node2 k3s[52656]: {"level":"warn","ts":"2023-05-26T16:13:44.603+0300","caller":"etcdserver/cluster_util.go:207","msg":"leader found higher-versioned member","local-member-version":"3.4.13","remote-member-id":"db71ac7cc58ff7cf","remote-member-version":"3.5.4"}
May 26 16:13:44 node2 k3s[52656]: {"level":"warn","ts":"2023-05-26T16:13:44.603+0300","caller":"etcdserver/cluster_util.go:207","msg":"leader found higher-versioned member","local-member-version":"3.4.13","remote-member-id":"9764c2e7d621b29c","remote-member-version":"3.5.4"}
May 26 16:13:48 node2 k3s[52656]: {"level":"warn","ts":"2023-05-26T16:13:48.607+0300","caller":"etcdserver/cluster_util.go:207","msg":"leader found higher-versioned member","local-member-version":"3.4.13","remote-member-id":"9764c2e7d621b29c","remote-member-version":"3.5.4"}
May 26 16:13:48 node2 k3s[52656]: {"level":"warn","ts":"2023-05-26T16:13:48.607+0300","caller":"etcdserver/cluster_util.go:207","msg":"leader found higher-versioned member","local-member-version":"3.4.13","remote-member-id":"db71ac7cc58ff7cf","remote-member-version":"3.5.4"}

You also mentioned

After the above steps finish for each of the 2 nodes. we perform the same step for the leader node.

In general you should upgrade the servers, in particular the leading servers, first. I suspect that you are breaking the etcd cluster by upgrading followers from 3.4 to 3.5 before the leader has been upgraded.
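As a sketch, the current leader of the embedded etcd cluster can be identified before upgrading (assumes a separately installed etcdctl and the default k3s certificate paths):

```bash
# The IS LEADER column shows which member currently leads; upgrade that server first
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table
```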

Additionally, 1.21 and 1.22 are both long end-of-life. 1.24 just went end-of-maintenance upstream; you should be looking at stepping through up to 1.25 at the very least.

arturp89 commented 1 year ago

About the node management: we have no active component that manages the nodes. We ship a CLI that, among other things, allows users to perform various node management actions, but it is something that is explicitly invoked, not an active component that manages the cluster in some way.

The update order is because we use jobs, also triggered via the CLI, to upgrade the cluster (similar in some ways to the k3s upgrade controller). Updating the leader node first causes the same etcd member removal, which means we can't run the jobs on the other two nodes as they enter the "NotReady" phase. I'm trying to figure out how to view the etcd logs directly, but there is no active etcd container as far as I can tell; do I have some way to reach the logs?

Ideally we would upgrade to a version higher than 1.24, but that's not feasible right now since the newer Kubernetes versions would break too much in our deployment.

brandond commented 1 year ago

Etcd runs in the main k3s process; the logs are mixed in with everything else in the files you attached.
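A rough way to pull just the embedded etcd messages out of the k3s journal (illustrative filter):

```bash
# etcd logs are structured JSON with a "caller" field pointing into etcd packages
journalctl -u k3s --since "2023-05-28 17:30" | grep -E '"caller":"(etcdserver|rafthttp|membership|embed)/'
```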

I've never seen etcd remove members from the cluster without it being manually triggered, either by a cluster-reset or by an administrative client directly requesting it. On the other hand, I've never seen someone intentionally upgrade the cluster in reverse, doing the followers first.

I would recommend that you:

arturp89 commented 1 year ago

In the end it seems like the issue was the symlink on the etcd directory (possibly causing the initial reconciliation checks to fail, or something? No idea). Removing the symlink and using the default etcd directory appears to have resolved the issue.
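For anyone hitting the same thing: rather than symlinking a subdirectory of the state tree, the whole k3s data directory can be relocated with the --data-dir option (a sketch; the target path is an example):

```bash
# Keep all of the k3s state, including server/db/etcd, on the target volume
k3s server --data-dir /data/rancher/k3s
```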

dereknola commented 1 year ago

Closing as comments indicate that the issue has been resolved. @arturp89 If you feel this is not the case, please reopen the issue.