rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Etcd Backup and Restore Breaks The Cluster #6050

Open tripunovic-uros opened 1 month ago

tripunovic-uros commented 1 month ago

Environmental Info: rke2 version v1.27.10+rke2r1 go version go1.20.13

Node(s) CPU architecture, OS, and Version: Linux x86_64 GNU/Linux

Cluster Configuration: 3 servers, 2 agents

Describe the bug: Following the guide for Restoring a Snapshot to Existing Nodes breaks the cluster after the snapshot is restored on server 1.

Steps To Reproduce: see the detailed commands in my comment below.

Expected behavior: Servers 1, 2, and 3 should work the same as after a fresh install. Both systemctl and kubectl should show that the nodes are running.

Actual behavior: Server 1 works as expected, but servers 2 and 3 are no longer working. systemctl shows that servers 2 and 3 are still starting, while kubectl shows them as Ready.
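
For reference, this is roughly how the two views were compared (a sketch assuming the default RKE2 kubeconfig and kubectl locations):

# On any server node:
systemctl status rke2-server                     # servers 2 and 3 stay in "activating (start)"
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes      # all three servers are reported as Ready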

Additional context / logs:

Server 3 logs:

● rke2-server.service - Rancher Kubernetes Engine v2 (server)
   Loaded: loaded (/usr/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2024-05-29 12:15:31 UTC; 6s ago
     Docs: https://github.com/rancher/rke2#readme
  Process: 2225358 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill (code=exited, status=0/SUCCESS)
  Process: 2225388 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
  Process: 2225387 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
  Process: 2225385 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
 Main PID: 2225389 (rke2)
    Tasks: 143
   Memory: 3.7G
   CGroup: /system.slice/rke2-server.service
           ├─2034936 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5cbfbfabff711632b861f8cf52dda8612e9baaea75b882633d1570b18311e16d -address /run/k3s/containerd/containerd.sock
           ├─2035505 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1d383817327dfa7c0eba90069fb71372cffb50d7a9814ee5cffddff2f692cf9b -address /run/k3s/containerd/containerd.sock
           ├─2036191 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 970522cf848015042f90207648b5828a99cc8c6f05eda415c31ab4eeeeba46a3 -address /run/k3s/containerd/containerd.sock
           ├─2037957 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id e6d1c8b1f906e30ba1ea399708148a059a79ccbad2a51e133d5fdbb0c0c19270 -address /run/k3s/containerd/containerd.sock
           ├─2089307 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id d10be6db6fc10e293cd97d80ff0d6c56fe2f433f339f63a0ed57b6853eb409f3 -address /run/k3s/containerd/containerd.sock
           ├─2089524 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id a8c218833bb001ca324c2960a99f12c145911d385b4e9b49def0ebd46a88fd25 -address /run/k3s/containerd/containerd.sock
           ├─2090110 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id a9c9272842f8775c653e56548851a811749f94a300468b9eb2c779e6b5913667 -address /run/k3s/containerd/containerd.sock
           ├─2090116 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1bcf16f4d5ec83bd00802f06de80e384de04eda1c728f581e4fdd6a408c2f94b -address /run/k3s/containerd/containerd.sock
           ├─2225389 /usr/bin/rke2 server
           ├─2225421 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
           └─2225544 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-log-max-files=5 --container-log-max-size=10Mi --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --feature-gates=RotateKubeletServerCertificate=true --healthz-bind-address=127.0.0.1 --hostname-override=server3.rke2 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --node-ip=192.168.0.83 --node-labels= --pod-infra-container-image=someurl/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --pod-max-pids=32768 --protect-kernel-defaults=true --read-only-port=0 --register-with-taints=node-role.kubernetes.io/etcd=true:NoExecute,node-role.kubernetes.io/control-plane:NoSchedule --resolv-conf=/etc/resolv.conf --rotate-certificates=true --serialize-image-pulls=false --streaming-connection-idle-timeout=5m --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key

May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image \"someurl/rancher/rke2-runtime:v1.27.10-rke2r1\": not found"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Kubelet image credential provider bin directory check failed: stat /var/lib/rancher/credentialprovider/bin: no such file or directory" 
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Running kubelet --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-log-max-files=5 --container-log-max-size=10Mi --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --feature-gates=RotateKubeletServerCertificate=true --healthz-bind-address=127.0.0.1 --hostname-override=server3.rke2 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --log-file=/var/lib/rancher/rke2/agent/logs/kubelet.log --log-file-max-size=50 --logtostderr=false --node-ip=192.168.0.83 --node-labels= --pod-infra-container-image=someurl/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --pod-max-pids=32768 --protect-kernel-defaults=true --read-only-port=0 --register-with-taints=node-role.kubernetes.io/etcd=true:NoExecute,node-role.kubernetes.io/control-plane:NoSchedule --resolv-conf=/etc/resolv.conf --rotate-certificates=true --serialize-image-pulls=false --stderrthreshold=FATAL --streaming-connection-idle-timeout=5m --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Tunnel server handing HTTP/1.1 GET request for /v1-rke2/connect from 127.0.0.1:58288"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Handling backend connection request [server3.rke2]"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Cluster-Http-Server 2024/05/29 12:15:37 http: TLS handshake error from 127.0.0.1:58290: remote error: tls: bad certificate"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Annotations and labels have already set on node: server3.rke2"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Server 2 has the same error messages.

Workaround:

To get the cluster working again after the restore, rke2 has to be uninstalled on servers 2 and 3 with the rke2-uninstall.sh script and then freshly installed again. After the fresh install, servers 2 and 3 join server 1 again the same way a new node would; see the sketch below.
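
A rough sketch of that recovery, assuming the standard RKE2 install script and default install paths; the server URL and token are the placeholder values from the config shown further down:

# On servers 2 and 3: remove the broken install.
/usr/local/bin/rke2-uninstall.sh

# Reinstall RKE2, pinning the version the rest of the cluster runs.
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.27.10+rke2r1" sh -

# Recreate config.yaml so the node points back at server 1 before starting the
# service (only the two join lines are shown here; keep any other settings you had).
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://rke2.server1:9345
token: my-shared-secret
EOF

systemctl enable --now rke2-server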

brandond commented 1 month ago

I cannot reproduce this. Please show the specific steps you are following to rejoin nodes 2 and 3 to the cluster after restoring the snapshot on node 1.

tripunovic-uros commented 1 month ago

I'll write out all the steps. The following command was performed on server 1:

  1. rke2 etcd-snapshot save --name test

After saving the snapshot and getting its full path, the following commands were performed on all three servers:

  1. systemctl stop rke2-server
  2. rke2-killall.sh

After making sure that all servers were down, I then performed a cluster reset on server 1 using the full snapshot path:

  1. rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
  2. systemctl start rke2-server

Once server 1 was up and running, the last commands were performed on servers 2 and 3:

  1. rm -rf /var/lib/rancher/rke2/server/db
  2. systemctl start rke2-server

I just followed the Etcd Backup and Restore guide. The only difference is that servers 2 and 3 have these two extra lines in their config.yaml file:

server: https://rke2.server1:9345
token: my-shared-secret
brandond commented 1 month ago

Ok, so where are the errors on servers 2 and 3? All I see from the brief bit of logs you posted is that they are going through the normal startup process.

tripunovic-uros commented 1 month ago

Sorry, I should've put more emphasis on the startup process. Servers 2 and 3 are stuck in the startup process and never get up and running. I'll post journalctl logs of servers 2 and 3 tomorrow, but both servers are stuck in an infinite loop of trying to get the node ready.
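
(Logs like that can be captured with something along these lines; the output file name is just an example:)

journalctl -u rke2-server --no-pager > journalctl-server2.txt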

brandond commented 1 month ago

In addition to the logs from journalctl, also check the etcd and apiserver pod logs under /var/log/pods/
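
For example, something like this (a sketch assuming the default static pod log layout; the exact directory names include the node name and pod UID):

ls /var/log/pods/ | grep -E 'etcd|kube-apiserver'
tail -n 200 /var/log/pods/kube-system_etcd-*/etcd/*.log
tail -n 200 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log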

tripunovic-uros commented 1 month ago

Here are the log files. These are also the etcd member IDs in the cluster:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.82:2379 |  5094b2a4960282f |   3.5.9 |   25 MB |     false |      false |         5 |      32249 |              32249 |        |
| https://192.168.0.83:2379 | 76e572814f3fac47 |   3.5.9 |   25 MB |     false |      false |         5 |      32249 |              32249 |        |
| https://192.168.0.81:2379 | e8319108803585bc |   3.5.9 |   39 MB |      true |      false |         5 |      32249 |              32249 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
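
(For context, a table like this can be produced on a server node with something like the following; the cert paths are the RKE2 defaults and the etcd pod name follows the node's hostname, so treat both as assumptions:)

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n kube-system exec etcd-server1.rke2 -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table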

Here are the logs: kube-apiserver.txt journalctl-server2.txt.zip etcd-server2.txt

I also found an older issue of yours, Incompletely joined nodes are not removed from etcd cluster. Could this be related?

brandond commented 1 month ago

That issue is from like 4 years ago and is closed, so I would say no, it's not related.

tripunovic-uros commented 1 month ago

Oh... I tried reading the logs but they didn't make anything clearer. Any ideas?

brandond commented 1 month ago

The pod logs both end around 2024-05-30T06:57:49 when etcd appears to have been cleanly shut down and the service stopped. The journald logs show that the service is frequently crashing on startup due to a bug that has been fixed in newer releases of rke2 - can you upgrade to the latest available release, and see if the issue persists?
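
A sketch of that upgrade, run on each server node in turn, assuming the standard install script (the channel is an example; a specific newer version can also be pinned with INSTALL_RKE2_VERSION):

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL="stable" sh -
systemctl restart rke2-server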

tripunovic-uros commented 4 weeks ago

Hi, sorry for the late reply. I won't be able to get to it immediately, but I'll get back to you as soon as possible.