tripunovic-uros opened 1 month ago
I cannot reproduce this. Please show the specific steps you are following to rejoin nodes 2 and 3 to the cluster after restoring the snapshot on node 1.
I'll write out all the steps. The following command was performed on server 1:
rke2 etcd-snapshot save --name test
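(For reference, this is how the real path can be confirmed, assuming the default data directory; rke2 etcd-snapshot list prints each stored snapshot and its location:)
rke2 etcd-snapshot list
# by default, snapshots are kept under the server's db directory
ls /var/lib/rancher/rke2/server/db/snapshots/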
After saving the snapshot and getting its real path, the following commands were performed on all three servers:
systemctl stop rke2-server
rke2-killall.sh
After making sure that all servers were down, I then performed a cluster reset with the real path on server 1:
rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
systemctl start rke2-server
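(Note: the cluster-reset command runs in the foreground and exits once the restore completes, so the service is only started afterwards. The startup can then be followed with something like:)
journalctl -u rke2-server -f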
Once server 1 was up and running, the last commands were performed on servers 2 and 3:
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server
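(Before starting servers 2 and 3, it's worth confirming server 1 is healthy; a quick check using the kubeconfig and kubectl binary that rke2 ships by default:)
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes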
I just followed the Etcd Backup and Restore guide. The only difference is that servers 2 and 3 have these two extra lines in the config.yaml file:
server: https://rke2.server1:9345
token: my-shared-secret
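(For context, a commented sketch of what those two lines mean, with the same placeholder values as above:)
# /etc/rancher/rke2/config.yaml on servers 2 and 3
server: https://rke2.server1:9345   # supervisor endpoint of server 1 (port 9345)
token: my-shared-secret             # shared join token for the cluster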
Ok, so where are the errors on servers 2 and 3? All I see from the brief bit of logs you posted is that they are going through the normal startup process.
Sorry, I should've put more emphasis on the startup process. Servers 2 and 3 are stuck on startup and never get up and running. I'll post journalctl logs of servers 2 and 3 tomorrow, but both servers are stuck in an infinite loop of getting the node ready.
In addition to the logs from journalctl, also check the etcd and apiserver pod logs under /var/log/pods/
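(On rke2 the static pod logs live under /var/log/pods/<namespace>_<pod>_<uid>/<container>/; assuming the usual kube-system pod names, something like:)
tail -n 100 /var/log/pods/kube-system_etcd-*/etcd/*.log
tail -n 100 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log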
The log files are attached below. These are also the etcd IDs in the cluster:
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.82:2379 | 5094b2a4960282f  | 3.5.9   | 25 MB   | false     | false      | 5         | 32249      | 32249              |        |
| https://192.168.0.83:2379 | 76e572814f3fac47 | 3.5.9   | 25 MB   | false     | false      | 5         | 32249      | 32249              |        |
| https://192.168.0.81:2379 | e8319108803585bc | 3.5.9   | 39 MB   | true      | false      | 5         | 32249      | 32249              |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
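(A sketch of how such a status table can be produced on an rke2 server, assuming the default etcd TLS paths:)
ETCDCTL_API=3 etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --endpoints https://127.0.0.1:2379 \
  endpoint status --cluster -w table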
Here are the logs: kube-apiserver.txt journalctl-server2.txt.zip etcd-server2.txt
I also found an older issue of yours, "Incompletely joined nodes are not removed from etcd cluster". Could this be related?
That issue is from like 4 years ago and is closed, so I would say no, it's not related.
Oh... I tried reading the logs, but they didn't make anything clearer to me. Any ideas?
The pod logs both end around 2024-05-30T06:57:49, when etcd appears to have been cleanly shut down and the service stopped. The journald logs show that the service is frequently crashing on startup due to a bug that has been fixed in newer releases of rke2 - can you upgrade to the latest available release and see if the issue persists?
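(If rke2 was installed with the install script, an in-place upgrade is normally just re-running it pinned to a newer release and restarting the service; a sketch, with the target version as a placeholder:)
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=<NEWER-RELEASE> sh -
systemctl restart rke2-server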
Hi, sorry for the late reply. I won't be able to get to it immediately, but I'll get back to you as soon as possible.
Environmental Info: rke2 version v1.27.10+rke2r1 (go version go1.20.13)
Node(s) CPU architecture, OS, and Version: Linux x86_64 GNU/Linux
Cluster Configuration: 3 servers, 2 agents
Describe the bug: Following the guide for Restoring a Snapshot to Existing Nodes breaks the cluster after the snapshot is restored on server 1.
Steps To Reproduce: Perform the snapshot save and restore steps described above, then check systemctl status rke2-server and kubectl get nodes.
Expected behavior: Servers 1, 2, and 3 should work the same as with a fresh install. Both systemctl and kubectl should show that the nodes are running.
Actual behavior: Server 1 works as expected, but servers 2 and 3 are no longer working. systemctl shows that servers 2 and 3 are still starting, while kubectl shows that they are ready.
Additional context / logs:
Server 3 logs:
Server 2 has the same error messages.
Workaround:
To make snapshot restoration work, servers 2 and 3 have to uninstall rke2 with the rke2-uninstall.sh script and then perform a fresh install. After the fresh install, servers 2 and 3 then rejoin server 1 again. A sketch of these commands follows below.
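(A sketch of that workaround as commands, run on servers 2 and 3, assuming the same config.yaml contents shown earlier:)
# wipe the existing install completely
rke2-uninstall.sh
# reinstall rke2
curl -sfL https://get.rke2.io | sh -
# recreate /etc/rancher/rke2/config.yaml with the server: and token: lines,
# then start the service to rejoin server 1
systemctl enable --now rke2-server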