mueller-tobias opened 6 months ago
We have the same issue on some downstream clusters. We opened a case with SUSE and they pointed us to https://www.suse.com/support/kb/doc/?id=000021078, but this didn't help with our issue. On one of the clusters I was able to fix it by temporarily reducing the retention to 3 and the interval to 5 minutes, then waiting until Rancher cleaned everything up.
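For anyone who wants to apply that retention/interval change from the Rancher local cluster instead of the UI, here is a rough sketch using kubectl against the provisioning Cluster object. The `fleet-default` namespace, the cluster name, and the `snapshotRetention`/`snapshotScheduleCron` field names are assumptions here - verify them against your own cluster YAML before patching:

```bash
# Sketch only: temporarily tighten etcd snapshot settings on a downstream cluster.
# Run against the Rancher *local* cluster. Inspect the object first to confirm field names:
#   kubectl -n fleet-default get clusters.provisioning.cattle.io <name> -o yaml
CLUSTER=my-downstream-cluster   # hypothetical cluster name

kubectl -n fleet-default patch clusters.provisioning.cattle.io "$CLUSTER" \
  --type=merge \
  -p '{"spec":{"rkeConfig":{"etcd":{"snapshotRetention":3,"snapshotScheduleCron":"*/5 * * * *"}}}}'

# Once Rancher has cleaned up the stale entries, patch the same fields back
# to whatever retention and cron schedule you had before.
```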
Same as @betweenclouds - I tried pretty much everything I could find, but my downstream cluster still shows 0 bytes even though the files are there locally and on S3...
I eventually got it to work. What I ended up doing is manually deleting the snapshot files (S3 and local, after backing them up somewhere else). Then I disabled etcd snapshots completely. I then deleted the "rke2-etcd-snapshots" ConfigMap and all of the ETCDSnapshotFile objects, and proceeded to reboot all my etcd and control plane nodes one after the other. I then re-enabled the snapshots with a retention of 3 every 5 minutes. I left it as is - some snapshots were still showing 0 - so I went ahead and rebooted all the etcd and control plane nodes again. After the reboots it seemed like the consolidation was finally done, and all my snapshots are showing correctly now. I re-enabled S3 and all is okay... A bit of a nuke solution, but I wasn't able to make it work any other way...
The workaround didn't help on my cluster. But thanks for the tips!
Actually - I just had that very same issue again on another downstream cluster, and it took me a while, but I finally got it to show the snapshots... Just to be clear, let me lay it out step by step.
1 - Completely deactivate etcd snapshots (S3 and local)
2 - Reboot etcd and control plane nodes one by one - etcd first, then control planes - or both at once if you run both roles on the same nodes
3 - Remove all ETCDSnapshotFile objects. Also remove all the files on S3 and in /var/lib/rancher/rke2/server/db/snapshots (you might want to keep a few as a backup in case something goes wrong; rough commands are sketched after this list). If any ETCDSnapshotFile objects are stuck deleting, strip their finalizers:
kubectl get etcdsnapshotfiles -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' | while read -r snapshot; do kubectl patch etcdsnapshotfile "$snapshot" --type='json' -p='[{"op": "remove", "path": "/metadata/finalizers"}]'; echo "Removed finalizers from $snapshot"; done
4 - Remove the "rke2-etcd-snapshots" ConfigMap in kube-system
5 - Reboot etcd and control plane nodes again
6 - Activate the snapshots locally - no S3 for now; I set a 10-minute interval and a retention of 10
7 - You should see empty snapshots again after some time. Let it run 2 or 3 snapshot schedules
8 - Again, reboot all etcd nodes, then control planes - you should see the following in your logs:
`ApplyJob HelmChart rke2-snapshot-validation-webhook Applying HelmChart using Job kube-system/helm-install-rke2-snapshot-validation-webhook`
`ApplyJob HelmChart rke2-snapshot-controller Applying HelmChart using Job kube-system/helm-install-rke2-snapshot-controller`
9 - Check that the "rke2-etcd-snapshots" ConfigMap has been recreated after the jobs finish
10 - After a few minutes you should start seeing both successful snapshots and the older empty ones
11 - Let the retention remove the old empty snapshots
12 - Once all snapshots are "successful", reactivate S3 - and from there you should be good.
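A rough sketch of the commands behind steps 3 and 4, assuming the default RKE2 data directory; the backup location is a made-up path, and you should double-check what you delete before running anything like this:

```bash
# Step 3: remove all ETCDSnapshotFile objects in the downstream cluster
kubectl delete etcdsnapshotfiles --all

# Step 3: on each etcd node, back up a few snapshot files, then remove the rest
# (default RKE2 path; adjust if you use a custom data dir; backup dir is hypothetical)
mkdir -p /root/etcd-snapshot-backup
cp /var/lib/rancher/rke2/server/db/snapshots/* /root/etcd-snapshot-backup/ 2>/dev/null
rm -f /var/lib/rancher/rke2/server/db/snapshots/*

# Step 4: remove the snapshot ConfigMap so it gets rebuilt from scratch
kubectl -n kube-system delete configmap rke2-etcd-snapshots
```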
I'm really not sure why this ends up working, but it has now worked on 3 downstream clusters. They are all configured the same, so I might have a config that differs from yours, but eventually it seems to be working.
Same issue in my single-docker Rancher lab. My downstream RKE2 cluster was created as v1.27-something (the latest RKE2 available on March 12) with Rancher 2.8.2, and on-demand backups and automatic etcd snapshots were working fine until today. I recently patched and rebooted the server running the Rancher instance and upgraded Rancher to v2.8.3. The cluster was then updated to v1.28.8+rke2r1. It's been a few days and automatic snapshots were working fine. Starting this morning, new on-demand backups and automatic etcd snapshots are "0" sized and cannot be used, even though they exist on the servers.
I've just created a new RKE2 downstream cluster (v1.28.8+rke2r1, using Rancher v2.8.3) and I have no issues with on-demand backups or automatic etcd snapshots. Also no issues after restarting Rancher container. The older cluster still shows "0" on recent and new backups.
Could be related to:
Our one-liner to patch each cluster until a fix is available:
kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}' | xargs -I {} kubectl patch lease -n kube-system --patch='{"spec":{"holderIdentity":"{}"}}' --type=merge rke2; kubectl get lease -n kube-system -o custom-columns='name:.metadata.name,holder:.spec.holderIdentity' | grep rke2
I'm having an odd issue similar to this as well. We recently upgraded 2 of our clusters from Rancher 2.7.4 to 2.8.2 and Kubernetes 1.25.9 to 1.27.13/14. Snapshots are now only visible in the Rancher UI for 1 of our 3 control nodes in each of our 4 clusters. I can check the logs on the 2 other control nodes of each cluster and see that the snapshots are being taken, but they are not showing up in the Rancher UI; they only show up for a single control node.
Additionally, if I trigger an on-demand snapshot, the UI lists 3 snapshots from a single control node, and on that node itself I see 3 "on-demand" snapshots. The 2 other control nodes do not appear to acknowledge the on-demand request.
Yeah, this has been an annoyance... The only thing that 100% fixed it for me was to upgrade to 1.28.9. I had issues with 1.26 at some point, and upgrading to 1.27 didn't change anything - same issues - but all my clusters that were upgraded to 1.28.9 self-resolved...
Thanks for the heads up! Definitely a huge annoyance for us
We still have the same issue with RKE2 1.28.9. I think it's related to https://github.com/rancher/rke2/issues/5866, and we'll have to wait for v1.28.11.
Try removing the snapshots on all etcd nodes along with the snapshot ConfigMap, disable auto snapshots, then re-enable them (without S3), wait for a few scheduled snapshots, and they should show up eventually. When I upgraded to 1.28, I had already disabled the auto snapshots and deleted the etcd snapshots on the nodes. Hopefully this resolves your issue. If you have S3 snapshots, delete them as well, but make sure to keep a few as a backup.
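If you'd rather toggle that from the Rancher local cluster than through the UI, a minimal sketch; the `fleet-default` namespace and the `disableSnapshots`/`s3` field names are assumptions, so verify them against your cluster's YAML first:

```bash
CLUSTER=my-downstream-cluster   # hypothetical cluster name

# Disable automatic etcd snapshots on the downstream cluster
kubectl -n fleet-default patch clusters.provisioning.cattle.io "$CLUSTER" \
  --type=merge -p '{"spec":{"rkeConfig":{"etcd":{"disableSnapshots":true}}}}'

# ...clean up the snapshot files and the ConfigMap as described above, then
# re-enable local snapshots only (clearing the S3 block for now, if it was set)
kubectl -n fleet-default patch clusters.provisioning.cattle.io "$CLUSTER" \
  --type=merge -p '{"spec":{"rkeConfig":{"etcd":{"disableSnapshots":false,"s3":null}}}}'
```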
@RegisHubelia thanks for your procedure (handy for other people), but I already did that and after some time we still get snapshots with 0 bytes. We have more than 50 clusters, and it's not workable to run this procedure on every cluster. It also occurs on cleanly installed new clusters with 1.28.9.
We have a cron job running this on each downstream RKE2 cluster until it's ultimately fixed.
We have a cron job iterating through each rke2 downstream cluster updating the lease objects
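For reference, a sketch of what such a cron job could look like - it just loops over downstream kubeconfigs and re-applies the lease workaround from above. The /etc/rancher-kubeconfigs directory is a made-up location; substitute however you store per-cluster kubeconfigs:

```bash
#!/usr/bin/env bash
# Sketch: re-sync the rke2 lease holder on every downstream cluster.
# Assumes one kubeconfig file per cluster in /etc/rancher-kubeconfigs (hypothetical path).
set -euo pipefail

for kubeconfig in /etc/rancher-kubeconfigs/*.yaml; do
  echo "Patching rke2 lease via $kubeconfig"
  holder=$(kubectl --kubeconfig "$kubeconfig" -n kube-system get lease rke2-etcd \
    -o jsonpath='{.spec.holderIdentity}')
  [ -n "$holder" ] || { echo "no holder found, skipping"; continue; }
  kubectl --kubeconfig "$kubeconfig" -n kube-system patch lease rke2 \
    --type=merge --patch "{\"spec\":{\"holderIdentity\":\"$holder\"}}"
done
```

Run it from cron every few minutes until the upstream fix lands.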
Yeah - I can confirm that even though it fixes it, after a while the issue is back... That workaround - kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}' | xargs -I {} kubectl patch lease -n kube-system --patch='{"spec":{"holderIdentity":"{}"}}' --type=merge rke2; kubectl get lease -n kube-system -o custom-columns='name:.metadata.name,holder:.spec.holderIdentity' | grep rke2 - seems to work. Thanks @atsai1220
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
The snapshots were working fine until we configured the additional S3 backup in the cluster configuration. Since then, the snapshots are displayed with 0B in the Cluster Management / Snapshots view.
The problem seems to be that the kube-system/rke2-etcd-snapshots ConfigMap is not updated. On the etcd nodes the snapshots are created properly, and the transfer to S3 also works properly. On an etcd node I can view the snapshots with
rke2 etcd-snapshot list --etcd-s3
and see the local ones on the node as well as the S3 snapshots. The job on the nodes runs, and I can validate in the logs that it ends with exit 0 and no errors. In the Rancher pods I have a lot of entries like this:
Here's an example of an ETCDSnapshot CRD:
When I SSH into the node, the snapshot exists at /var/lib/rancher/rke2/server/db/snapshots/on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295 (name: on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295).
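To compare what the node has on disk against what the cluster objects and the ConfigMap are tracking, something like this works as a quick check (sketch; it assumes the ConfigMap keys are the snapshot names, which you should confirm against a healthy entry):

```bash
# On an etcd node: what actually exists on disk
ls /var/lib/rancher/rke2/server/db/snapshots/

# The ETCDSnapshotFile object for that snapshot
kubectl get etcdsnapshotfile on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295 -o yaml

# Which snapshots the ConfigMap currently knows about
kubectl -n kube-system get configmap rke2-etcd-snapshots -o json | jq -r '.data | keys[]'
```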
I can fix the 0B entries when I add the snapshot data to the kube-system/rke2-etcd-snapshots ConfigMap. I get the data from rke2 etcd-snapshot list --etcd-s3 -o json, but I couldn't find any errors related to updating the ConfigMap. What service/agent is responsible for updating the ConfigMap?
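A rough sketch of that manual fix-up for a single snapshot. The JSON shapes assumed here (a list of entries with a `name` field, and ConfigMap values holding the raw JSON entry as a string) are assumptions that vary between versions, so compare against an entry that was written correctly before trusting it:

```bash
# Sketch only. Verify the output shape of `rke2 etcd-snapshot list -o json`
# and the ConfigMap value format on your cluster before running this.
SNAP=on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295   # example name from above

# Pull the entry for one snapshot from the node's own view of its snapshots.
# Adjust the jq filter to whatever shape the list output actually has.
ENTRY=$(rke2 etcd-snapshot list --etcd-s3 -o json | jq -c --arg n "$SNAP" '.[] | select(.name == $n)')
[ -n "$ENTRY" ] || { echo "snapshot $SNAP not found in list output"; exit 1; }

# Merge it into the ConfigMap under the snapshot name.
kubectl -n kube-system patch configmap rke2-etcd-snapshots --type=merge \
  -p "$(jq -cn --arg k "$SNAP" --arg v "$ENTRY" '{data: {($k): $v}}')"
```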
To Reproduce
Couldn't reproduce the issue with another cluster on IONOS.
Result
New snapshots, whether scheduled or triggered on demand, just add more 0-byte snapshots in Cluster Management.
Expected Result
New Snapshots are inventoried properly
Screenshots
Additional context