rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

[BUG] ETCD Backup for downstream cluster not properly inventoried #45141

Open mueller-tobias opened 6 months ago

mueller-tobias commented 6 months ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug

The snapshots were working fine until we configured the additional S3 backup in the cluster configuration. Since then, the snapshots are displayed with 0B in the Cluster Management / snapshot view.
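For context, the S3 part lives under the etcd section of the cluster spec. A rough sketch (bucket, endpoint and credential are placeholders; field names as in the provisioning cluster object, so double-check against your Rancher version):

apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: mycluster
  namespace: fleet-default
spec:
  rkeConfig:
    etcd:
      s3:
        bucket: my-etcd-backups                         # placeholder
        endpoint: s3.example.internal                   # placeholder
        cloudCredentialName: cattle-global-data:cc-abc123   # placeholder credential reference
        folder: mycluster
        region: us-east-1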

The problem seems to be that the kube-system/rke2-etcd-snapshots configmap is not updated. On the etcd nodes the snapshots are created properly and the transfer to S3 also works. On an etcd node I can view the snapshots with rke2 etcd-snapshot list --etcd-s3 and see both the local one on the node and the S3 snapshots. The job on the nodes runs, and I can validate in the logs that it ends with exit 0 and no error.
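For completeness, the checks run on an etcd node (a minimal sketch; the --etcd-s3 flag needs the S3 configuration to be present on the node):

rke2 etcd-snapshot list              # local snapshots on this node
rke2 etcd-snapshot list --etcd-s3    # also lists the snapshots in the configured S3 bucket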

In the Rancher pods I see a lot of entries like this:

2024/04/16 14:21:12 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024-04-16T14:21:12.195356937Z 2024/04/16 14:21:12 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024-04-16T14:21:13.153655236Z 2024/04/16 14:21:13 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024/04/16 14:21:13 [INFO] [plansecret] Deleting etcd snapshot fleet-default/mycluster-on-demand-mycluster-etcd-z1-d39f123e-xj28d-171327-0cbe1
2024-04-16T14:21:13.400236265Z 2024/04/16 14:21:13 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024-04-16T14:21:13.422853081Z 2024/04/16 14:21:13 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024/04/16 14:21:14 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024-04-16T14:21:17.159319635Z 2024/04/16 14:21:17 [INFO] [plansecret] Deleting etcd snapshot fleet-default/mycluster-on-demand-mycluster-etcd-z3-4c177a8d-wpdrt-171327-ad118
2024-04-16T14:21:17.186841408Z 2024/04/16 14:21:17 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024/04/16 14:21:17 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots
2024/04/16 14:21:18 [INFO] [snapshotbackpopulate] rkecluster fleet-default/mycluster: processing configmap kube-system/rke2-etcd-snapshots

Here's an example of an ETCDSnapshot resource:

apiVersion: rke.cattle.io/v1
kind: ETCDSnapshot
metadata:
  annotations:
    etcdsnapshot.rke.io/snapshot-file-name: on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295
    etcdsnapshot.rke.io/storage: local
  creationTimestamp: '2024-04-16T14:31:24Z'
  generation: 1
  labels:
    rke.cattle.io/cluster-name: mycluster
    rke.cattle.io/machine-id: 61aa533c8d2fcf95aabae8dcc45b93b412fcc4106da9453e3d6c6312fc56c9c
  managedFields:
    - apiVersion: rke.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:etcdsnapshot.rke.io/snapshot-file-name: {}
            f:etcdsnapshot.rke.io/storage: {}
          f:labels:
            .: {}
            f:rke.cattle.io/cluster-name: {}
            f:rke.cattle.io/machine-id: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"2c51fa1e-d281-4e7f-b48d-4a93cbf77b12"}: {}
        f:snapshotFile:
          .: {}
          f:location: {}
          f:name: {}
          f:nodeName: {}
        f:spec:
          .: {}
          f:clusterName: {}
      manager: rancher
      operation: Update
      time: '2024-04-16T14:31:24Z'
    - apiVersion: rke.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:missing: {}
      manager: rancher
      operation: Update
      subresource: status
      time: '2024-04-16T14:31:24Z'
  name: mycluster-on-demand-mycluster-etcd-z1-d39f123e-xj28d-171327-0cbe1
  namespace: fleet-default
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Machine
      name: mycluster-etcd-z1-69644679dcxjfbbd-lbzvk
      uid: 2c51fa1e-d281-4e7f-b48d-4a93cbf77b12
  resourceVersion: '47848394'
  uid: 3008cfbf-d221-4dc9-ba44-3027324a3e5c
snapshotFile:
  location: >-
    file:///var/lib/rancher/rke2/server/db/snapshots/on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295
  name: on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295
  nodeName: mycluster-etcd-z1-d39f123e-xj28d
spec:
  clusterName: mycluster
status:
  missing: true

When I SSH to the node, the snapshot exists at /var/lib/rancher/rke2/server/db/snapshots/on-demand-mycluster-etcd-z1-d39f123e-xj28d-1713271295.

I can fix the 0B entries when I add the snapshot data to the kube-system/rke2-etcd-snapshots configmap; I get the data from rke2 etcd-snapshot list --etcd-s3 -o json. But I couldn't find any errors related to updating the configmap.
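For reference, a rough sketch of that manual fix (as far as I can tell the configmap keeps one key per snapshot with a JSON value, but verify against your own cluster):

rke2 etcd-snapshot list --etcd-s3 -o json > snapshots.json        # on an etcd node: metadata for local and S3 snapshots
kubectl -n kube-system get configmap rke2-etcd-snapshots -o yaml  # on the downstream cluster: what Rancher currently sees
kubectl -n kube-system edit configmap rke2-etcd-snapshots         # add the missing entries by hand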

What service/agent is responsible for updating the ConfigMap?

To Reproduce

Couldn't reproduce the issue with another cluster on IONOS.

Result

New snapshots, whether scheduled or on demand, show up as more 0-byte entries in Cluster Management

Expected Result

New Snapshots are inventoried properly

Screenshots

(screenshot: snapshot list in Cluster Management showing 0 B entries)

Additional context

betweenclouds commented 6 months ago

We have the same issue on some downstream clusters. We opened a case with SUSE and they pointed us to https://www.suse.com/support/kb/doc/?id=000021078, but this didn't help with our issue. On one of the clusters I was able to fix it by temporarily reducing the retention to 3 and the interval to 5 min, then waiting until Rancher cleaned everything up.
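For anyone who wants to try the same thing, the temporary change was roughly this in the cluster's etcd settings (field names as in the provisioning cluster object; adjust to how you manage the cluster):

spec:
  rkeConfig:
    etcd:
      snapshotScheduleCron: '*/5 * * * *'   # temporarily every 5 minutes
      snapshotRetention: 3                  # temporarily keep only 3 snapshots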

RegisHubelia commented 6 months ago

Same as @betweenclouds - tried pretty much everything I could find, but my downstream cluster still shows 0 bytes even though the files are there locally and on S3...

RegisHubelia commented 6 months ago

I eventually got it to work. What I ended up doing is manually deleting the snapshot files (S3 and local, after backing them up somewhere else). Then I disabled etcd snapshots completely. I then deleted the configmap "rke2-etcd-snapshots" and all of the ETCDSnapshotFile objects. I then rebooted all my etcd and control plane nodes one after the other. I then re-enabled the snapshots with a retention of 3 every 5 minutes. I left it as is, and some snapshots were still showing 0, so I went ahead and rebooted all the etcd and control plane nodes again. After the reboots the consolidation finally seemed to be done and all my snapshots are showing correctly now. I re-enabled S3 and all is okay... A bit of a nuke solution, but I wasn't able to make it work any other way...

mueller-tobias commented 6 months ago

The workaround didn't help on my cluster. But thanks for the tips!

RegisHubelia commented 6 months ago

Actually, I just had that very same issue again on another downstream cluster, and it took me a while but I finally got it to show the snapshots... Just to be clear, let me do a step by step.

1 - Completely deactivate etcd snapshots (S3 and local)

2 - Reboot etcd and control plane nodes one by one, etcd first, then control planes - or both at once if you have both roles on the same nodes.

3 - Remove all ETCDSnapshotFiles. Also remove all the files on S3 and in /var/lib/rancher/rke2/server/db/snapshots (you might want to keep a few as a backup in case something goes wrong)

ApplyJob HelmChart rke2-snapshot-validation-webhook Applying HelmChart using Job kube-system/helm-install-rke2-snapshot-validation-webhook
ApplyJob HelmChart rke2-snapshot-controller Applying HelmChart using Job kube-system/helm-install-rke2-snapshot-controller

9 - Check if the "rke2-etcd-snapshots" configmap has been recreated after the jobs are finished

10 - After a few minutes, you should start seeing both - successful snapshots and older empty ones.

11 - Let the retention remove the old empty snapshots.

12 - Once all snapshots are "successful", reactivate S3 - and from there you should be good.
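Roughly, the kubectl side of steps 3 and 9 looks like this (cluster name and namespaces match the example earlier in this issue; the ETCDSnapshotFile API group can differ by version, so double-check with kubectl api-resources):

# On the downstream cluster: remove the configmap Rancher reads and the ETCDSnapshotFile objects
kubectl -n kube-system delete configmap rke2-etcd-snapshots
kubectl delete etcdsnapshotfiles.k3s.cattle.io --all

# On the Rancher management cluster: remove the ETCDSnapshot objects for this downstream cluster
kubectl -n fleet-default delete etcdsnapshots.rke.cattle.io -l rke.cattle.io/cluster-name=mycluster

# Step 9: check that the configmap comes back once the helm-install jobs have finished
kubectl -n kube-system get configmap rke2-etcd-snapshots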

I'm really not sure why this ends up working, but it has now for 3 downstream clusters. They are all configured the same, so I might have a config that differs somewhere, but eventually it seems to be working.

smirnov-mi commented 6 months ago

Same issue in my single-docker-Rancher lab. My downstream RKE2 cluster was created as v1.27-something (the latest RKE2 available on March 12) with Rancher 2.8.2; on-demand backups and automatic etcd snapshots were working fine until today. I recently patched and rebooted the server with the Rancher instance and upgraded Rancher to v2.8.3. The cluster was then updated to v1.28.8+rke2r1, and for a few days the automatic snapshots kept working fine. Starting this morning, new on-demand backups and automatic etcd snapshots are "0" sized and cannot be used, even though they exist on the servers.

I've just created a new RKE2 downstream cluster (v1.28.8+rke2r1, using Rancher v2.8.3) and I have no issues with on-demand backups or automatic etcd snapshots. Also no issues after restarting the Rancher container. The older cluster still shows "0" for recent and new backups.

atsai1220 commented 6 months ago

Could be related to:

Our one-liner to patch each cluster until a fix is available:

kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}' | xargs -I {} kubectl patch lease -n kube-system  --patch='{"spec":{"holderIdentity":"{}"}}' --type=merge rke2; kubectl get lease -n kube-system -o custom-columns='name:.metadata.name,holder:.spec.holderIdentity' | grep rke2
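Broken out, the same workaround reads roughly as:

# Node currently holding the rke2-etcd lease
HOLDER=$(kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}')

# Point the rke2 lease at the same holder (the lease-reset workaround from the SUSE KB)
kubectl patch lease rke2 -n kube-system --type=merge \
  --patch="{\"spec\":{\"holderIdentity\":\"${HOLDER}\"}}"

# Verify the holders
kubectl get lease -n kube-system -o custom-columns='name:.metadata.name,holder:.spec.holderIdentity' | grep rke2
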
k8s-doomzday commented 4 months ago

I'm having an odd issue similar to this as well. We recently upgraded 2 of our clusters from Rancher version 2.7.4 to 2.8.2 and Kubernetes from 1.25.9 to 1.27.13/14. Snapshots are now only visible in the Rancher UI for 1 of the 3 control plane nodes in each of our 4 clusters. I can check the logs on the 2 other control plane nodes in each cluster and see that the snapshots are being taken, but they do not show up in the Rancher UI; they only show up for a single control plane node.

Additionally, if I trigger an on-demand snapshot, the UI shows 3 snapshots, all from a single control plane node, and on that node itself I see 3 "on-demand" snapshots. The 2 other control plane nodes do not appear to acknowledge the on-demand request.

RegisHubelia commented 4 months ago

Yeah, this has been an annoyance... The only thing that 100% fixed it for me was to upgrade to 1.28.9. I had issues with 1.26 at some point; upgrading to 1.27 didn't change anything - same issues - but all my clusters that were upgraded to 1.28.9 self-resolved...

k8s-doomzday commented 4 months ago

> Yeah, this has been an annoyance... The only thing that 100% fixed it for me was to upgrade to 1.28.9. I had issues with 1.26 at some point; upgrading to 1.27 didn't change anything - same issues - but all my clusters that were upgraded to 1.28.9 self-resolved...

Thanks for the heads up! Definitely a huge annoyance for us

nickvth commented 4 months ago

We still have the same issue with RKE2 1.28.9. I think it's related to https://github.com/rancher/rke2/issues/5866; we'll have to wait for v1.28.11.

RegisHubelia commented 4 months ago

Try removing the snapshots on all etcd nodes and the snapshot configmap, disable auto snapshots, then re-enable them (without S3) and wait for a few scheduled snapshots; they should show up eventually. When I upgraded to 1.28 I had already disabled the auto snapshots and deleted the etcd snapshots on the nodes. Hopefully this resolves your issue. If you have S3 snapshots, delete them as well, but make sure to keep a few as a backup.

nickvth commented 4 months ago

@RegisHubelia thanks for your procedure (handy for other people), but I already did that and after some time we still get snapshots with 0 bytes. We have more than 50 clusters and it's not nice if you have to do this procedure for each cluster. It also occurs on cleanly installed new clusters with 1.28.9.

atsai1220 commented 4 months ago

We have a cron job running this on each downstream RKE2 cluster until it's ultimately fixed: "etcd snapshots showing 0B size in the Rancher UI" (suse.com).


atsai1220 commented 4 months ago

We have a cron job iterating through each RKE2 downstream cluster, updating the lease objects:

https://www.suse.com/support/kb/doc/?id=000021447
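Ours is internal, but a minimal sketch of the job, assuming one kubeconfig file per downstream cluster (the directory and file pattern are made up):

#!/usr/bin/env bash
set -euo pipefail

# Apply the lease workaround from the KB article to every downstream cluster.
for kubeconfig in /etc/kubeconfigs/*.yaml; do
  export KUBECONFIG="$kubeconfig"
  holder=$(kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}') || continue
  kubectl patch lease rke2 -n kube-system --type=merge \
    --patch="{\"spec\":{\"holderIdentity\":\"${holder}\"}}"
done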

RegisHubelia commented 4 months ago

Yeah - I can confirm, even though it fixes it, after a while the issue is back... That workaround - kubectl get lease -n kube-system rke2-etcd -o jsonpath='{.spec.holderIdentity}' | xargs -I {} kubectl patch lease -n kube-system --patch='{"spec":{"holderIdentity":"{}"}}' --type=merge rke2; kubectl get lease -n kube-system -o custom-columns='name:.metadata.name,holder:.spec.holderIdentity' | grep rke2 - seems to work. Thanks @atsai1220

Velociraptor85 commented 4 months ago

seems related