
[BUG] can't get rid of failed cluster snapshot backups on S3, seriously affecting Rancher's internal communication #45664

Open. harridu opened this issue 4 months ago

harridu commented 4 months ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug

I had configured hourly snapshots for a managed cluster, including backups to S3 (a local MinIO installation). After some months I tried to disable the S3 snapshot backup using the radio button in Rancher. That was on version 2.8.2, maybe older. The MinIO storage was then taken offline.

Since then Rancher has added 3800 error entries about broken backups on the non-existent S3 storage to the cluster spec:

:
        "state": "active",
        "message": "Resource is current"
      },
      {
        "toId": "fleet-default/extkube001-etcd-snapshot-node01.dmz.aixigo.de-1708488004-s3",
        "toType": "rke.cattle.io.etcdsnapshot",
        "rel": "owner",
        "state": "active",
        "message": "Resource is current"
      },
      {
        "toId": "fleet-default/extkube001-etcd-snapshot-node01.dmz.aixigo.de-1708617605-s3",
        "toType": "rke.cattle.io.etcdsnapshot",
:

They have reached a size of approximately 2 MB (verified using the developer tools in Google Chrome). This results in "413 Request Entity Too Large" errors and breaks several features in Rancher. In particular, I can no longer reconfigure the cluster or trigger a manual snapshot. See the attachments.
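For reference, a rough way to confirm the size of the cluster object from the command line; this is a hedged sketch, with the cluster name extkube001 and the fleet-default namespace taken from the spec excerpt above (the object the Rancher UI actually sends may differ):

    # Approximate size of the provisioning cluster object in bytes
    kubectl -n fleet-default get clusters.provisioning.cattle.io extkube001 -o json | wc -c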

I was able to edit the cluster YAML and manually remove the S3 backup configuration. Since then no new error entries are being created, but the old error entries are not deleted either, even though the oldest are 162 days old and the snapshots are supposed to be kept for only 4 days.
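The equivalent of that manual edit from the command line might look like this; a minimal sketch, assuming the S3 settings live at spec.rkeConfig.etcd.s3 of the provisioning cluster object (the field path is an assumption, not taken from this thread):

    # Drop the S3 section from the etcd snapshot configuration of the cluster
    kubectl -n fleet-default patch clusters.provisioning.cattle.io extkube001 \
      --type=json -p '[{"op":"remove","path":"/spec/rkeConfig/etcd/s3"}]'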

Expected Result

Screenshots

[screenshots attached: snapshot k3gvqk, snapshot jh3PT5, snapshot bbmhVI]

Additional context

harridu commented 3 months ago

Anybody, please? The upgrade to 2.8.5 did not make the zombie entries disappear.

harridu commented 3 months ago

@brandond , would you consider this an issue with rke2?

brandond commented 3 months ago

Not entirely, no.

RKE2 and Rancher do not assume that disabling S3, after it was previously enabled, means the snapshots are no longer available on S3. If you disable it, RKE2 stops putting new snapshots there and no longer looks on S3 when reconciling the snapshot list at startup, but at no point does it forget about snapshots that were previously sent to S3. This was not something we expected folks to want.
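To see what RKE2 itself still tracks, a hedged sketch run against the downstream cluster (resource and command names are the ones RKE2 uses for snapshot metadata, assumed here rather than quoted from this thread):

    # Snapshot metadata objects that RKE2 reconciles on startup (downstream cluster)
    kubectl get etcdsnapshotfiles.k3s.cattle.io

    # Or, directly on a server node of the downstream cluster
    rke2 etcd-snapshot list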

If you want to get rid of all the S3 snapshots, you should manually delete them prior to disabling S3. Otherwise we assume that they are still out there.
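What that manual deletion could look like with MinIO's mc client, as a hedged sketch (the alias and bucket names are placeholders, not from this thread):

    # Remove the uploaded etcd snapshots from the bucket before disabling S3
    mc rm --recursive --force myminio/etcd-backup/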

We could probably work on some sort of enhancement to allow disabling S3 and also forgetting about all the snapshots that were uploaded there, but that's not currently possible.

harridu commented 2 months ago

> If you want to get rid of all the S3 snapshots, you should manually delete them prior to disabling S3. Otherwise we assume that they are still out there.

@brandond, I am sure this is very well documented, but the problem is that the S3 storage the backups were copied to doesn't exist anymore. It was an on-prem MinIO server in the same subnet, but it's gone now. Not to mention that I get error messages about a giant list of 0-byte backups that were not copied to S3 and that don't exist on the local file system anymore either.
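For reference, the stale records in question can be listed with kubectl; the resource name below is assumed from the rke.cattle.io.etcdsnapshot type shown in the spec excerpt at the top of this issue, and the names ending in -s3 are the S3 copies:

    # Per-snapshot records Rancher keeps for this cluster; the S3 ones end in "-s3"
    kubectl -n fleet-default get etcdsnapshots.rke.cattle.io | grep extkube001 | grep -- '-s3'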

> We could probably work on some sort of enhancement to allow disabling S3 and also forgetting about all the snapshots that were uploaded there, but that's not currently possible.

Is there some way to manually clean up this mess? As written above, I get error messages all over the place for this cluster.

brandond commented 2 months ago

You could try enabling S3 against another empty bucket, and then allowing it to sync the snapshot list? When it sees that S3 is enabled but there are no snapshots there, it should clean everything up.

I know RKE2 will do that at least; I am not sure about the Rancher side.
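A hedged sketch of that workaround from the command line; the bucket name, alias and endpoint are placeholders, a real setup typically also needs credentials (e.g. a cloudCredentialName) and TLS settings, and the same change can be made through the Rancher UI instead:

    # Create a fresh, empty bucket for the cleanup run
    mc mb myminio/etcd-cleanup

    # Point the cluster's snapshot config at the empty bucket so the next
    # reconcile sees S3 enabled with no snapshots in it
    kubectl -n fleet-default patch clusters.provisioning.cattle.io extkube001 --type=merge \
      -p '{"spec":{"rkeConfig":{"etcd":{"s3":{"bucket":"etcd-cleanup","endpoint":"minio.example.internal:9000"}}}}}'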

harridu commented 2 months ago

@brandond, I don't get that far. When I try to enable S3 again, I get an error in Rancher as soon as I click [Save] (screenshot: snapshot GZMUy2). Shouldn't it clean up locally and look for remaining snapshots to copy first, before connecting to S3?

harridu commented 2 months ago

@brandond, as you suggested I have configured a new S3 bucket for backups. The red error popup in Rancher is gone and I can save the cluster config again. Most of the bad backups from the last 216 days went away, but there are still a few hundred 0-byte entries on the snapshots page, all between 53 and 59 days old. The error message for these entries is "connection refused".

The most recent snapshots on S3 are listed in the MinIO browser, but on the Rancher snapshots page they still show up in red with 0 bytes and ECONNREFUSED as well. I don't have the impression that this works as expected yet. Is there really no other way to clean up than configuring an unwanted S3 storage? I have to disable S3 backups, and I am concerned I will run into the same problem again.

BTW, RKE2 is at version 1.28.11 now.