vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

OADP expired backup delete requests keep repeating over and over #6549

Open Chin2691 opened 1 year ago

Chin2691 commented 1 year ago

Describe the problem/challenge you have
Backups that have expired and are in the "FailedValidation" phase, because the backup storage location no longer exists, are failing to be deleted. Velero keeps re-adding these delete requests to the backup "delete queue" and retrying the deletion non-stop, in an infinite loop.

Describe the solution you'd like
Backups in the "FailedValidation" phase because the storage location "aws" doesn't exist, and that are also 60+ days old, should be deleted automatically. Today Velero keeps trying to delete these failed-validation backups over and over in a never-ending loop.

time="2023-06-21T07:55:18Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=gc error="error getting backup storage location: BackupStorageLocation.velero.io \"aws\" not found" error.file="/remote-source/velero/app/pkg/controller/gc_controller.go:165" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*gcController).processQueueItem" key=openshift-adp/ose-gitstarted-dev-hourly-20230408130721 logSource="/remote-source/velero/app/pkg/controller/generic_controller.go:140"
time="2023-06-21T07:55:19Z" level=info msg="Backup has expired" backup=openshift-adp/vlp-policy-test-dev-hourly-20230406082118 controller=gc expiration="2023-04-13 08:21:18 +0000 UTC" logSource="/remote-source/velero/app/pkg/controller/gc_controller.go:145"
time="2023-06-21T07:55:19Z" level=warning msg="Backup cannot be garbage-collected because backup storage location aws does not exist" backup=openshift-adp/vlp-policy-test-dev-hourly-20230406082118 controller=gc expiration="2023-04-13 08:21:18 +0000 UTC" logSource="/remote-source/velero/app/pkg/controller/gc_controller.go:157"
time="2023-06-21T07:55:19Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=gc error="error getting backup storage location: BackupStorageLocation.velero.io \"aws\" not found" error.file="/remote-source/velero/app/pkg/controller/gc_controller.go:165" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*gcController).processQueueItem" key=openshift-adp/vlp-policy-test-dev-hourly-20230406082118 logSource="/remote-source/velero/app/pkg/controller/generic_controller.go:140"
time="2023-06-21T07:55:19Z" level=info msg="Backup has expired" backup=openshift-adp/cpm-mtb-infrastructure-performance-analytics-dev-hourly-20230406161419 controller=gc expiration="2023-04-13 16:14:19 +0000 UTC" logSource="/remote-source/velero/app/pkg/controller/gc_controller.go:145"
time="2023-06-21T07:55:19Z" level=warning msg="Backup cannot be garbage-collected because backup storage location aws does not exist" backup=openshift-adp/cpm-mtb-infrastructure-performance-analytics-dev-hourly-20230406161419 controller=gc expiration="2023-04-13 16:14:19 +0000 UTC" logSource="/remote-source/velero/app/pkg/controller/gc_controller.go:157"
time="2023-06-21T07:55:19Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=gc error="error getting backup storage location: BackupStorageLocation.velero.io \"aws\" not found" error.file="/remote-source/velero/app/pkg/controller/gc_controller.go:165" error.function="github.com/vmware-tanzu/velero/pkg/controller.(*gcController).processQueueItem" key=openshift-adp/cpm-mtb-infrastructure-performance-analytics-dev-hourly-20230406161419 logSource="/remote-source/velero/app/pkg/controller/generic_controller.go:140"
time="2023-06-21T07:55:19Z" level=info msg="Backup has expired" backup=openshift-adp/nao-cdr-microservice-api-dev-hourly-20230407231220 controller=gc expiration="2023-04-14 23:12:20 +0000 UTC" logSource="/remote-source/velero/app/pkg/controller/gc_controller.go:145"

Backup with Validation failed: [screenshot]

Anything else you would like to add:

Environment:

Velero version (use velero version): OADP operator 1.1.5, Velero v1.9.5
Kubernetes version: 1.23
Kubernetes installer & version: OpenShift 4.10
Cloud provider or hardware configuration: vSphere (VMware) 6.7 Update 3
OS (e.g. from /etc/os-release): the CoreOS version shipped with OpenShift 4.10.55

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

sseago commented 1 year ago

The problem with deleting them automatically is that if the BSL is only temporarily missing (e.g. it was deleted and will be recreated, or it's currently inaccessible for some network-related reason), then deleting the backup from the cluster won't actually clean up the backup data itself, since the BSL is inaccessible. Then, if the BSL returns later, the backup will be re-created in the cluster. There may be scenarios where it's desirable to delete the backup locally even though Velero can't do the cleanup, but there may be others where this is not wanted.

As a workaround, you could manually delete the backups. In terms of any future velero changes to delete backups without BSL access (ignoring the rest of the cleanup), we'd probably want to make this configurable via a velero server arg if we decided to implement it.
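For illustration, here is a minimal sketch of that manual workaround using client-go's dynamic client; the namespace (openshift-adp) and the backup name are taken from the logs above and are only examples. It simply removes the Backup CR from the cluster, with no object-storage cleanup, since the BSL is missing anyway:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GVR for Velero Backup custom resources.
	backups := schema.GroupVersionResource{Group: "velero.io", Version: "v1", Resource: "backups"}

	// Example backup taken from the logs above; adjust name/namespace as needed.
	name := "vlp-policy-test-dev-hourly-20230406082118"
	if err := client.Resource(backups).Namespace("openshift-adp").
		Delete(context.TODO(), name, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("deleted Backup CR:", name)
}
```

The same effect can be had from the command line by deleting the Backup CR directly with kubectl or oc, rather than going through the normal delete-request flow.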

reasonerjt commented 1 year ago

These deletions are initiated by the GC controller, which tries to delete the backup. We may want to introduce a "max-retry" setting for the gc_controller and have it count the retries; once a backup has failed to be deleted by the GC controller that many times, we can delete the CR directly, either by introducing a new field in the backup deletion request CR or by deleting it in the GC controller directly.

We probably don't need to worry about orphaned data in the bucket, because the CR will be re-synced by the sync controller.
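As a rough, self-contained sketch of that max-retry idea (illustrative names only, not the actual gc_controller code): count failed GC attempts per backup key and stop re-queueing once the budget is exhausted, at which point the Backup CR could be deleted directly.

```go
package main

import "fmt"

// Hypothetical retry budget for the GC controller; in a real change this
// would likely come from a velero server flag.
const maxGCRetries = 5

// gcRetryTracker counts failed garbage-collection attempts per backup key.
type gcRetryTracker struct {
	counts map[string]int
}

func newGCRetryTracker() *gcRetryTracker {
	return &gcRetryTracker{counts: map[string]int{}}
}

// shouldGiveUp records one more failure for the given key and reports
// whether the retry budget is exhausted.
func (t *gcRetryTracker) shouldGiveUp(key string) bool {
	t.counts[key]++
	return t.counts[key] >= maxGCRetries
}

func main() {
	tracker := newGCRetryTracker()
	key := "openshift-adp/vlp-policy-test-dev-hourly-20230406082118"

	for attempt := 1; attempt <= maxGCRetries+1; attempt++ {
		if tracker.shouldGiveUp(key) {
			fmt.Printf("attempt %d: retry budget exhausted for %s, delete the Backup CR directly\n", attempt, key)
			break
		}
		fmt.Printf("attempt %d: BSL missing, re-queueing %s\n", attempt, key)
	}
}
```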

sseago commented 1 year ago

Also, note that from those messages, backup deletion requests are not actually being created. The log warnings are telling you that Velero is not creating a DeleteBackupRequest because the backup storage location doesn't exist, so it can't delete the backup contents.

Chin2691 commented 2 months ago

@reasonerjt any updates on this issue?