vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Unable to delete failed backups #4626

Closed brovoca closed 5 months ago

brovoca commented 2 years ago

What steps did you take and what happened:

Backups ran normally until we added a lot of small files to our PVs. Since then, the AKS nodes OOMKill Restic while the jenkins schedule runs. The OOMKill issue is being investigated with Azure, but in the meantime the affected backups cannot be deleted. For example, I attempted to delete jenkins-20220205003047, but it is stuck in the Deleting state. Restarting the velero deployment and the restic daemonset did not help.

~ >>> velero get backups
NAME                        STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
jenkins-20220209003046      PartiallyFailed   4        0          2022-02-09 01:30:46 +0100 CET   29d       default            <none>
jenkins-20220208003039      PartiallyFailed   4        0          2022-02-08 01:30:39 +0100 CET   28d       default            <none>
jenkins-20220207163813      PartiallyFailed   4        0          2022-02-07 16:38:13 +0100 CET   28d       default            <none>
jenkins-20220207003049      PartiallyFailed   4        0          2022-02-07 01:30:49 +0100 CET   27d       default            <none>
jenkins-20220206003048      PartiallyFailed   4        0          2022-02-06 01:30:48 +0100 CET   26d       default            <none>
jenkins-20220205003047      Deleting          4        0          2022-02-05 01:30:47 +0100 CET   25d       default            <none>
jenkins-20220204003046      PartiallyFailed   2        0          2022-02-04 01:30:46 +0100 CET   24d       default            <none>
jenkins-20220203003045      PartiallyFailed   2        0          2022-02-03 01:30:45 +0100 CET   23d       default            <none>
jenkins-20220202003044      Completed         0        0          2022-02-02 01:30:44 +0100 CET   22d       default            <none>
...
monitoring-20220203000045   Completed         0        0          2022-02-03 01:00:45 +0100 CET   15h       default            <none>
monitoring-20220202000044   Deleting          0        0          2022-02-02 01:00:44 +0100 CET   8h ago    default            <none>
monitoring-20220201000043   Deleting          0        0          2022-02-01 01:00:43 +0100 CET   1d ago    default            <none>
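
To pick out the stuck backups from a long listing like the one above, a small script can filter on the STATUS column. This is only a minimal sketch: the helper name backups_in_status and the SAMPLE text are made up for illustration, and it assumes the plain column layout shown here (NAME first, STATUS second), not any structured output from Velero.

```python
# Hypothetical helper: filters the text table printed by `velero get backups`.
# SAMPLE is a shortened copy of the listing above, used only for illustration.
SAMPLE = """\
NAME                        STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
jenkins-20220206003048      PartiallyFailed   4        0          2022-02-06 01:30:48 +0100 CET   26d       default            <none>
jenkins-20220205003047      Deleting          4        0          2022-02-05 01:30:47 +0100 CET   25d       default            <none>
monitoring-20220202000044   Deleting          0        0          2022-02-02 01:00:44 +0100 CET   8h ago    default            <none>
"""


def backups_in_status(table_text, status):
    """Return backup names whose STATUS column equals `status`."""
    names = []
    # Skip the header row; NAME and STATUS are the first two
    # whitespace-separated fields on every data row.
    for line in table_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == status:
            names.append(fields[0])
    return names


print(backups_in_status(SAMPLE, "Deleting"))
```

Rows that do not have at least two fields, such as the "..." separator in the listing, are skipped.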

What did you expect to happen:

Backup is deleted :-)

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help

jenkins-20220209003046-bundle-2022-02-09-09-43-38.tar.gz

The backups monitoring-20220202000044 and monitoring-20220201000043 completed successfully, yet they also fail to delete. I was unable to obtain a bundle for those two or for jenkins-20220205003047, since their logs have already been deleted:

~/tmp >>> velero debug --backup monitoring-20220201000043
2022/02/09 09:43:21 Collecting velero resources in namespace: velero
2022/02/09 09:43:22 Collecting velero deployment logs in namespace: velero
2022/02/09 09:43:23 Collecting log and information for backup: monitoring-20220201000043
An error occurred: exec failed: Traceback (most recent call last):
  velero-debug-collector:27:20: in <toplevel>
  velero-debug-collector:7:22: in capture_backup_logs
  <builtin>: in capture

Environment:

Client:
        Version: v1.7.0
        Git commit: 9e52260568430ecb77ac38a677ce74267a8c2176
Server:
        Version: v1.7.1
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-18T19:30:35Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

blackpiglet commented 2 years ago

It seems Restic backups are incremental, so they need the previous backups to work. I'm not sure whether this is what prevents the backups from being deleted.

brovoca commented 2 years ago

@blackpiglet Some of our restic backups can be deleted, others cannot. Also, I know that borg backup has no problem deleting any backup, even when all of them are incremental.
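
To illustrate why an incremental backup tool can still delete any single backup, here is a toy model of a deduplicating, content-addressed store. It is a sketch under assumptions, not how restic or borg are actually implemented: the ChunkStore class and its methods are invented for this example. The idea is that each snapshot lists the chunks it needs, shared chunks are stored once, and deleting a snapshot only discards chunks that no remaining snapshot references.

```python
import hashlib


class ChunkStore:
    """Toy deduplicating store (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}     # digest -> chunk bytes, stored once
        self.snapshots = {}  # snapshot name -> list of chunk digests

    def backup(self, name, blobs):
        digests = []
        for blob in blobs:
            digest = hashlib.sha256(blob).hexdigest()
            # Unchanged data hashes to an existing digest, so only new
            # chunks take up space: this is the "incremental" part.
            self.chunks.setdefault(digest, blob)
            digests.append(digest)
        self.snapshots[name] = digests

    def delete(self, name):
        del self.snapshots[name]
        # Keep every chunk still referenced by a remaining snapshot.
        live = {d for ds in self.snapshots.values() for d in ds}
        for digest in list(self.chunks):
            if digest not in live:
                del self.chunks[digest]

    def restore(self, name):
        return [self.chunks[d] for d in self.snapshots[name]]


store = ChunkStore()
store.backup("day1", [b"config", b"jobs-v1"])
store.backup("day2", [b"config", b"jobs-v2"])  # reuses the b"config" chunk
store.delete("day1")                           # drops only b"jobs-v1"
print(store.restore("day2"))                   # [b'config', b'jobs-v2']
print(len(store.chunks))                       # 2
```

In this model the later snapshot stays fully restorable after the earlier one is deleted, because no snapshot depends on another snapshot, only on chunks.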

Lyndon-Li commented 5 months ago

Closing this issue because it concerns an old version of Velero and has had no updates for a long time. Please try the latest version of Velero with the Kopia path; feel free to reopen the issue if the problem can be reproduced.