vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Force deletion of backups after X time #6091

Open · GeiserX opened this issue 1 year ago

GeiserX commented 1 year ago

Describe the problem/challenge you have

I routinely have to clear up expired backups that failed to be removed, usually because of a PartiallyFailed state, with some ugly code like velero get backup | grep ago | awk '{ print $1 }' | tr '\n' ' ' | xargs velero delete backup --confirm

Describe the solution you'd like

A hard expiry should be configurable through a parameter. So, for instance, clear all Velero backups after one week; if that fails, force delete with a kubectl delete backup ..., roughly as sketched below.
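
Purely as a hedged sketch of what that workaround looks like when automated (the ' ago' match on the EXPIRY column, the -r/-n1 xargs flags, and the velero namespace are my assumptions, not part of the request):

    # List backups, keep only rows whose EXPIRY column already reads "... ago",
    # and delete each one through the Velero CLI.
    velero backup get | grep ' ago' | awk '{ print $1 }' \
      | xargs -r -n1 velero backup delete --confirm
    # For backups that survive even this, the request is effectively asking Velero to
    # automate the manual fallback:  kubectl -n velero delete backup <name>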

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

danfengliu commented 1 year ago

Hi @DrumSergio,

Do you mean that backups in the "PartiallyFailed" phase will stay in the backup list forever without manual intervention? Do the following two configs help in your situation? --garbage-collection-frequency has been available since v1.9.

velero backup/schedule --ttl <duration>                       How long before the backup can be garbage collected.
velero install --garbage-collection-frequency <duration>      How often garbage collection runs for expired backups (default 1h).
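
For example, the two flags can be combined like this (the schedule name, cron expression, 72h TTL, and 30m frequency below are just illustrative values, and the other required install flags are omitted):

    # Backups created by this schedule become eligible for garbage collection after 72 hours.
    velero schedule create nightly --schedule="0 2 * * *" --ttl 72h0m0s
    # Run the expired-backup garbage collection every 30 minutes instead of the default 1h
    # (provider, bucket, and credential flags omitted here).
    velero install --garbage-collection-frequency 30m ...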

If expired backups are not removed when GC runs, we need to dig into that issue first; then we can think about new handling for this requirement. Could you confirm my question first?

GeiserX commented 1 year ago

Hi @danfengliu Thanks for your prompt response.

Yes, we are using the TTL option, and it is usually respected. But on several clusters we sometimes have Completed backups which are expired (the TTL has been exceeded, and the EXPIRY column shows something like 10d ago). When I try to delete them using the Velero CLI, they are sometimes left behind, so I have to resort to kubectl delete backup .... And sometimes they are stubborn, still appearing when interacting with the Velero CLI.

We are also using the default garbage-collection frequency provided in the Velero chart (which I have checked, and it is 1h by default); I did not know about this setting before. But we routinely have a lot of backups dangling forever because of PartiallyFailed, Failed, and similar states, with error messages I don't remember right now.

If you want me to help you debug this, give me some instructions and let's wait a while for them to appear again.

danfengliu commented 1 year ago

Thanks for the detailed feedback! As we know, using the "kubectl delete backup" CLI is not good practice for deleting Velero backups, because it leaves orphaned data in the object store. Could you provide the Velero version and basic information about the plugins, cluster, and object store? It would also be better to have the Velero server pod logs triggered by the velero delete CLI operation that failed to delete some of the target backups. While we wait for this information, I will try to reproduce the issue first.
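
A minimal sketch of how that information could be gathered, assuming the default velero namespace and the standard velero Deployment (the backup name is a placeholder):

    # Client and server versions.
    velero version
    # Plugins are installed as init containers on the server Deployment.
    kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.initContainers[*].image}'
    # Reproduce the failing deletion, then capture the server logs around it.
    velero backup delete <stuck-backup-name> --confirm
    kubectl -n velero logs deploy/velero --since=15m > velero-server.log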

danfengliu commented 1 year ago

Hi @DrumSergio, here are some tips on Velero backup management:

  1. "kubectl delete backup" CLI is not a good practice, Velero server will keep sync local deleted backups from object store to local again periodly, because "kubectl delete backup" CLI is out of control of Velero, Velero backup deletion process will still running as it's own logic;
  2. Anytime we fail to delete backups by Velero CLI, we should identify the root cause of the failures, it maybe caused by credential problem, Velero bugs or others, any kind of problem should be solved instead of deleting them forcibly. So requirement for deleting backups forcibly( ignore errors) is not in a strong position with current known issues.
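
As a rough illustration of point 2 (assuming the default velero namespace, with resource names as placeholders), the supported deletion path leaves a DeleteBackupRequest behind whose status generally points at the root cause, which is usually more useful than falling back to kubectl delete backup:

    # velero backup delete creates a DeleteBackupRequest that the server processes asynchronously.
    velero backup delete <backup-name> --confirm
    # If the backup lingers, inspect the request instead of deleting the Backup object directly;
    # its status typically names the credential or plugin problem that needs fixing.
    kubectl -n velero get deletebackuprequests
    kubectl -n velero describe deletebackuprequest <request-name>
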
GeiserX commented 1 year ago

Thanks @danfengliu. Perhaps this could be off by default; I would really like to have this option available in my cluster. Meanwhile, I'll investigate these backups more thoroughly to find out what's happening.