vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

CSI Artifacts are not patched in object store after Finalizing Phase #7979

Open anshulahuja98 opened 4 months ago

anshulahuja98 commented 4 months ago

What steps did you take and what happened:

In the finalizing phase today, the backup controller re-uploads the backup tarball (https://github.com/vmware-tanzu/velero/blob/1ec52beca80975f74f9ed28d6f9c5f7afe67edee/pkg/backup/backup.go#L756), but it does not update the CSI-related artifacts in the object store, i.e. the CSI gzip files containing the VolumeSnapshotContent, VolumeSnapshot, etc.
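To make the gap concrete, here is a minimal, hypothetical sketch (not Velero's actual code): the finalizing phase re-uploads only the tarball, while the recreated CSI objects would need a similar re-serialize-and-re-upload step. The objectStore interface, function names and object keys below are assumptions made for illustration only.

```go
package sketch

import (
	"bytes"
	"compress/gzip"
	"context"
	"encoding/json"
)

// objectStore is a stand-in for Velero's backup store abstraction (assumed for this sketch).
type objectStore interface {
	PutObject(ctx context.Context, key string, body []byte) error
}

// persistFinalizedBackup mirrors what happens today: after the finalizing
// phase only the backup tarball is re-uploaded.
func persistFinalizedBackup(ctx context.Context, store objectStore, backupName string, tarball []byte) error {
	return store.PutObject(ctx, backupName+"/"+backupName+".tar.gz", tarball)
}

// persistCSIArtifacts is the missing step this issue describes: the recreated
// VolumeSnapshotContent objects would need to be re-serialized and re-uploaded
// so the CSI artifact gzip in the object store matches the cluster state after
// finalization.
func persistCSIArtifacts(ctx context.Context, store objectStore, backupName string, volumeSnapshotContents []any) error {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if err := json.NewEncoder(gz).Encode(volumeSnapshotContents); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}
	return store.PutObject(ctx, backupName+"/"+backupName+"-csi-volumesnapshotcontents.json.gz", buf.Bytes())
}
```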

In the CSI plugin's BIAv2 implementation, Velero cleans up the VolumeSnapshot and recreates the VolumeSnapshotContent after the backup enters the finalizing phase: https://github.com/vmware-tanzu/velero/blob/28d64c2c529f33510a68200c129012a163777a67/pkg/util/csi/volume_snapshot.go#L633-L636
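For readers unfamiliar with that flow, the following is a simplified sketch of the general cleanup/recreate pattern (not a copy of Velero's implementation): the namespaced VolumeSnapshot is removed and the VolumeSnapshotContent is recreated as a pre-provisioned object bound to the existing snapshot handle, so the backed-up snapshot survives in the storage backend. The function name and the external-snapshotter v7 client are assumptions for this example.

```go
package csisketch

import (
	"context"
	"fmt"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v7/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v7/clientset/versioned"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func cleanupAndRecreate(ctx context.Context, c snapshotclient.Interface, vs *snapshotv1.VolumeSnapshot, vsc *snapshotv1.VolumeSnapshotContent) error {
	if vsc.Status == nil || vsc.Status.SnapshotHandle == nil {
		return fmt.Errorf("VolumeSnapshotContent %s has no snapshot handle", vsc.Name)
	}

	// Remove the namespaced VolumeSnapshot created during the backup.
	if err := c.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(ctx, vs.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// Remove the original VolumeSnapshotContent; in a real flow its deletion
	// policy would first be patched to Retain so the underlying snapshot is
	// kept (omitted here for brevity).
	if err := c.SnapshotV1().VolumeSnapshotContents().Delete(ctx, vsc.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// Recreate the VolumeSnapshotContent as a pre-provisioned (static) object
	// bound to the existing snapshot handle. This recreated object is what
	// never makes it back into the object store, per this issue.
	recreated := &snapshotv1.VolumeSnapshotContent{
		ObjectMeta: metav1.ObjectMeta{Name: vsc.Name},
		Spec: snapshotv1.VolumeSnapshotContentSpec{
			Driver:         vsc.Spec.Driver,
			DeletionPolicy: snapshotv1.VolumeSnapshotContentRetain,
			Source: snapshotv1.VolumeSnapshotContentSource{
				SnapshotHandle: vsc.Status.SnapshotHandle,
			},
			VolumeSnapshotRef: corev1.ObjectReference{
				Name:      vs.Name,
				Namespace: vs.Namespace,
			},
		},
	}
	_, err := c.SnapshotV1().VolumeSnapshotContents().Create(ctx, recreated, metav1.CreateOptions{})
	return err
}
```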

Because of this behavioural gap in Velero, the object store is never updated with the recreated VolumeSnapshotContent, since the CSI artifacts are not re-uploaded.

This has led to other behavioural issues in Velero, as highlighted in issue #7978.

What did you expect to happen:

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

reasonerjt commented 4 months ago

@anshulahuja98 @blackpiglet This is essentially the reason for #7978, right?
I see we are discussing whether we can skip uploading the VSC to the BSL and modify the deletion/restore process. If we can reach an agreement, this is a good candidate for v1.15 IMO.
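To illustrate what "not depending on the VSC in the BSL" could look like on the deletion side, here is a hypothetical sketch of one possible direction (not an agreed design, and not existing Velero code): a throwaway pre-provisioned VolumeSnapshotContent is built from a snapshot handle recorded with the backup, rather than from the VSC gzip in the object store, and deleted with a Delete policy so the CSI driver removes the underlying snapshot. The function name, namespace and flow are assumptions.

```go
package deletionsketch

import (
	"context"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v7/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v7/clientset/versioned"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deleteSnapshotByHandle creates a static VolumeSnapshotContent bound to the
// recorded snapshot handle and then deletes it; with DeletionPolicy=Delete the
// CSI driver removes the snapshot in the storage backend, with no need for the
// (possibly stale) VSC artifact from the BSL.
func deleteSnapshotByHandle(ctx context.Context, c snapshotclient.Interface, name, driver, snapshotHandle string) error {
	vsc := &snapshotv1.VolumeSnapshotContent{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: snapshotv1.VolumeSnapshotContentSpec{
			Driver:         driver,
			DeletionPolicy: snapshotv1.VolumeSnapshotContentDelete,
			Source: snapshotv1.VolumeSnapshotContentSource{
				SnapshotHandle: &snapshotHandle,
			},
			VolumeSnapshotRef: corev1.ObjectReference{
				Name:      name,
				Namespace: "velero", // assumed namespace for illustration
			},
		},
	}
	if _, err := c.SnapshotV1().VolumeSnapshotContents().Create(ctx, vsc, metav1.CreateOptions{}); err != nil {
		return err
	}
	return c.SnapshotV1().VolumeSnapshotContents().Delete(ctx, name, metav1.DeleteOptions{})
}
```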

anshulahuja98 commented 4 months ago

Yes @reasonerjt, this is just the root-cause bug item.

And yes, I am in favour of removing the dependency on the VSC for the various flows; we can plan for 1.15.

anshulahuja98 commented 4 months ago

Link to another explanation of the issue: https://github.com/vmware-tanzu/velero/issues/7978#issuecomment-2222257681

reasonerjt commented 2 months ago

Per discussion, the effort to resolve this one is relatively large, so I want to propose deferring it.

haslersn commented 1 month ago

This issue might be related to the following problem:

We have multiple K8s clusters, each with a local CephFS and Velero installed. We perform hourly backups without data movement (but with CSI snapshots) and daily backups with data movement. The Velero instances can see (in S3) the backups from the other Velero instances. This leads to the following problem:

Velero sees an hourly backup from a different Velero instance in S3 and (because this is a backup without data movement) tries to create a corresponding VolumeSnapshotContent in the local cluster. This fails because the snapshot is only accessible from the cluster where the backup was taken.

This leads to a huge number of pending VolumeSnapshotContents, which also impacts reconciliation of other "legit" snapshot operations in a denial-of-service fashion.