vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero v1.12.3 Fail to Ignore Resources in Terminating Phase #7777

Open nwakalka opened 4 months ago

nwakalka commented 4 months ago

What steps did you take and what happened:

What did you expect to happen:

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

[root@runner-jgnwu6xf-project-14702-concurrent-0 tmp]# kubectl exec -it mcs-velero-69b6f59bdc-tr7p9 -c mcs-velero -n mcs-backup -- /velero backup logs cb-e2e-klu-tgphvd --insecure-skip-tls-verify|grep level=error
time="2024-04-26T09:55:26Z" level=error msg="Error backing up item" backup=mcs-backup/cb-e2e-klu-tgphvd error="error getting persistent volume claim for volume: persistentvolumeclaims \"e2eapp-pv-claim-new\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:218" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=new-label-app-845dbc7d96-t7h46
time="2024-04-26T09:55:27Z" level=error msg="Error backing up item" backup=mcs-backup/cb-e2e-klu-tgphvd error="error getting persistent volume claim for volume: persistentvolumeclaims \"e2eapp-pv-claim\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:218" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=label-app-585cccb667-tjbtn
time="2024-04-26T09:55:28Z" level=error msg="Error backing up item" backup=mcs-backup/cb-e2e-klu-tgphvd error="error getting persistent volume claim for volume: persistentvolumeclaims \"e2eapp-pv-claim-new\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:218" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=new-label-app-845dbc7d96-t7h46
[root@runner-jgnwu6xf-project-14702-concurrent-0 tmp]# kubectl exec -it mcs-velero-69b6f59bdc-tr7p9 -c mcs-velero -n mcs-backup -- /velero backup describe cb-e2e-klu-tgphvd

Anything else you would like to add:

Steps Taken:

What Happened:

The cluster backup was initiated while certain resources, including a namespace and its associated pod, were still in the terminating phase. Velero proceeded with the backup and attempted to resolve the resources. It successfully identified the PV mount associated with the pod, but failed when attempting to retrieve the PVC referenced by the PV, because the namespace to which the PVC belonged had already finished terminating by that time. As a result, Velero was unable to back up the PVC, leading to potential inconsistencies in the backup data.

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the upper right of this comment to vote.

qiuming-best commented 4 months ago

If you back up a namespace whose resources are being deleted, the error reported by Velero is expected; we should not ignore these errors

blackpiglet commented 4 months ago

I agree with @qiuming-best. The reason is that Velero cannot understand the dependencies between k8s resources. In most cases, Velero collects the k8s resources to back up in alphabetical order.

As a result, Velero can skip resources that already have a Deletion Timestamp, but it cannot infer that a namespace-scoped resource is also about to be deleted when its parent namespace carries a Deletion Timestamp.