vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Got backup PartiallyFailed result when backing up PVCs which are not used by any pod #7233

Closed danfengliu closed 8 months ago

danfengliu commented 11 months ago

Describe the problem/challenge you have

The namespace being backed up contains PVCs that are not in use by any pod, and the backup ends with a PartiallyFailed result.

k get pvc -n azure-csi-test
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-logs-e2e-1   Bound     pvc-e4dda753-8c88-4194-93b5-959a01be1de4   1Gi        RWO            managed-csi    16h
nginx-logs-e2e-2   Pending                                                                        managed-csi    16h
nginx-logs-e2e-3   Pending                                                                        managed-csi    16h
nginx-logs-e2e-4   Pending                                                                        managed-csi    16h
nginx-logs-e2e-5   Pending                                                                        managed-csi    16h

velero describe backup backup-csi-6 --details
Name:         backup-csi-6
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.0
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  PartiallyFailed (run `velero backup logs backup-csi-6` for more information)

Errors:
  Velero:    name: /nginx-logs-e2e-2 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=azure-csi-test, name=nginx-logs-e2e-2): rpc error: code = Unknown desc = PVC azure-csi-test/nginx-logs-e2e-2 has no volume backing this claim
             name: /nginx-logs-e2e-3 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=azure-csi-test, name=nginx-logs-e2e-3): rpc error: code = Unknown desc = PVC azure-csi-test/nginx-logs-e2e-3 has no volume backing this claim
             name: /nginx-logs-e2e-4 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=azure-csi-test, name=nginx-logs-e2e-4): rpc error: code = Unknown desc = PVC azure-csi-test/nginx-logs-e2e-4 has no volume backing this claim
             name: /nginx-logs-e2e-5 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=azure-csi-test, name=nginx-logs-e2e-5): rpc error: code = Unknown desc = PVC azure-csi-test/nginx-logs-e2e-5 has no volume backing this claim

Describe the solution you'd like

A warning should be enough to let the user notice that this workload might have an issue.

blackpiglet commented 11 months ago

I agree we should do better in this scenario. There are some similar cases to this one, so let's settle on a unified solution for all of them. IMO, we should consider the k8s resources that are crucial for Velero, such as Pod, PV, and PVC.

First, the potential errors should be converted to warnings. Second, we need to consider whether the volumes should be tracked by the skipped PV trackers.
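
To make the first point concrete, here is a minimal sketch of what "convert the error to a warning" could look like. It is not Velero's actual code: the `PVC` and `Result` types and the `checkPVC` and `phase` functions are hypothetical, simplified stand-ins, and the phase logic is reduced to two outcomes for illustration.

```go
package main

import "fmt"

// PVC is a hypothetical, simplified stand-in for a PersistentVolumeClaim.
// It is not the real Kubernetes API type.
type PVC struct {
	Namespace  string
	Name       string
	VolumeName string // empty while the claim is still Pending
}

// Result is a hypothetical container for the messages a backup collects.
type Result struct {
	Warnings []string
	Errors   []string
}

// checkPVC reports whether the claim has a backing volume. For an unbound
// claim it records a warning instead of an error and returns false, so the
// caller can skip the volume data while still backing up the PVC object.
func checkPVC(pvc PVC, res *Result) bool {
	if pvc.VolumeName == "" {
		res.Warnings = append(res.Warnings, fmt.Sprintf(
			"PVC %s/%s has no volume backing this claim, skipping volume backup",
			pvc.Namespace, pvc.Name))
		return false
	}
	return true
}

// phase derives a simplified final phase: only errors make the backup
// PartiallyFailed, warnings alone do not.
func phase(res Result) string {
	if len(res.Errors) > 0 {
		return "PartiallyFailed"
	}
	return "Completed"
}

func main() {
	// Claims taken from the report above: one Bound, one of the Pending ones.
	pvcs := []PVC{
		{"azure-csi-test", "nginx-logs-e2e-1", "pvc-e4dda753-8c88-4194-93b5-959a01be1de4"},
		{"azure-csi-test", "nginx-logs-e2e-2", ""},
	}
	var res Result
	for _, p := range pvcs {
		checkPVC(p, &res)
	}
	fmt.Println("phase:", phase(res))
	for _, w := range res.Warnings {
		fmt.Println("warning:", w)
	}
}
```

With this shape, a namespace that only has Pending claims would still surface a message per claim, but the messages alone would not change the final phase.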

hsinhoyeh commented 10 months ago

Hi team, thanks for creating this backup/restore tooling. Unfortunately, we encountered this issue when our serverless applications relied on an RWX mode PVC. We chose to run the backup at midnight, when traffic is low and the number of running pods is scaled down to zero. But the backup didn't cover our PVCs for serverless :(

blackpiglet commented 10 months ago

@hsinhoyeh Could you give more information about your scenario? Do you know if you use the Filesystem or volume snapshot backup? Is the PVC mounted by multiple pods when the backup is in progress?

hsinhoyeh commented 10 months ago

Hi @blackpiglet, we use file-system backup. The PVC is supposed to be mounted by multiple pods (with access mode RWX). Having said that, our pods mostly read from the PVC during the backup rather than write to it.

blackpiglet commented 10 months ago

@hsinhoyeh Thanks for the feedback. Could you share the backup command or the backup CR YAML?

blackpiglet commented 9 months ago

If no pod mounts the PVC while a backup is in progress, the file-system backup cannot cover the PVC, because the file-system uploader reads the PVC's volume data through the pod's mount directory on the k8s node. Please read the PodVolumeBackup description to understand how it works: https://velero.io/docs/v1.13/file-system-backup/#custom-resource-and-controllers.

For your scenario, if the PVC's volume supports snapshots, then we can use a snapshot to back up the data.
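
As a rough way to reason about which claims file-system backup can reach, the sketch below lists the claims that no running pod mounts at backup time. It is only an illustration of the constraint described above: the `Pod` type, the `unmountedClaims` function, and the pod and claim names are hypothetical, and it does not use the Kubernetes client or any Velero API.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified view of a pod: whether it is running
// and which PVC names it mounts. It is not the real Kubernetes API type.
type Pod struct {
	Name       string
	Running    bool
	ClaimNames []string
}

// unmountedClaims returns, in sorted order, the claims that no running pod
// mounts. Under the constraint described above, these are the claims a
// file-system backup could not read, because no pod mount directory exists
// for them on any node.
func unmountedClaims(claims []string, pods []Pod) []string {
	mounted := make(map[string]bool)
	for _, p := range pods {
		if !p.Running {
			continue
		}
		for _, c := range p.ClaimNames {
			mounted[c] = true
		}
	}
	var out []string
	for _, c := range claims {
		if !mounted[c] {
			out = append(out, c)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical names: a serverless workload scaled to zero at midnight.
	claims := []string{"shared-data", "cache"}
	pods := []Pod{
		{Name: "worker-0", Running: false, ClaimNames: []string{"shared-data"}},
		{Name: "cache-0", Running: true, ClaimNames: []string{"cache"}},
	}
	fmt.Println("not reachable by file-system backup:", unmountedClaims(claims, pods))
}
```

A claim that shows up in this list at backup time is a candidate for the snapshot path instead, provided its volume supports snapshots.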

reasonerjt commented 9 months ago

This is working as expected. I don't think we want to change the error into a warning, which would be a breaking change.

stp-bsh commented 4 months ago

Is there any way to exclude PVCs with unbound PVs? As long as this is not the case, I would see this as a Warning and not as an Error.