vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Snapshot data movement restore does not fully work with StorageClass with binding mode WaitForFirstConsumer #7561

Open Elias-elastisys opened 5 months ago

Elias-elastisys commented 5 months ago

What steps did you take and what happened:

I've been trying out volume snapshot data movement for backups and restores to S3, specifically to be able to back up PVs that have no running Pod currently attached, as this is not possible with regular Velero backups.

When restoring a successful backup in a cluster whose StorageClass uses the binding mode WaitForFirstConsumer, the restore times out, because Velero waits until a PV is provisioned before it creates the helper pod that facilitates the restore. But the storage provider does not provision storage until a Pod is attached to the PVC, so the restore essentially deadlocks.
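The circular wait described above can be illustrated with a toy model. This is only a sketch of the reasoning in this report, not Velero or Kubernetes code; the `ClusterState` class and both function names are hypothetical.

```python
# Toy model of the circular wait; not Velero or Kubernetes code.
from dataclasses import dataclass


@dataclass
class ClusterState:
    binding_mode: str            # "Immediate" or "WaitForFirstConsumer"
    consumer_pod_exists: bool    # does any Pod reference the restored PVC?
    pv_provisioned: bool = False
    helper_pod_created: bool = False


def step(state: ClusterState) -> bool:
    """Advance the model by one round; return True if anything changed."""
    changed = False
    # Provisioner side: with WaitForFirstConsumer, storage is provisioned
    # only once a Pod uses the PVC.
    if not state.pv_provisioned and (
        state.binding_mode == "Immediate" or state.consumer_pod_exists
    ):
        state.pv_provisioned = True
        changed = True
    # Restore side (as described in this issue): the helper pod is created
    # only after the PV exists.
    if state.pv_provisioned and not state.helper_pod_created:
        state.helper_pod_created = True
        changed = True
    return changed


def restore_completes(state: ClusterState, max_rounds: int = 10) -> bool:
    """Run rounds until nothing changes; report whether the helper pod started."""
    for _ in range(max_rounds):
        if not step(state):
            break
    return state.helper_pod_created
```

With `WaitForFirstConsumer` and no consumer Pod, neither side can make progress, so `restore_completes` returns `False`; adding a consumer Pod or using `Immediate` binding breaks the cycle.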

If I manually create a Pod that attaches to the restored PVC, the restore eventually succeeds.
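A minimal sketch of that manual workaround: the snippet below builds a placeholder Pod manifest whose only job is to mount the restored PVC. The PVC name `restored-data`, the namespace `my-app`, the Pod name, the mount path, and the choice of the `registry.k8s.io/pause:3.9` image are all assumptions for illustration, not values from this report.

```python
import json


def consumer_pod_manifest(pvc_name: str, namespace: str,
                          pod_name: str = "pvc-consumer") -> dict:
    """Build a minimal Pod manifest that mounts the given PVC.

    All names and the image are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "pause",
                # Assumed image: any container that simply stays running will do.
                "image": "registry.k8s.io/pause:3.9",
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": "data",
                # Referencing the PVC makes this Pod its first consumer.
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical PVC name and namespace.
    print(json.dumps(consumer_pod_manifest("restored-data", "my-app"), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.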

What did you expect to happen: To be able to restore a backup successfully, even if the backup contains PV/PVCs without any attached Pod.

Anything else you would like to add:

I found this old "prioritized" issue: https://github.com/vmware-tanzu/velero/issues/2971 with the same problem but for Restic, while volume snapshot movement uses Kopia. It doesn't seem to have been updated or made any progress in over 2 years. Any updates?

blackpiglet commented 5 months ago

This is by design. The CSI snapshot data mover restore is similar to the filesystem restore or PodVolumeRestore. The Velero node-agent needs to let the volume settle down on a node, then tries to write the data into the pod's volume-mounting directory.

Lyndon-Li commented 5 months ago

@Elias-elastisys This is by the design of Velero and Kubernetes for volumes with WaitForFirstConsumer as the binding mode.

Also, Velero snapshot data movement has not been tested against backing up volumes without a pod, so that case is not officially supported. Could you describe your case in more detail? Why are there many volumes to be backed up that are not attached to pods? Your use case would help us prioritize our work and include this support in the future plan.

Elias-elastisys commented 5 months ago

> This is by design.

Alright thanks, unfortunate for my test case but it makes sense.

> And Velero snapshot data movement is not tested against the case of backing up volumes without pod, so is not officially supported. Could you describe more about your case? Why there are many volumes to be backed up but without attaching to pods?

It is to truly back up all data in a cluster. If you run CronJobs or Jobs with PVs, there might be cases where backups run while the Job is not running. In that case, regular Velero backups will not capture that data.

Now, as I said, CSI data movement seems to be able to successfully back up data without Pods, even with WaitForFirstConsumer, since the PV is already there. The only issue is that the restore requires manual intervention, which I guess is not the biggest issue, since it succeeds if you apply the Pod manually.

But it would of course be greatly appreciated if it would be possible to catch this edge case as well.

Lyndon-Li commented 5 months ago

We discussed this issue; here is the conclusion:

  1. This is a valid case, but we are not sure how many users require it.
  2. The problem affects not only data mover restore but also other restore types, i.e., CSI snapshot restore and native snapshot restore. The difference is that data mover restore blocks, while the others let the restore proceed, but the restored PV may not be able to bind to the original pod since the WaitForFirstConsumer rules are not applied.
  3. To fix the problem, we can relax the constraint that data mover restore enforces. That is, we can add an option for users to specify the volumes for which data mover restore must wait for the pod to be scheduled, so that the WaitForFirstConsumer rules are applied; or, when the wait for pod scheduling times out, we let the restore proceed and ignore WaitForFirstConsumer.
  4. However, both approaches have side effects: the user option may be misused, and if we let the restore proceed on timeout, the restored PV may not be able to bind to the original pod.
  5. Therefore, we will put this issue into the backlog so that we can collect enough requirements and input on it.
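
The two alternatives in item 3 can be sketched as a single decision function. This is a hypothetical illustration of the idea only, not Velero's implementation or a proposed API; the function name, its parameters, and the returned labels are all made up for this sketch.

```python
import time
from typing import Callable, Set


def restore_decision(volume: str,
                     must_wait: Set[str],
                     pod_scheduled: Callable[[], bool],
                     timeout_s: float,
                     proceed_on_timeout: bool,
                     poll_s: float = 0.01) -> str:
    """Decide how to restore one volume (hypothetical sketch).

    must_wait models the user option: volumes for which the restore
    must wait until the consuming pod has been scheduled.
    """
    if volume not in must_wait:
        # User did not ask to honor WaitForFirstConsumer for this volume.
        return "restore-now"
    deadline = time.monotonic() + timeout_s
    while True:
        if pod_scheduled():
            return "restore-on-scheduled-node"
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    if proceed_on_timeout:
        # Second alternative: give up waiting and ignore WaitForFirstConsumer.
        return "restore-ignoring-wffc"
    raise TimeoutError(
        f"pod for volume {volume!r} was not scheduled within {timeout_s}s"
    )
```

The side effects from item 4 show up directly here: a wrong `must_wait` set either blocks or skips the wait for the wrong volumes, and the `restore-ignoring-wffc` path is the one where the restored PV may later fail to bind to the original pod.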

blackpiglet commented 5 months ago

Adding some test results here. If the unmounted volume is backed up by the CSI plugin, then after restoration the PVC hangs in the Pending state until a pod that mounts the PVC is created manually.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

Lyndon-Li commented 1 month ago

#8044 has been opened to track an enhancement for this case.