vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

failed to restore volume with StorageClass, claim Selector is not supported #7946

Open soostdijck opened 5 days ago

soostdijck commented 5 days ago

What steps did you take and what happened: We have a setup where the NFS CSI driver creates the PVs dynamically once the PVCs are created or restored. This is done by specifying the correct storage classes.

However, when Velero backs up the PVCs, it adds a selector that breaks the PV creation by the NFS driver:

  selector:
    matchLabels:
      velero.io/dynamic-pv-restore: <pvc-name>.87z5b

What did you expect to happen: We expect the restore to happen without Velero adding extra selectors that break the dynamic PV creation.
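For context, Kubernetes does not dynamically provision a volume for a claim that has a non-empty selector, which is why a storage-class-based provisioner rejects a PVC like the one above. The check can be sketched roughly as follows. This is a minimal illustration, not the provisioner's actual code; the function name, the claim name `my-claim`, and the class name `my-nfs-class` are hypothetical.

```python
def dynamic_provisioning_blocked(pvc: dict) -> bool:
    """Return True if the claim asks for dynamic provisioning but has a selector.

    Hypothetical helper: it mirrors the documented Kubernetes rule that a PVC
    with a non-empty selector cannot have a PV dynamically provisioned for it.
    """
    spec = pvc.get("spec", {})
    selector = spec.get("selector") or {}
    has_selector = bool(
        selector.get("matchLabels") or selector.get("matchExpressions")
    )
    # A claim relies on dynamic provisioning when it names a storage class
    # and is not already pinned to a specific PV through volumeName.
    wants_dynamic = bool(spec.get("storageClassName")) and not spec.get("volumeName")
    return has_selector and wants_dynamic


# Illustrative manifests (names are made up, the label key is from the issue).
original_pvc = {
    "metadata": {"name": "my-claim"},
    "spec": {"storageClassName": "my-nfs-class"},
}
restored_pvc = {
    "metadata": {"name": "my-claim"},
    "spec": {
        "storageClassName": "my-nfs-class",
        "selector": {
            "matchLabels": {"velero.io/dynamic-pv-restore": "my-claim.87z5b"}
        },
    },
}

print(dynamic_provisioning_blocked(original_pvc))
print(dynamic_provisioning_blocked(restored_pvc))
```

Running the same check against the PVC as it exists in the cluster after a restore shows quickly whether the selector is what stops the provisioner.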

The output of the following commands will help us better understand what's going on:

velero backup create BACKUP_NAME --include-namespaces NAMESPACE --snapshot-move-data --snapshot-volumes --include-resources pvc

velero restore create --from-backup BACKUP_NAME

Environment:

Velero helm chart 6.4.x, Velero version 1.13.2, Kubernetes version 1.27

Note, this is a duplicate of this issue on the helm chart, but I think it belongs here

kaovilai commented 5 days ago

> it adds a selector that breaks the PV creation by the NFS driver

Sounds like a faulty NFS driver if it can't handle user (or velero) added labelSelector.

soostdijck commented 5 days ago

> > it adds a selector that breaks the PV creation by the NFS driver
>
> Sounds like a faulty NFS driver if it can't handle user (or velero) added labelSelector.

It seems very unlikely to me that something as large and common as csi nfs would be "faulty".

Lyndon-Li commented 5 days ago

This is done on purpose, as expected by the Velero data mover restore workflow. After the Velero data mover restore completes, the restored PV is bound to this PVC. In other words, this PVC can only be bound by the Velero data mover restore.

If you don't see the binding happen, it means the data mover restore has not completed. You can then get the corresponding DataDownload CR to see the progress by running kubectl get datadownload -n velero
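If there are many DataDownload CRs, the JSON form of that listing (the same command with `-o json`) can be summarized with a short script. The sketch below assumes each item reports its state in `status.phase` and that `Completed`, `Failed`, and `Canceled` are the final phases; the function name and the sample data are hypothetical.

```python
from collections import Counter

# Phases after which a DataDownload is assumed not to progress any further.
TERMINAL_PHASES = {"Completed", "Failed", "Canceled"}


def summarize_datadownloads(listing: dict) -> dict:
    """Count DataDownloads per phase and collect the ones still unfinished."""
    phases = Counter()
    pending = []
    for item in listing.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # A freshly created CR may not have a status yet; treat it as "New".
        phase = item.get("status", {}).get("phase") or "New"
        phases[phase] += 1
        if phase not in TERMINAL_PHASES:
            pending.append(name)
    return {"phases": dict(phases), "pending": pending}


# Hypothetical sample shaped like `kubectl get datadownload -n velero -o json`.
sample = {
    "items": [
        {"metadata": {"name": "restore-a"}, "status": {"phase": "Completed"}},
        {"metadata": {"name": "restore-b"}, "status": {"phase": "InProgress"}},
        {"metadata": {"name": "restore-c"}},
    ]
}

summary = summarize_datadownloads(sample)
print(summary["phases"])
print(summary["pending"])
```

An empty listing would suggest that the data mover restore never started, which is worth mentioning when reporting the problem.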

kaovilai commented 5 days ago

@soostdijck can you link docs that indicate label selector cannot be added?

StellaV commented 5 days ago

Hi @Lyndon-Li and @kaovilai

Thanks for the quick replies!

I think there's one confusion about how we use the NFS driver. We do not back up the PV's, as we use a storage class that dynamically creates them when a PVC is added. This is where it goes wrong. The dynamic PV's cannot be created due to the selector added by Velero, resulting in the error "failed to restore volume with StorageClass, claim Selector is not supported".

Here's an example of how we did it:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Delete=false
  name: sc-example
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.com
  share: /
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.2
  - nolock
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-example
  labels:
    velero.io/include-in-backup: "true"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: sc-example
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsc-example
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: nfs.csi.k8s.io
parameters:
  server: nfs.example.com
  share: /
deletionPolicy: Delete

I hope this makes the issue a bit more clear?

Regards, Stella

Lyndon-Li commented 5 days ago

"failed to restore volume with StorageClass, claim Selector is not supported"

As I mentioned here, this is expected if you are running data mover restore.

Lyndon-Li commented 5 days ago

> We do not back up the PV's, as we use a storage class that dynamically creates them when a PVC is added

Velero automatically selects the PVCs and PVs to back up. Depending on the backup method, the PV object is sometimes backed up and sometimes not. For the data mover backup you are using, the PV object is NOT backed up, while the PVC object is.

StellaV commented 2 days ago

@Lyndon-Li ,

Velero does by default select everything. But we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV. But what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero.

Lyndon-Li commented 2 days ago

> what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero

This will happen after the Velero data mover restore completes. During the restore process, a PV is created by the NFS driver and finally bound to the restored PVC after the data is restored to the PV.

Therefore, if the PVC is not restored successfully, just check whether the DataDownload has completed successfully.
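The binding rule behind this can be sketched in a few lines: a claim with matchLabels can only bind to a PV that carries every listed label. The sketch assumes (this is not confirmed in this thread) that the data mover restore puts the velero.io/dynamic-pv-restore label on the PV once it has been populated; the function name and the label values are illustrative.

```python
def pv_satisfies_selector(pv_labels: dict, match_labels: dict) -> bool:
    """Return True if the PV carries every label required by the claim."""
    return all(pv_labels.get(key) == value for key, value in match_labels.items())


# Selector as it appears on the restored PVC (value is illustrative).
claim_match_labels = {"velero.io/dynamic-pv-restore": "pvc-example.87z5b"}

# A PV created by some other provisioning path has no such label ...
unrelated_pv_labels = {}
# ... while the PV handed over by the restore is assumed to be labeled.
restored_pv_labels = {"velero.io/dynamic-pv-restore": "pvc-example.87z5b"}

print(pv_satisfies_selector(unrelated_pv_labels, claim_match_labels))
print(pv_satisfies_selector(restored_pv_labels, claim_match_labels))
```

Under that assumption the selector keeps any other PV from binding to the claim before the data has been restored.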

StellaV commented 2 days ago

> > what I would prefer to see is that a new PV is created by the NFS driver once a PVC is restored by Velero
>
> This will happen after the Velero data mover restore completes. During the restore process, a PV is created by the NFS driver and finally bound to the restored PVC after the data is restored to the PV.
>
> Therefore, if the PVC is not restored successfully, just check whether the DataDownload has completed successfully.

That's exactly what I also expected to happen, but I get the "claim Selector is not supported" error instead. The DataDownload step is not even reached.

I see a similar issue here, which is the next driver we needed to test with Velero :)

Lyndon-Li commented 2 days ago

> we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV

This (backing up/restoring only the PVC, without the pod) doesn't relate to the provisioning method (dynamic or static), but to the PVC's bindingMode (the volumeBindingMode of its StorageClass). Specifically, if the bindingMode is Immediate, everything works well. But if the bindingMode is WaitForFirstConsumer, the restore will never complete until the PVC is mounted by a pod, see issue #7561. This is a constraint of WaitForFirstConsumer by Kubernetes' design: the PV is not provisioned until the pod is scheduled.

This applies to PVC-only restores only; normal restores (PVCs with pods) don't have the problem.
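As a compact restatement of the rule described above, here is a small sketch. It only encodes the expected behavior as stated in this comment; the function name and parameters are hypothetical, and it does not query a cluster.

```python
def pvc_only_restore_can_finish(volume_binding_mode: str, pod_mounts_pvc: bool) -> bool:
    """Say whether a restore is expected to complete for a given binding mode.

    Encodes the rule from the comment above: Immediate always works, while
    WaitForFirstConsumer needs a pod that mounts the PVC.
    """
    if volume_binding_mode == "Immediate":
        return True
    if volume_binding_mode == "WaitForFirstConsumer":
        # Provisioning is deferred until a pod using the PVC is scheduled.
        return pod_mounts_pvc
    raise ValueError(f"unknown volumeBindingMode: {volume_binding_mode!r}")


print(pvc_only_restore_can_finish("Immediate", pod_mounts_pvc=False))
print(pvc_only_restore_can_finish("WaitForFirstConsumer", pod_mounts_pvc=False))
print(pvc_only_restore_can_finish("WaitForFirstConsumer", pod_mounts_pvc=True))
```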

StellaV commented 2 days ago

> > we only include the PVC in the backup. I'm not sure what the impact will be if we try to restore a dynamically created PV
>
> This (backing up/restoring only the PVC, without the pod) doesn't relate to the provisioning method (dynamic or static), but to the PVC's bindingMode (the volumeBindingMode of its StorageClass). Specifically, if the bindingMode is Immediate, everything works well. But if the bindingMode is WaitForFirstConsumer, the restore will never complete until the PVC is mounted by a pod, see issue #7561. This is a constraint of WaitForFirstConsumer by Kubernetes' design: the PV is not provisioned until the pod is scheduled.
>
> This applies to PVC-only restores only; normal restores (PVCs with pods) don't have the problem.

That makes perfect sense. We have the bindingMode set to Immediate (see the yaml snippet I added earlier, this is almost the exact code we used). So, this should not be an issue

Lyndon-Li commented 2 days ago

OK, then the expected behavior is that the PVC is restored successfully. If that is not the case for you, please share the Velero log bundle with us by running velero debug

edhunter665 commented 1 day ago

We have the same issue using the vSphere CSI driver csi.vsphere.vmware.com. If bindingMode is set to Immediate, the restore fails (partially): everything but the PV and PVC gets restored. If bindingMode is set to WaitForFirstConsumer, the whole restore works fine.

Lyndon-Li commented 9 hours ago

@edhunter665 This doesn't look like the original problem, so please open another issue and attach more details and the Velero log bundle.