vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

PV Snapshot not working after upgrading from velero 1.8.0 to 1.13.1 #7659

Open gravops opened 3 months ago

gravops commented 3 months ago

What steps did you take and what happened: Installed the latest Helm chart (6.0.0) with Velero image 1.13.1, AWS plugin 1.9.1, and CSI plugin 0.7.0.

What did you expect to happen: When creating a backup from the CLI using a schedule, the PV snapshot backup does not work; the backup fails. Support bundle: bundle-2024-04-11-16-38-03.tar.gz

gravops commented 3 months ago

I am getting an error like this:

Errors:
  Velero:    name: /app message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=ns-testapp2, name=ebs-claim-test): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass default-storage-class: error getting volumesnapshotclass: failed to get volumesnapshotclass for provisioner ebs.csi.aws.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label
MoZadro commented 3 months ago

Hello, have you configured a VolumeSnapshotClass? Example: https://medium.com/linux-shots/backup-kubernetes-using-velero-and-csi-volume-snapshot-4155d4e32e5d

gravops commented 3 months ago

Hello @MoZadro, as suggested I created a VolumeSnapshotClass, and I am still getting the same issue. Do I need to create VolumeSnapshotClass and VolumeSnapshot objects for each provisioner type as well?

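apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
# snapshot.storage.k8s.io/v1 names this field `driver` (`snapshotter` is the v1alpha1 name)
driver: ebs.csi.aws.com
deletionPolicy: Retain
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ebs-volume-snapshot
spec:
  # v1 field names; the v1alpha1 equivalents were snapshotClassName and source.name/kind
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: ebs-claim
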
Errors:
  Velero:    message: /Error listing resources error: /the server could not find the requested resource
             name: /app message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=ns-testapp2-10002-dev, name=ebs-claim-test): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass default-storage-class: error getting volumesnapshotclass: failed to get volumesnapshotclass for provisioner ebs.csi.aws.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /ebs-claim message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=ns-testapp2-10002-dev, name=ebs-claim): rpc error: code = Unknown desc = PVC ns-testapp2-10002-dev/ebs-claim has no volume backing this claim
             name: /efs-claim message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=ns-testapp2-10002-dev, name=efs-claim): rpc error: code = Unknown desc = PVC ns-testapp2-10002-dev/efs-claim has no volume backing this claim
             name: /prometheus-prometheus-operator-prometheus-db-prometheus-prometheus-operator-prometheus-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=prometheus, name=prometheus-prometheus-operator-prometheus-db-prometheus-prometheus-operator-prometheus-0): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass default-storage-class: error getting volumesnapshotclass: failed to get volumesnapshotclass for provisioner ebs.csi.aws.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label
  Cluster:   resource: /volumesnapshotclasses message: /Error listing items error: /the server could not find the requested resource
MoZadro commented 3 months ago

You need to create a VolumeSnapshotClass with parameters similar to your storageClass. To create a CSI snapshot of a PVC, you need a VolumeSnapshotClass.

For example my storageClass looks like this:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-encryption
parameters:
  csi.storage.k8s.io/node-publish-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-publish-secret-namespace: longhorn-system
  csi.storage.k8s.io/node-stage-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-stage-secret-namespace: longhorn-system
  csi.storage.k8s.io/provisioner-secret-name: longhorn-crypto
  csi.storage.k8s.io/provisioner-secret-namespace: longhorn-system
  encrypted: 'true'
  fromBackup: ''
  numberOfReplicas: '2'
  staleReplicaTimeout: '2880'
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate

My VolumeSnapshotClass looks like this:

kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: longhorn-encryption
  labels:
    velero.io/csi-volumesnapshot-class: "true" 
driver: driver.longhorn.io
deletionPolicy: Delete
gravops commented 3 months ago

@MoZadro, I created a VolumeSnapshotClass for each of my storage classes. I am now testing by creating a backup from a schedule. I think this will work now, but I will confirm once the backup and restore succeed.

SC:

NAME                              PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default-storage-class (default)   ebs.csi.aws.com         Delete          WaitForFirstConsumer   false                  77d
efs-sc-ns                         efs.csi.aws.com         Retain          Immediate              false                  48d
gp2                               kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  204d

Volume Snapshot Class:

NAME              DRIVER                  DELETIONPOLICY   AGE
csi-aws-vsc-ebs   ebs.csi.aws.com         Delete           11m
csi-aws-vsc-efs   efs.csi.aws.com         Retain           11m
csi-k8s-vsc-ebs   kubernetes.io/aws-ebs   Delete           11m

But I have a few questions:

  1. Is it correct that in Velero 1.8.0 we didn't need to create a VolumeSnapshotClass for the SCs (provisioners) we had?
  2. Do we need to create a VSC for every SC in our cluster?
  3. What is the label velero.io/csi-volumesnapshot-class: "true" for? Do I need to set it on each VSC?
MoZadro commented 3 months ago

@gravops I don't work for Velero, so I can't provide you with answers :) I tried to help because I had a similar issue with the CSI plugin :)

gravops commented 3 months ago

NP, I will gather this information, but many thanks for the help @MoZadro. 👍 :) Appreciate it!

gravops commented 3 months ago

It started checking for PVCs now, but I am getting timeouts when Velero tries to back up the PVCs.


  Velero:    message: /Timed out awaiting reconciliation of volumesnapshot ns-testapp2/velero-ebs-claim-test-kq7f6
             name: /app message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=ns-testapp2, name=velero-ebs-claim-test-kq7f6): rpc error: code = Unknown desc = timed out waiting for the condition
gravops commented 3 months ago

Now the PV backup is timing out. I also noticed that a random string is appended to the PVC name (velero-ebs-claim-test-kw8cz). My PVC name is "velero-ebs-claim-test", but "kw8cz" is appended to it in the logs.

Errors:
  Velero:    message: /Timed out awaiting reconciliation of volumesnapshot ns-testapp2-10002-dev/velero-ebs-claim-test-kw8cz
             name: /app message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=ns-testapp2-10002-dev, name=velero-ebs-claim-test-kw8cz): rpc error: code = Unknown desc = timed out waiting for the condition

I have the following config for the SC and VSC:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: default-storage-class
parameters:
  encrypted: "true"
  kmsKeyId: ""
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

VSC:

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: ebs.csi.aws.com
kind: VolumeSnapshotClass
metadata:
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: csi-aws-vsc-ebs
blackpiglet commented 3 months ago

https://velero.io/docs/v1.13/csi/#implementation-choices Please check this document to find out how the VolumeSnapshotClass should be created.

In short, if you prefer to have a default VolumeSnapshotClass, apply the label velero.io/csi-volumesnapshot-class: "true" to that VolumeSnapshotClass. You can also fine-tune the VolumeSnapshotClass setting for the backup if multiple classes are needed.
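
The per-backup and per-PVC options in the linked document can be sketched roughly as below. This is a minimal sketch based on the annotation formats described in the Velero v1.13 CSI documentation, not a tested configuration. The backup name `ebs-backup`, the `velero` namespace, and the PVC `spec` values are assumptions for illustration; `csi-aws-vsc-ebs`, `ebs-claim-test`, `ns-testapp2`, and `default-storage-class` are taken from earlier in this thread. The backup-level annotation key has the form `velero.io/csi-volumesnapshot-class_<driver name>`, and the documentation describes the same annotation for a Schedule.

```yaml
# Option 1: choose a VolumeSnapshotClass per CSI driver for a single backup.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ebs-backup              # hypothetical name
  namespace: velero             # assumes Velero runs in the "velero" namespace
  annotations:
    # key format: velero.io/csi-volumesnapshot-class_<driver name>
    velero.io/csi-volumesnapshot-class_ebs.csi.aws.com: "csi-aws-vsc-ebs"
spec:
  includedNamespaces:
    - ns-testapp2
---
# Option 2: pin a VolumeSnapshotClass on one specific PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim-test
  namespace: ns-testapp2
  annotations:
    velero.io/csi-volumesnapshot-class: "csi-aws-vsc-ebs"
spec:                           # spec values below are illustrative only
  accessModes:
    - ReadWriteOnce
  storageClassName: default-storage-class
  resources:
    requests:
      storage: 1Gi
```

In either case the referenced class still has to exist and its `driver` has to match the provisioner of the PVC's StorageClass.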

makarov-roman commented 3 months ago

I have a similar issue: although enableCSI=false, backups are still failing with the message "the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)" (AWS/EBS).

kaovilai commented 3 months ago

@makarov-roman do you have the full log line where it says which line of code the error was printed from?

MoZadro commented 3 months ago

Hello, what if we have multiple storageClasses on the cluster? Do we need to create a VolumeSnapshotClass object for each storageClass? Also, since we need to add the label velero.io/csi-volumesnapshot-class=true to a VolumeSnapshotClass to make it the default for volume snapshots created by Velero, can only one VolumeSnapshotClass be the default?

gravops commented 3 months ago

@MoZadro If you have multiple storage classes, you will need to create a VSC for each of them, and you will also need to remove "velero.io/csi-volumesnapshot-class=true" from your VSC definitions. I did the same and it worked for me.

MoZadro commented 3 months ago

So if I have multiple storageClasses and multiple VolumeSnapshotClasses, I need to remove "velero.io/csi-volumesnapshot-class=true" so that this label is not defined on any VolumeSnapshotClass?

gravops commented 3 months ago

Yes, for me it only works that way.
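
For comparison, the Velero CSI documentation linked earlier in the thread describes the label as marking the default class for a particular driver, which suggests it is scoped per driver rather than per cluster. A minimal sketch under that reading, assuming two CSI drivers that appear in this thread (`ebs.csi.aws.com` and `driver.longhorn.io`); the pairing of the two on one cluster is illustrative, and the manifests have not been tested:

```yaml
# One labeled VolumeSnapshotClass per CSI driver.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc-ebs
  labels:
    velero.io/csi-volumesnapshot-class: "true"   # default class for ebs.csi.aws.com
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-encryption
  labels:
    velero.io/csi-volumesnapshot-class: "true"   # default class for driver.longhorn.io
driver: driver.longhorn.io
deletionPolicy: Delete
```

The error messages above ("failed to get volumesnapshotclass for provisioner ebs.csi.aws.com") indicate that the lookup is keyed on the provisioner of the PVC's StorageClass, so each class here would only be considered for volumes of its own driver.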

makarov-roman commented 3 months ago

> @makarov-roman do you have the full log line where it says which line of code the error was printed from?

Sorry, not anymore. I've migrated all environments to the CSI snapshotter.

blackpiglet commented 3 months ago

> I have a similar issue: although enableCSI=false, backups are still failing with the message "the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)" (AWS/EBS).

That means the VolumeSnapshotClass CRD was not installed in the EKS environment.

makarov-roman commented 2 months ago

> > I have a similar issue: although enableCSI=false, backups are still failing with the message "the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)" (AWS/EBS).
>
> That means the VolumeSnapshotClass CRD was not installed in the EKS environment.

It wasn't. Why is it required without enableCSI? It wasn't required before, so this was a bit unexpected.

sseago commented 2 months ago

@makarov-roman The main action of enabling CSI is to make CSI snapshots of volumes that don't use fs-backup. This entails creating VolumeSnapshots and VolumeSnapshotContents for PVCs to back up. This won't work without a VolumeSnapshotClass for your VolumeSnapshots. If you were on 1.8 before and are on 1.13 now, it could be that you have v1beta1 VolumeSnapshotClass defined but not v1. The CSI plugin moved from v1beta1 to v1 for VS, VSC, and VSClass in either Velero 1.9 or Velero 1.10 -- I forget the exact release, but I'm pretty sure that 1.8 still used the beta version.

makarov-roman commented 2 months ago

@sseago Well, in my case the migration was from 1.12 to 1.13, and I had exactly the same error as the OP. I didn't have any VolumeSnapshotClass installed (and had useSnapshot=false and no enableCSI flag). @gravops, can you confirm that this is also true for you? Before the update, it worked without any VolumeSnapshotClass.

blackpiglet commented 2 months ago

Several similar issues are intertwined here. @makarov-roman Although your scenario is also related to the VolumeSnapshotClass CRD, it differs from the original issue.

> the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)

That error looks like client-go failing to get the VolumeSnapshotClass CRD from the kube-apiserver. It is more of a Kubernetes resource discovery and collection problem, and is not related to backing up volume data with CSI snapshots.

makarov-roman commented 2 months ago

Thanks for the response, @blackpiglet. Ah, I think you're right, it's different. We didn't use snapshots, but after the Velero update from v1.12 to v1.13, it failed at runtime because of the missing volumesnapshotclass dependency.

blackpiglet commented 2 months ago

@makarov-roman Thanks. I see.

This is fixed in the main branch by PR #7515, but it has not been cherry-picked into the 1.13 branch. I will create the cherry-pick PR.

blackpiglet commented 2 months ago

@makarov-roman

#7789 has been created to cherry-pick the change into release-1.13.