vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Waiting for CSI driver to reconcile volumesnapshot #5418

Closed vijay-yadav-3 closed 2 years ago

vijay-yadav-3 commented 2 years ago

I have installed Velero with the CSI plugin for EFS, along with the volume snapshotter, a VolumeSnapshotClass, and all the other required prerequisites, but the backup fails with the error below. Velero works without EFS, and all the EBS volumes are backed up. For EFS-backed PVs, however, the backup gets stuck at this point.

time="2022-09-29T14:59:29Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-09-29T14:59:33Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot cn/velero-node-1002-pvc-pxw25. Retrying in 5s" backup=velero/backup-node-1002 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:169" pluginName=velero-plugin-for-csi

blackpiglet commented 2 years ago

Hi, it looks like there is no error in the posted logs. Did the EFS VolumeSnapshot hit a timeout error in the end? If the Velero version you are using is no older than v1.7, please collect the Velero debug bundle with the velero debug command.

Furthermore, what is the size of the EFS volume?

vijay-yadav-3 commented 2 years ago

Yes, it timed out in the end. I am not able to fetch the full logs, but I got this at the end:

time="2022-10-04T11:21:06Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
I1004 11:21:09.054105 1 request.go:665] Waited for 1.046389052s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/authentication.k8s.io/v1?timeout=32s
time="2022-10-04T11:22:06Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-10-04T11:22:06Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-10-04T11:23:06Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-10-04T11:23:06Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"

blackpiglet commented 2 years ago

Got it. Then I think this is related to the size of the data being backed up from the EFS volume. If you are using Velero v1.9.1 or later, you can increase the CSI snapshot creation timeout with CSISnapshotTimeout: https://velero.io/docs/v1.9/api-types/backup/

vijay-yadav-3 commented 2 years ago

I did not understand. We have one EFS file system with 600+ PVs, and each PV is 1 GB. Even if I try with only one PV, it logs "Waiting for CSI driver to reconcile volumesnapshot" for over an hour and then fails.

On another note, how and where do I update CSISnapshotTimeout? I will try updating it.

draghuram commented 2 years ago

As per the message:

"Waiting for CSI driver to reconcile volumesnapshot cn/velero-node-1002-pvc-pxw25. Retrying in 5s"

The operation will be retried every 5 seconds, and this will continue for 10 minutes (by default). At the end of 10 minutes, you should see the message:

"Timed out awaiting reconciliation of volumesnapshot ..."

Do you see such message in the log?
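For readers following along, the retry behavior described above can be sketched as a simple poll-with-deadline loop. This is only an illustration in Python, not the plugin's actual Go code: the function name wait_for_reconcile and the is_ready callback are hypothetical, and the 5 second interval and 10 minute default are taken from the comment above.

```python
import time


def wait_for_reconcile(is_ready, interval=5.0, timeout=600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() every `interval` seconds until it returns True.

    Hypothetical sketch: raises TimeoutError once waiting another
    `interval` would go past `timeout` seconds from the start.
    Returns the number of attempts made on success.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if is_ready():
            return attempts
        if clock() + interval > deadline:
            # Mirrors the "Timed out awaiting reconciliation" log message.
            raise TimeoutError(
                f"Timed out awaiting reconciliation after {attempts} attempts"
            )
        # Mirrors the "Retrying in 5s" log message.
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised with a fake clock instead of waiting for real time to pass.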

blackpiglet commented 2 years ago

@vijay-yadav-3 If your EFS PV data size is significantly bigger than the EBS PV data size, I suggest separating them into two different backups. To set the CSISnapshotTimeout value for a backup, run velero backup create <backup-name> --csi-snapshot-timeout=1h, and please make sure the Velero version is no older than v1.9.1.
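As an alternative to the CLI flag, the timeout can presumably also be set in the Backup resource itself. The sketch below builds such a manifest in Python and prints it as JSON. The backup name efs-only-backup is hypothetical, the cn namespace is taken from the log line earlier in this thread, and the field name csiSnapshotTimeout is my reading of the Backup API page linked above, so please check it against the documentation for your Velero version.

```python
import json


def backup_manifest(name, included_namespaces, csi_snapshot_timeout="1h"):
    """Build a Velero Backup manifest as a plain dict.

    Assumption: the spec field is spelled csiSnapshotTimeout, as in the
    Backup API docs linked in this thread (Velero v1.9.1 or later).
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "includedNamespaces": list(included_namespaces),
            "csiSnapshotTimeout": csi_snapshot_timeout,
        },
    }


if __name__ == "__main__":
    # "efs-only-backup" is a made-up name; "cn" comes from the posted logs.
    print(json.dumps(backup_manifest("efs-only-backup", ["cn"]), indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be piped into kubectl apply -f - if that is more convenient than the velero CLI.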

vijay-yadav-3 commented 2 years ago

As per the message:

"Waiting for CSI driver to reconcile volumesnapshot cn/velero-node-1002-pvc-pxw25. Retrying in 5s"

The operation will be retried every 5 seconds, and this will continue for 10 minutes (by default). At the end of 10 minutes, you should see the message:

"Timed out awaiting reconciliation of volumesnapshot ..."

Do you see such message in the log?

Yes, I see the same logs when trying.

draghuram commented 2 years ago

OK. What is happening is that Velero creates a VolumeSnapshot resource and expects to see a corresponding VolumeSnapshotContent show up, but that is not happening here. The snapshot controller creates the VolumeSnapshotContent when it sees a VolumeSnapshot resource, so there must be some problem with it. You said EBS volumes are getting backed up. Do you know if Velero is taking CSI snapshots of EBS volumes or native EBS snapshots?

In any case, you should verify that the snapshot controller is properly set up by manually creating a VolumeSnapshot and checking that a corresponding VolumeSnapshotContent is created. Until this succeeds, Velero CSI backups will not work. Let me know if you need help with creating a VolumeSnapshot manually.
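A minimal sketch of the manual check suggested above, assuming the snapshot.storage.k8s.io/v1 API is installed in the cluster. The snapshot name, PVC name, and VolumeSnapshotClass name used here are placeholders to replace with your own; only the cn namespace comes from the posted logs. The second helper inspects the status.boundVolumeSnapshotContentName field, which the snapshot controller fills in once a VolumeSnapshotContent has been bound.

```python
import json


def volume_snapshot_manifest(name, namespace, pvc_name, snapshot_class):
    """Build a VolumeSnapshot manifest for an existing PVC as a plain dict."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def bound_content_name(volume_snapshot):
    """Return the bound VolumeSnapshotContent name, or None if the
    VolumeSnapshot has not been reconciled yet (no status, or no binding)."""
    status = volume_snapshot.get("status") or {}
    return status.get("boundVolumeSnapshotContentName")


if __name__ == "__main__":
    # All names below except the "cn" namespace are placeholders.
    manifest = volume_snapshot_manifest(
        "manual-test-snapshot", "cn", "my-efs-pvc", "my-snapshot-class"
    )
    print(json.dumps(manifest, indent=2))
```

One way to use it: pipe the printed JSON into kubectl apply -f -, wait a little, then fetch the object with kubectl get volumesnapshot manual-test-snapshot -n cn -o json and pass the parsed result to bound_content_name. If it keeps returning None, the VolumeSnapshot is not being reconciled, which matches the "Waiting for CSI driver to reconcile volumesnapshot" message Velero logs.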

shubham-pampattiwar commented 2 years ago

+1 on what @draghuram is suggesting here.

vijay-yadav-3 commented 2 years ago

@draghuram "Do you know if Velero is taking CSI snapshots of EBS volumes or native EBS snapshots?" Velero is taking native EBS snapshots.

"Let me know if you need help with creating VolumeSnapshot manually." Yes, I would like to test this by creating a VolumeSnapshot manually. Please let me know how to do that; I will also try it out myself.

blackpiglet commented 2 years ago

@vijay-yadav-3 https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/5e1fcd3e915d62d3b091c6de780ff9e6816f3a7b/pkg/driver/controller.go#L430-L440

After checking the EFS CSI driver code, I think it doesn't support the snapshot function yet.

midhuns3343 commented 2 years ago

OK. What is happening is that Velero creates a VolumeSnapshot resource and expects to see a corresponding VolumeSnapshotContent show up, but that is not happening here. The snapshot controller creates the VolumeSnapshotContent when it sees a VolumeSnapshot resource, so there must be some problem with it. You said EBS volumes are getting backed up. Do you know if Velero is taking CSI snapshots of EBS volumes or native EBS snapshots?

In any case, you should verify that the snapshot controller is properly set up by manually creating a VolumeSnapshot and checking that a corresponding VolumeSnapshotContent is created. Until this succeeds, Velero CSI backups will not work. Let me know if you need help with creating a VolumeSnapshot manually.

Hi @draghuram, can you please explain how to manually create a VolumeSnapshot and verify that a corresponding VolumeSnapshotContent is created?

draghuram commented 2 years ago

Sure, I will post my comments in #5436.