vmware-tanzu / velero-plugin-for-vsphere

Plugin to support Velero on vSphere
Other
59 stars 49 forks source link

failed to restore with error: "Failed to create a CloneFromSnapshot CR" #428

Open wanghuok opened 2 years ago

wanghuok commented 2 years ago

Describe the bug

restore failed

To Reproduce

  1. setup tkgi 1.13
  2. create a k8s cluster
  3. install velero and velero-plugin-for-vsphere:v1.3.0-rc4
  4. create 10 namespaces, 100 statefulsets with 1g pv each
  5. successfully to backup 10 namespaces , 2 namespaces each time
  6. delete all 10 namespaces
  7. failed to restore from previous backups:

Expected behavior

restore successfully

Troubleshooting Information

[Please refer to the Troubleshooting page and collect the required information]

found some errors like this by "velero restore logs ": time="2021-11-23T08:11:17Z" level=error msg="Failed to create a CloneFromSnapshot CR: Phase=Failed, err=cloneFromSnapshot: Failed at calling SnapshotManager CreateVolumeFromSnapshotWithMetadata with peId pvc:velero-1g-2/appdb-postgres-wl-12-0, err: Create download cr failed.\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshot.func2\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:701\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:217\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:230\nk8s.io/apimachinery/pkg/util/wait.WaitForWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:657\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:591\nk8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:563\nk8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:552\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshot\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:685\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.DataManagerProtectedEntity.Overwrite\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/data_manager_protected_entity.go:21\ngithub.com/vmware-tanzu/astrolabe/pkg/pvc.(*PVCProtectedEntityTypeManager).CreateFromMetadata\n\t/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.4.1-0.20210813185044-12eb18c3f6d5/pkg/pvc/pvc_protected_entity_type_manager.go:227\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshotWithMetadata\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:776\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshot\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:311\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).syncCloneFromSnapshotByKey\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:617\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshotWorker\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:562\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nFailed to get the source snapshot PE for peID ivd:12faa9ba-bb5c-4d31-b860-dd7c58f911a7:963f2a93-d505-434b-a2f8-134075a65177\ngithub.com/vmware-tanzu/astrolabe/pkg/pvc.(*PVCProtectedEntityTypeManager).CreateFromMetadata\n\t/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.4.1-0.20210813185044-12eb18c3f6d5/pkg/pvc/pvc_protected_entity_type_manager.go:231\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshotWithMetadata\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:776\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshot\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:311\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).syncCloneFromSnapshotByKey\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:617\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshotWorker\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:562\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nError creating volume from metadata\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshotWithMetadata\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:779\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshot\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:311\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).syncCloneFromSnapshotByKey\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:617\ngithub.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshotWorker\n\t/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:562\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371" cmd=/plugins/velero-plugin-for-vsphere logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/plugin/restore_pvc_action_plugin.go:208" pluginName=velero-plugin-for-vsphere restore=velero/hu-rs-1 it said " Failed to get the source snapshot PE for peID ivd:12faa9ba-bb5c-4d31-b860-dd7c58f911a7:963f2a93-d505-434b-a2f8-134075a65177". but i checked the ivd is exists.

found some erros like this by "velero restore describe --details": error preparing persistentvolumeclaims/velero-1g-2/appdb-postgres-wl-72-0: rpc error: code = Unknown desc = Failed to create a CloneFromSnapshot CR: Phase=Failed, err=cloneFromSnapshot: Failed at calling SnapshotManager CreateVolumeFromSnapshotWithMetadata with peId pvc:velero-1g-2/appdb-postgres-wl-72-0, err: PVC "appdb-postgres-wl-72-0" did not become Bound: PersistentVolumeClaims [appdb-postgres-wl-72-0] not all in phase Bound within 3m0s Error creating volume from metadata github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr.(*SnapshotManager).CreateVolumeFromSnapshotWithMetadata /go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:779 github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshot /go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller.go:311 github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).syncCloneFromSnapshotByKey /go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:617 github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver.(*backupDriverController).cloneFromSnapshotWorker /go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:562 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1 /go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:155 k8s.io/apimachinery/pkg/util/wait.BackoffUntil /go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:156 k8s.io/apimachinery/pkg/util/wait.JitterUntil /go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:133 k8s.io/apimachinery/pkg/util/wait.Until /go/pkg/mod/k8s.io/apimachinery@v0.22.0/pkg/util/wait/wait.go:90 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1371

[If applicable, add screenshots to help explain your problem.]

[Miscellaneous information that will assist in solving the issu hu-rs-1-details.log hu-rs-1.log e]

qiuming-best commented 2 years ago

@wanghuok Also I've encountered this error, I just retry the restore action, the restore is successful without error.

liyan-su commented 2 years ago

Hi dev, any update on this issue?

deepakkinni commented 2 years ago

It appears that there are mainly 2 kinds of errors from the attached logs:

  1. Create download cr failed. This was identified earlier, and fixed in https://github.com/vmware-tanzu/velero-plugin-for-vsphere/pull/434. The fix is available in 1.3.1, would you be willing to try it out?

  2. PVC did not become Bound: PersistentVolumeClaims not all in phase Bound within 3m0s Please provide the csi driver logs for this.

Our test team reported these errors in the past, but did not see the issue in 1.3.1 plugin version. If you do observe the issue please attach the backup-driver and data-manager logs as well.

liyan-su commented 2 years ago

Thanks @deepakkinni reply, we cann't provide backup-driver and data-manager logs due to we don't have the environment yet, we will verify this issue in tkgi1.14 if we can. thanks

jhuisss commented 2 years ago

It appears that there are mainly 2 kinds of errors from the attached logs:

  1. Create download cr failed. This was identified earlier, and fixed in Retry if there are errors retrieving download CR #434. The fix is available in 1.3.1, would you be willing to try it out?
  2. PVC did not become Bound: PersistentVolumeClaims not all in phase Bound within 3m0s Please provide the csi driver logs for this.

Our test team reported these errors in the past, but did not see the issue in 1.3.1 plugin version. If you do observe the issue please attach the backup-driver and data-manager logs as well.

I also hit the issue, and here is the logs you asked: error msg from data-manager pod logs: time="2022-10-11T09:58:39Z" level=info msg="vCenter connected, proceeding with operation" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/ivd/ivd_protected_entity_type_manager.go:309" 2022-10-11T09:58:39.230Z warning -[00025] [Originator@6876 sub=Default] Closing Response processing in unexpected state: 3 2022-10-11T09:58:39.243Z warning -[00013] [Originator@6876 sub=vimaccess] cannot get thumbprint: SSL error code '151441516', exception: 'Wrong X.509 Certificate format' 2022-10-11T09:58:39.570Z warning -[00021] [Originator@6876 sub=Default] Closing Response processing in unexpected state: 3 2022-10-11T09:58:39.582Z warning -[00013] [Originator@6876 sub=vimaccess] cannot get thumbprint: SSL error code '151441516', exception: 'Wrong X.509 Certificate format' ... time="2022-10-11T09:58:41Z" level=error msg="Failed at copying to remote repository" Local PEID="ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8" error="Failed to delete peinfo from bucket \"velero\": Unable to delete object \"plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8\" from bucket \"velero\": RequestError: send request failed\ncaused by: Get \"https://10.221.138.*:9000/velero?prefix=plugins%2Fvsphere-astrolabe-repo%2Fivd%2Fpeinfo%2Fivd%3Aa1ab225c-b98d-4f38-9200-9f0584cff5c5%3A098881e4-a9f4-4892-80f5-3c638932a0a8\": x509: certificate signed by unknown authority" error.file="/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.5.1/pkg/s3repository/repository_protected_entity.go:183" error.function=github.com/vmware-tanzu/astrolabe/pkg/s3repository.ProtectedEntity.deleteSnapshotComponents logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/dataMover/data_mover.go:120" ...

error msg from backup-driver pod logs:

time="2022-10-11T10:01:59Z" level=error msg="Failed to get the source snapshot PE for peID ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8" error="GetObject failed for bucket velero, key plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8: RequestError: send request failed\ncaused by: Get \"https://10.221.138.*:9000/velero/plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd%3Aa1ab225c-b98d-4f38-9200-9f0584cff5c5%3A098881e4-a9f4-4892-80f5-3c638932a0a8\": x509: certificate signed by unknown authority" error.file="/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.5.1/pkg/s3repository/repository_protected_entity_type_manager.go:176" error.function="github.com/vmware-tanzu/astrolabe/pkg/s3repository.(ProtectedEntityTypeManager).GetProtectedEntity" logSource="/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.5.1/pkg/pvc/pvc_protected_entity_type_manager.go:222" time="2022-10-11T10:01:59Z" level=error msg="Error creating volume from metadata" error="Failed to get the source snapshot PE for peID ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8: GetObject failed for bucket velero, key plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8: RequestError: send request failed\ncaused by: Get \"https://10.221.138.:9000/velero/plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd%3Aa1ab225c-b98d-4f38-9200-9f0584cff5c5%3A098881e4-a9f4-4892-80f5-3c638932a0a8\": x509: certificate signed by unknown authority" error.file="/go/pkg/mod/github.com/vmware-tanzu/astrolabe@v0.5.1/pkg/s3repository/repository_protected_entity_type_manager.go:176" error.function="github.com/vmware-tanzu/astrolabe/pkg/s3repository.(ProtectedEntityTypeManager).GetProtectedEntity" logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/snapshotmgr/snapshot_manager.go:778" time="2022-10-11T10:01:59Z" level=error msg="cloneFromSnapshot nginx-example/577b07bd-a3f0-4341-b29e-94060e2fbb23 failed: Error creating volume from metadata: Failed to get the source snapshot PE for peID ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8: GetObject failed for bucket velero, key plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:a1ab225c-b98d-4f38-9200-9f0584cff5c5:098881e4-a9f4-4892-80f5-3c638932a0a8: RequestError: send request failed\ncaused by: Get \"https://10.221.138.:9000/velero/plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd%3Aa1ab225c-b98d-4f38-9200-9f0584cff5c5%3A098881e4-a9f4-4892-80f5-3c638932a0a8\": x509: certificate signed by unknown authority" controller=BackupDriverController logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/backupdriver/backup_driver_controller_base.go:619"