vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Taking a StatefulSet backup on the ACTIVE cluster and restoring it to the STANDBY cluster B using Velero gives an error #7737

Open kish5430 opened 2 months ago

kish5430 commented 2 months ago

What steps did you take and what happened: While working on an active EKS cluster, I deployed an application with three etcd pods and took a backup of them using Velero. Later, I switched to a standby cluster and attempted to restore the backup. Although the restore process completed successfully and the pods were deployed, they never reached the Running state because attaching the volumes to the etcd pods failed.

Command: velero backup create milvus-stg-east1-etcd-backup --selector 'app.kubernetes.io/name=etcd'
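
For reference, a minimal way to double-check that the backup captured the volume snapshots before restoring on the standby cluster (the backup name is the one above; the restore name matches the one that appears in the logs later in this thread):

$ # list the resources and volume snapshots recorded in the backup
$ velero backup describe milvus-stg-east1-etcd-backup --details

$ # on the standby cluster, restore from the same backup
$ velero restore create milvus-stg-east1-etcd-restore --from-backup milvus-stg-east1-etcd-backup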

What did you expect to happen: Volume attachment should happen and etcd pods run without any issue.

Etcd Pod logs: Warning FailedAttachVolume 101s (x11 over 34m) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7" : rpc error: code = Internal desc = Could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": InvalidVolume.NotFound: The volume 'vol-00c1e0e23881130c9' does not exist. status code: 400, request id: 4160e339-013b-4b3b-8f39-c3990cf66c2e

Here, volume 'vol-00c1e0e23881130c9' does not exist among the EBS volumes in AWS.
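
A quick way to confirm that from the AWS side is to query the volume directly (the volume ID is the one from the error above; the region is a placeholder for whichever region cluster B runs in, since EBS volumes are region-scoped):

$ # check whether the volume the attach-detach controller is looking for exists at all
$ aws ec2 describe-volumes --volume-ids vol-00c1e0e23881130c9 --region <region>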

Please find the attached Velero restore logs: velero_restore.txt

allenxu404 commented 2 months ago

What Velero version are you using? Could you provide us with more debug info by using the command from this doc?
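
Assuming the doc being referred to is the Velero troubleshooting guide, the debug bundle is usually gathered with something like the following (the backup and restore names are the ones from this thread):

$ # collect a debug tarball covering the relevant backup and restore
$ velero debug --backup milvus-stg-east1-etcd-backup --restore milvus-stg-east1-etcd-restore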

kish5430 commented 2 months ago

@allenxu404 Please let me know if any additional information is required.

allenxu404 commented 2 months ago

The log given above looks normal. The PV was successfully restored from the snapshot, as the following log messages show:

time="2024-04-25T05:46:33Z" level=info msg="Restoring persistent volume from snapshot." logSource="pkg/restore/restore.go:2453" restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-25T05:46:34Z" level=info msg="successfully restored persistent volume from snapshot" logSource="pkg/restore/pv_restorer.go:91" persistentVolume=pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7 providerSnapshotID=snap-0d4da2d4c9d3f2c0d restore=velero/milvus-stg-east1-etcd-restore

It seems that the VolumeId was not available to cluster B for some reason. I think you can troubleshoot this further by restoring the PV on the ACTIVE cluster instead of the STANDBY cluster B. I assume the restore will work in that case.
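
One possible explanation, offered here only as a hedged guess, is that the EBS volume created from the snapshot lives in a different region or account than cluster B's nodes, since EBS snapshots and volumes are region-scoped. A quick cross-check from the AWS side (the snapshot ID is taken from the restore log above; the region is a placeholder):

$ # confirm the source snapshot is visible in the target region and account
$ aws ec2 describe-snapshots --snapshot-ids snap-0d4da2d4c9d3f2c0d --region <region>

$ # confirm the volume Velero created from that snapshot also exists there
$ aws ec2 describe-volumes --filters Name=snapshot-id,Values=snap-0d4da2d4c9d3f2c0d --region <region>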

kish5430 commented 2 months ago

Hi @allenxu404, it is not working on the ACTIVE cluster either. I ran the Velero restore on the ACTIVE cluster and got the same issue. Thanks.

blackpiglet commented 2 months ago
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot-related CRDs are missing from the cluster.
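
If those CRDs really were missing, one common way to install them is from the kubernetes-csi/external-snapshotter project, roughly as sketched below (the URLs point at that repo's master branch and may move between releases):

$ # install the VolumeSnapshot CRDs shipped with external-snapshotter
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml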

kish5430 commented 2 months ago
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot related CRDs are missed from the cluster.

Hi @blackpiglet,

I have already installed the volume snapshot CRDs:

$ kubectl api-resources | grep -i 'volume'
persistentvolumeclaims     pvc                 v1                                  true     PersistentVolumeClaim
persistentvolumes          pv                  v1                                  false    PersistentVolume
k8spspvolumetypes                              constraints.gatekeeper.sh/v1beta1   false    K8sPSPVolumeTypes
volumesnapshotclasses      vsclass,vsclasses   snapshot.storage.k8s.io/v1          false    VolumeSnapshotClass
volumesnapshotcontents     vsc,vscs            snapshot.storage.k8s.io/v1          false    VolumeSnapshotContent
volumesnapshots            vs                  snapshot.storage.k8s.io/v1          true     VolumeSnapshot
volumeattachments                              storage.k8s.io/v1                   false    VolumeAttachment
podvolumebackups                               velero.io/v1                        true     PodVolumeBackup
podvolumerestores                              velero.io/v1                        true     PodVolumeRestore
volumesnapshotlocations    vsl                 velero.io/v1                        true     VolumeSnapshotLocation

Thanks
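
Since the CRDs are clearly present, a follow-up check worth trying (purely as a suggestion; the restore log earlier suggests this backup used native EBS snapshots rather than CSI snapshots) is whether any snapshot objects exist at all after the restore:

$ # see whether any VolumeSnapshot / VolumeSnapshotContent objects were restored
$ kubectl get volumesnapshots -A
$ kubectl get volumesnapshotcontents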

allenxu404 commented 2 months ago

@kish5430 Can you help verify the status of the associated PV and PVC to confirm they are functioning correctly? Additionally, can you check the AWS console to validate that the volume was created and is properly configured on the backend?
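
A hedged sketch of what that verification might look like (the PV name is the one from the restore log; the namespace, volume ID, and region are placeholders):

$ # confirm the restored PVCs are Bound and inspect the PV's AWS volume handle
$ kubectl get pvc -n <etcd-namespace>
$ kubectl describe pv pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7

$ # cross-check the volume ID from the PV spec against what actually exists in AWS
$ aws ec2 describe-volumes --volume-ids <volume-id-from-pv> --region <region>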