vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backup from one cluster and restore in another using rook ceph storage. #2972

Closed pavanfhw closed 3 years ago

pavanfhw commented 3 years ago

What steps did you take and what happened: I installed Velero in two clusters that both use rook-ceph storage. I can take backups and restore them to the same cluster after deleting all resources in a namespace. But when restoring to another cluster, the PVC cannot be provisioned by rook-ceph. Both rook-ceph and Velero were deployed the same way in both clusters. I tried both directions, backing up in cluster 1 and restoring in cluster 2 and vice versa; in both cases the error was the same. PVC provisioning failure:

rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-dbc67ffdc-vj2r8_0550728b-5527-41d3-ad5a-c0c6307056b7 failed to provision volume with StorageClass "rook-ceph-block-storage": rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.84f0f012-fe89-11ea-bbf7-2ee9609329c4"

What did you expect to happen: To be able to take a backup of rook-ceph volumes in one cluster and restore it in another (is this known to be possible?).

Anything else you would like to add: Using Rook 1.4.3

Velero was installed with the command:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0,velero/velero-plugin-for-csi:v0.1.0  \
    --bucket velero-backup \
    --secret-file ./velero-credentials \
    --use-volume-snapshots=true \
    --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://mys3URL/,insecureSkipTLSVerify=true \
    --snapshot-location-config region=default \
    --features=EnableCSI

WaterKnight1998 commented 3 years ago

@pavanfhw did you manage to make it work?

pavanfhw commented 3 years ago

@WaterKnight1998 no, I got it working with restic though. The problem above was using the csi-plugin.

dsu-igeek commented 3 years ago

Is the Ceph cluster shared or are these different storage systems? The snapshot is going to be a Ceph snapshot and is only accessible within the Ceph cluster.

pavanfhw commented 3 years ago

@dsu-igeek yes, they are different storage systems and separate Ceph clusters. I was simulating the situation where one cluster blew up and Velero backups would have to restore it on a completely new cluster.

dsu-igeek commented 3 years ago

So if you're using the CSI plugin, it currently takes a snapshot using the Ceph snapshotting facility. Ceph snapshots are stored within the Ceph cluster, so when that cluster is lost or removed, all snapshot data is lost as well. Unfortunately, CSI snapshots do not specify whether the snapshot is "durable" (survives loss of primary storage) or not. You should use a Restic backup of your data; the snapshots on the Ceph cluster are not really backups.
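
For readers looking for a starting point, the Restic route can be sketched roughly as below. This is a minimal sketch under assumptions, not a tested procedure from this thread: it assumes Velero was installed with Restic support enabled (the `--use-restic` install flag in Velero releases of this era), that both clusters point at the same backup bucket, and that every namespace, pod, volume, and backup name passed in is a placeholder you supply. The function names are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: thin wrappers around kubectl/velero for a Restic-based backup
# in one cluster and a restore in another. All argument values are
# placeholders chosen by the caller.

# Opt a pod's volumes into Restic backup via Velero's opt-in annotation.
# usage: annotate_pod_volumes <namespace> <pod> <comma-separated-volume-names>
annotate_pod_volumes() {
  kubectl -n "$1" annotate pod "$2" \
    "backup.velero.io/backup-volumes=$3" --overwrite
}

# Back up one namespace and ask Velero not to take volume snapshots,
# so annotated volumes are copied by Restic into the object-store bucket.
# usage: backup_namespace <backup-name> <namespace>
backup_namespace() {
  velero backup create "$1" --include-namespaces "$2" \
    --snapshot-volumes=false --wait
}

# Restore a backup; run this with kubectl/velero pointed at the second cluster.
# usage: restore_backup <backup-name>
restore_backup() {
  velero restore create --from-backup "$1" --wait
}
```

For example, `annotate_pod_volumes demo web-0 data` followed by `backup_namespace demo-backup demo` in the first cluster, then `restore_backup demo-backup` in the second. Annotating a pod directly does not survive the pod being recreated, so for long-lived workloads the annotation usually belongs in the pod template of the Deployment or StatefulSet.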

pavanfhw commented 3 years ago

Understood. To clarify, is the CSI plugin not viable for disaster recovery backups? Is there any intention to make it viable?

joshkwedar commented 3 years ago

Same questions as @pavanfhw. There isn't much documentation on this limitation, if CSI snapshots are indeed not meant for DR scenarios when you've deployed independent rook-ceph clusters.

dsu-igeek commented 3 years ago

I added a warning note in the README - https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/main/README.md

We will be addressing this in a future release but for the moment I recommend you use a Restic backup.

psavva commented 3 years ago

@pavanfhw Could you share your instructions, scripts and anything else to demonstrate how to do a full backup of a Kubernetes cluster which uses rook-ceph?

I am also looking to utilize Velero as my backup and restore solution for a new cluster, in case something catastrophic happens. It would also be good for testing regions and simulating issues on a test region prior to taking it into production...

Are you able to share the steps/scripts/instructions for how you achieved it? Maybe a blog writeup somewhere?

gitsridhar commented 2 years ago

I saw the same problem with OCP 4.9 and ODF 4.9 on the ppc64le platform. The problem went away when the deletion policy was changed from Delete to Retain on the VolumeSnapshotClass objects (there are two of them). The restore then completed without any errors.
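
The policy change described above could be applied along these lines. This is a hedged sketch: the function names are hypothetical, the class name is whatever your cluster actually uses, and it assumes the cluster accepts an in-place patch of `deletionPolicy` (if it does not, the class would need to be recreated). Note that Retain only keeps the snapshot inside the same Ceph cluster; per the earlier comments, it does not make the snapshot available to a separate Ceph cluster.

```shell
#!/bin/sh
# Sketch only: inspect and change the deletion policy of VolumeSnapshotClass
# objects. Class names are placeholders supplied by the caller.

# List every VolumeSnapshotClass with its CSI driver and deletion policy.
list_snapshot_classes() {
  kubectl get volumesnapshotclass \
    -o custom-columns=NAME:.metadata.name,DRIVER:.driver,POLICY:.deletionPolicy
}

# Switch one class to Retain so the underlying snapshot is kept when the
# VolumeSnapshot object is deleted.
# usage: set_retain_policy <volumesnapshotclass-name>
set_retain_policy() {
  kubectl patch volumesnapshotclass "$1" --type merge \
    -p '{"deletionPolicy":"Retain"}'
}
```

Run `list_snapshot_classes` first to find the class names, then call `set_retain_policy <name>` once per class that backs the volumes being snapshotted.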

Lukeesec commented 1 year ago

@pavanfhw I'm also interested in the code used to back up from one cluster and restore to another with Rook. Just bumping this in case you didn't see the previous message from @psavva.