Backup from one cluster and restore in another using rook ceph storage.

pavanfhw commented 3 years ago

What steps did you take and what happened: I installed Velero in two clusters that both use Rook Ceph storage. I can take backups and restore them to the same cluster after deleting all resources in a namespace. When I restore to the other cluster, however, Rook Ceph cannot provision the PVC. Both Rook Ceph and Velero were deployed the same way in both clusters. I tried both directions (backup in cluster 1 and restore in cluster 2, and vice versa) and got the same error each time. The PVC provisioning failure:

rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-dbc67ffdc-vj2r8_0550728b-5527-41d3-ad5a-c0c6307056b7 failed to provision volume with StorageClass "rook-ceph-block-storage": rpc error: code = Internal desc = key not found: no snap source in omap for "csi.snap.84f0f012-fe89-11ea-bbf7-2ee9609329c4"

What did you expect to happen: To be able to back up Rook Ceph volumes in one cluster and restore them in another (is this known to be possible?).

The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero https://gist.github.com/pavanfhw/91148b5ba1126fcca771cc447de7c957
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml https://gist.github.com/pavanfhw/50b9e74abf6ca944cb36d2e970fa2d12
velero backup logs <backupname> https://gist.github.com/pavanfhw/a3244002f96dd40bcbc683c3cfa87bf1
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml https://gist.github.com/pavanfhw/33a8771de9d64b4ddda075e3f061d6cc
velero restore logs <restorename> https://gist.github.com/pavanfhw/67a4ff961ded1884bf6a7a1e8289cb70

Anything else you would like to add: Using Rook 1.4.3

Velero was installed with the command:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.0.0,velero/velero-plugin-for-csi:v0.1.0  \
    --bucket velero-backup \
    --secret-file ./velero-credentials \
    --use-volume-snapshots=true \
    --backup-location-config region=default,s3ForcePathStyle="true",s3Url=https://mys3URL/,insecureSkipTLSVerify=true \
    --snapshot-location-config region=default \
    --features=EnableCSI

Environment:

Velero version (use velero version): Client: Version: v1.4.2 Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5 Server: Version: v1.4.2
Velero features (use velero client config get features): features: EnableCSI
Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6+k3s1", GitCommit:"6f56fa1d68a5a48b8b6fdefa8eb7ead2015a4b3a", GitTreeState:"clean", BuildDate:"2020-07-16T20:46:15Z", GoVersion:"go1.13.11", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6+k3s1", GitCommit:"6f56fa1d68a5a48b8b6fdefa8eb7ead2015a4b3a", GitTreeState:"clean", BuildDate:"2020-07-16T20:46:15Z", GoVersion:"go1.13.11", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version: k3s version v1.18.6+k3s1 (6f56fa1d)
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): K3OS

WaterKnight1998 commented 3 years ago

@pavanfhw did you manage to make it work?

pavanfhw commented 3 years ago

@WaterKnight1998 no, I got it working with restic though. The problem above was using the csi-plugin.

dsu-igeek commented 3 years ago

Is the Ceph cluster shared or are these different storage systems? The snapshot is going to be a Ceph snapshot and is only accessible within the Ceph cluster.

pavanfhw commented 3 years ago

@dsu-igeek yes, they are separate storage systems and separate Ceph clusters. I was simulating the situation where one cluster is lost and Velero backups are used to restore it on a completely new cluster.

dsu-igeek commented 3 years ago

So if you're using the CSI plugin, it currently takes the snapshot using the Ceph snapshotting facility. Ceph snapshots are stored within the Ceph cluster, so when that cluster is lost or removed, all snapshot data is lost as well. Unfortunately, CSI snapshots do not specify whether a snapshot is "durable" (survives loss of primary storage) or not. You should use a Restic backup of your data; the snapshots on the Ceph cluster are not really backups.
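As a rough illustration of the Restic route recommended above, the sketch below annotates a pod volume for file-level backup, creates a backup in the source cluster, and restores it in the target cluster. It assumes both Velero installations point at the same object-storage bucket and were installed with Restic enabled (for the 1.4.x releases used in this thread, `velero install --use-restic`). The namespace, pod, volume, backup, and kubeconfig context names are hypothetical placeholders, not values from this issue. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: every name below is a hypothetical placeholder.
NAMESPACE="${NAMESPACE:-my-app}"          # namespace to back up
POD="${POD:-my-app-0}"                    # pod that mounts the PVC
VOLUME="${VOLUME:-data}"                  # volume name from the pod spec (not the PVC name)
BACKUP="${BACKUP:-my-app-backup}"         # Velero backup name
SRC_CONTEXT="${SRC_CONTEXT:-cluster-1}"   # kubeconfig context of the source cluster
DST_CONTEXT="${DST_CONTEXT:-cluster-2}"   # kubeconfig context of the target cluster

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

backup_with_restic() {
    # Opt the pod volume in to Restic (file-level) backup.
    run kubectl --context "$SRC_CONTEXT" -n "$NAMESPACE" annotate pod "$POD" \
        "backup.velero.io/backup-volumes=$VOLUME" --overwrite
    # Back up the namespace without taking CSI/Ceph volume snapshots.
    run velero --kubecontext "$SRC_CONTEXT" backup create "$BACKUP" \
        --include-namespaces "$NAMESPACE" --snapshot-volumes=false --wait
}

restore_from_backup() {
    # Restore into the second cluster from the shared bucket.
    run velero --kubecontext "$DST_CONTEXT" restore create --from-backup "$BACKUP" --wait
}
```

Calling `backup_with_restic` followed by `restore_from_backup` prints the three commands for review. Because Restic copies the volume contents into the bucket, the restore does not depend on snapshot data held inside the source Ceph cluster.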

pavanfhw commented 3 years ago

Understood. To clarify, is the CSI plugin not viable for disaster-recovery backups? Is there any intention to make it viable?

joshkwedar commented 3 years ago

Same question as @pavanfhw. There is not much documentation on this limitation, namely that CSI snapshots are not meant for DR scenarios if you’ve deployed independent rook-ceph clusters.

dsu-igeek commented 3 years ago

I added a warning note in the README - https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/main/README.md

We will be addressing this in a future release but for the moment I recommend you use a Restic backup.

psavva commented 3 years ago

@pavanfhw Could you share your instructions, scripts and anything else to demonstrate how to do a full backup of a Kubernetes cluster which uses rook-ceph?

I am also looking to utilize Velero as my backup and restore solution for recovering to a new cluster in case something catastrophic happens. It would also be good for testing regions and simulating issues in a test region before taking it into production...

Are you able to share your steps/scripts/instructions how you achieved it? Maybe a blog writeup somewhere?

gitsridhar commented 2 years ago

I saw the same problem with OCP 4.9 and ODF 4.9 on the ppc64le platform. The problem went away when the 'Delete Policy' was changed from Delete to Retain for the VolumeSnapshotClass objects (there are two of them). The restore completed without any errors.
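For anyone wanting to try the same change, here is a minimal sketch of inspecting and patching the policy with `kubectl`. It assumes the setting referred to is the `deletionPolicy` field of the VolumeSnapshotClass and that the field can be patched in place on your cluster. The class name in the usage note is a placeholder; use the names reported by the listing command. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: assumes "Delete Policy" means the VolumeSnapshotClass
# deletionPolicy field, and that it is patchable on your cluster.

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

list_snapshot_class_policies() {
    # Show each VolumeSnapshotClass together with its current policy.
    run kubectl get volumesnapshotclass \
        -o custom-columns=NAME:.metadata.name,POLICY:.deletionPolicy
}

set_retain_policy() {
    # $1: name of the VolumeSnapshotClass to switch to Retain.
    run kubectl patch volumesnapshotclass "$1" --type merge \
        -p '{"deletionPolicy":"Retain"}'
}
```

For example, `set_retain_policy csi-rbdplugin-snapclass` (a placeholder name) prints the patch command for that class; repeat it for each VolumeSnapshotClass that Velero uses.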

Lukeesec commented 1 year ago

@pavanfhw I'm also interested in the code used to back up from one cluster and restore to another with Rook. Just looking to bump this in case you didn't see the previous message from psavva.

vmware-tanzu / velero

Backup from one cluster and restore in another using rook ceph storage. #2972