Error with --snapshot-move-data Flag in Velero with Rook-Ceph Storage

jayrajmeh0 commented 3 months ago

Description

I am encountering an issue while attempting to back up and restore pods using Velero with Rook-Ceph as the storage backend. The error occurs when using the --snapshot-move-data flag for CSI snapshot data movement. Detailed steps and error messages are provided below.

Steps to Reproduce

Setup Environment:
- Velero configured with Rook-Ceph as the storage backend.
- Kubernetes cluster managed by Kubespray.
Backup Pod:
- Execute the backup command with the --snapshot-move-data flag: velero backup create finaltest2 --snapshot-move-data
Encounter Error:
- Backup process fails with the following error message: Error: message: /fail to get DataUpload for backup velero/finaltest2 by operation ID du-ab80258b-f5a4-4495-80b8-39dfa0b8fd7c.cbd89117-444a-4c7e653c5: more than one DataUpload found operationID du-ab80258b-f5a4-4495-80b8-39dfa0b8fd7c.cbd89117-444a-4c7e653c5

Current Issue

The backup and restore processes fail with an error indicating that more than one DataUpload was found for the given operation ID. This occurs both during the initial backup and during subsequent restore attempts.

Expected Outcome

• Successful backup of pods using the --snapshot-move-data flag with Velero and Rook-Ceph.
• Successful restore of the backed-up data without encountering the DataUpload error.

Actual Outcome

• The backup process fails with an error reporting multiple DataUploads for the same operation ID.
• The restore process fails with the same error.

Additional Information

• Current Environment:
  - Kubernetes cluster managed by Kubespray.
  - Velero for backup and restore operations.
  - Rook-Ceph as the storage backend.
• Error Details:
  - Error during backup: Error: message: /fail to get DataUpload for backup velero/finaltest2 by operation ID du-ab80258b-f5a4-4495-80b8-39dfa0b8fd7c.cbd89117-444a-4c7e653c5: more than one DataUpload found operationID du-ab80258b-f5a4-4495-80b8-39dfa0b8fd7c.cbd89117-444a-4c7e653c5

Request for Assistance

I would appreciate guidance on the following:

- How to resolve the error about multiple DataUploads found for the same operation ID during backup and restore.
- Any additional configuration steps or best practices for using the --snapshot-move-data flag with Velero and Rook-Ceph.
- Whether this is a known issue, and whether any patches or updates are available to address it.

Screenshot_508 Screenshot_517

Lyndon-Li commented 3 months ago

Please share the velero debug bundle by running velero debug.

jayrajmeh0 commented 3 months ago

bundle-2024-07-30-11-41-59.tar.gz

This is the bundle you need, but I want to show you the actual issue: two RBD images were generated, one through the PVC and storage class, and the other from a VolumeSnapshot YAML taken by the CSI driver. Both images belong to the same PVC, which is causing the problem. I'm not sure how to resolve this issue.

Screenshot_670 Screenshot_671

Here, the 'csi-vol-*' RBD image was generated through the standard flow of our pod -> PVC -> storage class -> PV. Screenshot_672

And the 'csi-snap-*' RBD image was generated through the standard flow of our pod -> PVC -> VolumeSnapshot -> VolumeSnapshotClass. Screenshot_673

Is the Velero snapshot movement functioning with Rook-based Ceph storage?

Lyndon-Li commented 3 months ago

It looks like the same PVC, ticketnfs/nfs-pvc, has been backed up twice at the same time. However, the attached debug bundle does not contain the logs for the backup that started around 2024-07-11T07; the logs in the bundle start from 2024-07-29T11.

From the above screenshot, I can also see that the PVC development2/mongo-ticketsdeployment2-pvc is backed up twice by the same backup.

This is the cause of the issue, since the DataUpload is named from the backup UID and the PVC UID (backup UID.PVC UID), so two uploads of the same PVC in one backup end up with the same operation ID.
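
To make the collision concrete, here is a minimal Python sketch. It is not Velero's code: the `DataUpload` record, the `operation_id` helper, and `find_data_upload` are hypothetical names, and the `du-<backup UID>.<PVC UID>` format is only assumed from the ID shown in the error message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataUpload:
    """Hypothetical stand-in for a DataUpload custom resource."""
    name: str
    operation_id: str


def operation_id(backup_uid: str, pvc_uid: str) -> str:
    # Assumed format, modelled on the ID in the reported error message.
    return f"du-{backup_uid}.{pvc_uid}"


def find_data_upload(uploads: list[DataUpload], op_id: str) -> DataUpload:
    """Return the single DataUpload for op_id, or raise if the lookup is ambiguous."""
    matches = [u for u in uploads if u.operation_id == op_id]
    if not matches:
        raise LookupError(f"no DataUpload found operationID {op_id}")
    if len(matches) > 1:
        raise LookupError(f"more than one DataUpload found operationID {op_id}")
    return matches[0]
```

If the same PVC is processed twice within one backup, both records carry the same operation ID, and the lookup can no longer pick one.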

Lyndon-Li commented 3 months ago

@jayrajmeh0 Please share the logs from when the backup was running. A PVC should be backed up only once within the same backup, so we need the logs to troubleshoot why this rule was broken.

jayrajmeh0 commented 3 months ago

Okay @Lyndon-Li, this is the latest bundle.

bundle-2024-07-30-18-06-07.tar.gz

Here, I created a new namespace with all the necessary configurations. After that, I tried to back up the 'media' namespace again. I've shared the related bundle with you.

Screenshot_695

Lyndon-Li commented 2 months ago

Since 1.14, the CSI plugin has been an in-tree plugin, so the out-of-tree velero-plugin-for-csi should not be installed. In your environment, a very old out-of-tree CSI plugin is installed:

                        "containerID": "containerd://42ccbb1dff9dbecf3907c275357bd3c4640a3ba2bd8f3ed66ea682673d1da12c",
                        "image": "docker.io/velero/velero-plugin-for-csi:v0.1.1",
                        "imageID": "docker.io/velero/velero-plugin-for-csi@sha256:8300fd11a8cfbd638a95d8cdd40204f15099cbf7ae2b4c22d7ae790dcfce97ef",
                        "lastState": {},
                        "name": "velero-velero-plugin-for-csi",

As a result, the BIA (BackupItemAction) for the same PVC is executed twice, because two plugins are registered. And since the plugin framework always delegates to the V2 BIA, the BIAV2 is called twice for the same PVC, which creates two identical DUCRs (DataUpload CRs).

Lyndon-Li commented 2 months ago

@jayrajmeh0 To fix the problem, just modify the Velero deployment and remove the above out-of-tree CSI plugin.
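
As a rough illustration of that change, the Python sketch below drops the plugin from a Deployment manifest. It assumes the plugin was added as an init container on the Velero Deployment (as in a standard install) and that the manifest has already been loaded into a dict, for example from the JSON output of `kubectl get deployment`. The helper name `remove_plugin_init_containers` and the AWS container name are hypothetical.

```python
import copy


def remove_plugin_init_containers(deployment: dict, image_substring: str) -> dict:
    """Return a copy of a Deployment manifest without the matching init containers."""
    result = copy.deepcopy(deployment)
    pod_spec = result["spec"]["template"]["spec"]
    pod_spec["initContainers"] = [
        container
        for container in pod_spec.get("initContainers", [])
        if image_substring not in container.get("image", "")
    ]
    return result


# Trimmed-down manifest using the images mentioned in this thread.
# The AWS container name is made up for the example.
deployment = {
    "spec": {"template": {"spec": {"initContainers": [
        {"name": "velero-velero-plugin-for-csi",
         "image": "docker.io/velero/velero-plugin-for-csi:v0.1.1"},
        {"name": "velero-velero-plugin-for-aws",
         "image": "velero/velero-plugin-for-aws:latest"},
    ]}}}
}

cleaned = remove_plugin_init_containers(deployment, "velero-plugin-for-csi")
print([c["image"] for c in cleaned["spec"]["template"]["spec"]["initContainers"]])
```

The helper returns a modified copy and leaves the input untouched, so the result can be reviewed before it is applied to the cluster.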

Lyndon-Li commented 2 months ago

Meanwhile, we need to enhance the plugin code with a safeguard that blocks this kind of plugin registration. Specifically, Velero supports multiple versions (v1, v2, etc.) of the same plugin, and each plugin has a unique name that is the same across all versions. Then:

- For the same plugin name and the same version, if two entities are registered at the same time (e.g., one in-tree and the other out-of-tree), the second registration fails, since their identities are the same to the plugin framework.
- However, for the same plugin name, if two entities with different versions are registered at the same time, both registrations pass, since their identities are not the same.
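
The behaviour described in these two cases can be modelled with a toy registry. This is a sketch only, not the real plugin framework: `PluginRegistry` is a hypothetical class that keys each entity on the pair (kind, name), where the kind string already includes the version.

```python
class PluginRegistry:
    """Toy registry keyed on (kind, name); the kind string encodes the version."""

    def __init__(self):
        self._registered = set()

    def register(self, kind: str, name: str) -> None:
        key = (kind, name)
        if key in self._registered:
            raise ValueError(f"plugin {name!r} of kind {kind} is already registered")
        self._registered.add(key)


registry = PluginRegistry()
name = "velero.io/csi-pvc-backupper"

# In-tree V2 plugin registers first.
registry.register("BackupItemActionV2", name)

# An out-of-tree V1 plugin with the same name has a different identity,
# so this check lets it through.
registry.register("BackupItemAction", name)

# A second V2 registration has the same identity and is rejected.
try:
    registry.register("BackupItemActionV2", name)
except ValueError as err:
    print("rejected:", err)
```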

This means the current checking mechanism is not enough. As a solution, we can do the following:

- When a plugin entity is registered, we get its kind and name.
- We enumerate the existing entities with the same kind.
- If the existing ones already contain one with the same name, we fail the registration.
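
One way to read these steps is sketched below in Python. It rests on an assumption that is not stated in the thread: that "the same kind" means the kind with its version suffix ignored, so that a V1 and a V2 action compare as the same kind. `base_kind` and `StrictPluginRegistry` are hypothetical names, not part of Velero.

```python
import re


def base_kind(kind: str) -> str:
    # Assumption: versioned kinds carry a "V<n>" suffix, e.g. BackupItemActionV2.
    return re.sub(r"V\d+$", "", kind)


class StrictPluginRegistry:
    """Toy registry that rejects a name already used by any version of a kind."""

    def __init__(self):
        self._names_by_kind = {}

    def register(self, kind: str, name: str) -> None:
        # Enumerate the names already registered for this kind, ignoring the version.
        names = self._names_by_kind.setdefault(base_kind(kind), set())
        if name in names:
            raise ValueError(
                f"plugin {name!r} is already registered for kind {base_kind(kind)}"
            )
        names.add(name)
```

With this check, registering `BackupItemActionV2` and then `BackupItemAction` under the same name fails on the second call, while a plugin of a different kind can still reuse the name.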

The above enhancement is based on the assumption that, for the same plugin, the name is unique across all versions. For example:

A BIA V2 plugin for PVC by velero-csi: kind=BackupItemActionV2, name=velero.io/csi-pvc-backupper
A BIA V1 plugin for PVC by velero-csi: kind=BackupItemAction, name=velero.io/csi-pvc-backupper

The name is velero.io/csi-pvc-backupper for both V1 and V2.

@sseago @reasonerjt Let me know if you agree on the above judgement and solution. cc @blackpiglet

reasonerjt commented 2 months ago

IMO, it's relatively safe to consider it a conflict if the name collides.

jayrajmeh0 commented 2 months ago

@Lyndon-Li, @reasonerjt, thank you for your response. Here is an update on my progress based on your suggestion. I tried the Velero installation in several scenarios, and the outputs are below:

Case 1) Remove "velero/velero-plugin-for-csi:v0.1.1" Screenshot_737 However, this resulted in the following error: Screenshot_738 Here is the related bundle file: bundle-2024-08-06-13-10-55.tar.gz

Case 2) Remove "velero/velero-plugin-for-aws:latest" Screenshot_739 However, this resulted in the following error: Screenshot_740 Here is the related bundle file: bundle-2024-08-06-14-37-43.tar.gz

Case 3) Remove both "velero/velero-plugin-for-aws:latest" & "velero/velero-plugin-for-csi:v0.1.1" Screenshot_741 This did not execute the node agent flag without the plugin flags.

Case 4) Add both "velero/velero-plugin-for-aws:latest" & "velero/velero-plugin-for-csi:v0.1.1" Screenshot_742 This resulted in the same error we discussed earlier: Screenshot_743 Here is the related bundle file: bundle-2024-08-06-14-45-44.tar.gz

Lyndon-Li commented 2 months ago

Case 1 is the right approach, but from the log bundle it looks like the node-agent is not installed or not running. Please double-check.

jayrajmeh0 commented 2 months ago

@Lyndon-Li Thank you for your response. However, I have now installed the node-agent, and it is running. I attempted to back up, but I'm encountering a different error.

Screenshot_752 Screenshot_749 Screenshot_751 Screenshot_750

Here is the related bundle file: bundle-2024-08-06-17-38-42.tar.gz

Lyndon-Li commented 2 months ago

@jayrajmeh0 This is a known issue for 1.14, see #7898. Additionally, if you see problems different from the original one, please open a new GH issue. We need to use the current issue to track the plugin enhancement mentioned above.

Lyndon-Li commented 2 months ago

I just ran a small test, and the solution mentioned in https://github.com/vmware-tanzu/velero/issues/8058#issuecomment-2270420551 doesn't work. The BIA and RIA for the same object (e.g., a PVC) may share the same name as well, in which case we would wrongly block a valid registration if we compared names only. I think the Kind in the plugin framework currently represents both the kind/type itself and the version, so it is difficult to fix this problem within the plugin framework.
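
A small Python sketch of the false block, under stated assumptions: `register_by_name_only` is a hypothetical helper that compares names alone, and `example.io/pvc-action` is a made-up plugin name shared by a backup action and a restore action.

```python
def register_by_name_only(registered_names: set, kind: str, name: str) -> None:
    """Reject any registration whose name is already in use, whatever its kind."""
    if name in registered_names:
        raise ValueError(f"{kind} {name!r} blocked: name already registered")
    registered_names.add(name)


names = set()

# A backup action and a restore action for the same object share a name.
register_by_name_only(names, "BackupItemActionV2", "example.io/pvc-action")

try:
    # This is a legitimate, different plugin, yet the name-only check rejects it.
    register_by_name_only(names, "RestoreItemAction", "example.io/pvc-action")
except ValueError as err:
    print("false block:", err)
```

Telling the two apart would require comparing the kind without its version, which is exactly what the current Kind value does not expose separately.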

Since there is no quick fix, I will remove this issue from 1.15 and let's discuss how to cope with this kind of problem in the long term. cc @sseago @reasonerjt @blackpiglet

vmware-tanzu / velero

Error with --snapshot-move-data Flag in Velero with Rook-Ceph Storage #8058