Open jayrajmeh0 opened 3 months ago
Please share the velero debug bundle by running velero debug.
bundle-2024-07-30-11-41-59.tar.gz
This is the bundle you asked for, but I want to show you the actual issue: two CSI images are generated, one through the PVC and storage class, and the other from a VolumeSnapshot YAML taken by the CSI driver. Both images share the same PVC, which is causing the problem. I'm not sure how to resolve this issue.
**Here, the 'csi-vol-***' RBD image was generated through the standard flow of our pod -> PVC -> storage class -> PV.
**And the 'csi-snap-***' RBD image was generated through the standard flow of our pod -> PVC -> VolumeSnapshot -> VolumeSnapshotClass.
Is the Velero snapshot movement functioning with Rook-based Ceph storage?
Looks like the same PVC ticketnfs/nfs-pvc has been backed up twice at the same time. However, in the attached debug bundle I don't see the logs for the backup that started around 2024-07-11T07; the logs in the bundle start from 2024-07-29T11.
From the above screenshot, I can also see that PVC development2/mongo-ticketsdeployment2-pvc is backed up twice by the same backup.
This is the cause of the issue, since the DataUpload is named as backup UID.PVC UID, so two uploads of the same PVC in the same backup end up with the same identifier.
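To make the collision concrete, here is a minimal sketch in Go. It is not Velero's actual code: the `dataUpload` type, the `operationID` and `findByOperationID` helpers, and the sample UIDs are all hypothetical, and the exact ID format Velero uses may differ from the plain `backup UID.PVC UID` join assumed here. It only shows why processing the same PVC twice in one backup leaves a lookup by operation ID with two matches.

```go
package main

import "fmt"

// dataUpload is a simplified, hypothetical stand-in for a DataUpload CR.
type dataUpload struct {
	Name        string
	OperationID string
}

// operationID joins the backup UID and PVC UID, as described in the thread.
// Assumption: the real format in Velero may carry extra prefixes.
func operationID(backupUID, pvcUID string) string {
	return backupUID + "." + pvcUID
}

// findByOperationID returns every DataUpload carrying the given operation ID.
func findByOperationID(all []dataUpload, id string) []dataUpload {
	var matches []dataUpload
	for _, du := range all {
		if du.OperationID == id {
			matches = append(matches, du)
		}
	}
	return matches
}

func main() {
	backupUID, pvcUID := "backup-uid-1", "pvc-uid-1" // made-up UIDs
	id := operationID(backupUID, pvcUID)

	// The same PVC was processed twice within one backup, so two
	// DataUploads carry an identical operation ID.
	uploads := []dataUpload{
		{Name: "upload-a", OperationID: id},
		{Name: "upload-b", OperationID: id},
	}

	if matches := findByOperationID(uploads, id); len(matches) > 1 {
		fmt.Printf("more than one DataUpload found for operation ID %s: %d\n", id, len(matches))
	}
}
```

Because the identifier depends only on the backup and the PVC, nothing distinguishes the second upload from the first, which matches the "more than one DataUpload found" failure reported in this issue.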
@jayrajmeh0 Please share the logs from when the backup was running. A PVC should be backed up only once within the same backup, so we need the logs to troubleshoot why this rule was broken.
Okay @Lyndon-Li, this is the latest bundle.
bundle-2024-07-30-18-06-07.tar.gz
Here, I created a new namespace with all the necessary configurations. After that, I tried to back up the 'media' namespace again. I've shared the related bundle with you.
Since 1.14, the CSI plugin has been an in-tree plugin, so the out-of-tree velero-plugin-for-csi should not be installed. In your environment, you have installed a very old out-of-tree CSI plugin:
"containerID": "containerd://42ccbb1dff9dbecf3907c275357bd3c4640a3ba2bd8f3ed66ea682673d1da12c",
"image": "docker.io/velero/velero-plugin-for-csi:v0.1.1",
"imageID": "docker.io/velero/velero-plugin-for-csi@sha256:8300fd11a8cfbd638a95d8cdd40204f15099cbf7ae2b4c22d7ae790dcfce97ef",
"lastState": {},
"name": "velero-velero-plugin-for-csi",
As a result, the BIA for the same PVC is executed twice because two plugins are registered. And since the plugin framework always delegates to the V2 BIA, the BIAV2 is called twice for the same PVC, which creates two identical DUCRs.
@jayrajmeh0 To fix the problem, just modify the Velero deployment and remove the above out-of-tree CSI plugin.
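The effect of the double registration can be sketched as follows. This is a toy model in Go, not the Velero plugin framework: the `itemAction` type, the `runActions` helper, and the source labels are hypothetical. Only the action name velero.io/csi-pvc-backupper and the PVC ticketnfs/nfs-pvc come from this thread.

```go
package main

import "fmt"

// itemAction is a hypothetical, simplified backup item action registration.
type itemAction struct {
	Source string // which plugin provided the action (label is illustrative)
	Name   string // registered action name
}

// runActions invokes every registered PVC backupper for one PVC and returns
// one entry per data-upload request that would be created.
func runActions(actions []itemAction, pvc string) []string {
	var created []string
	for _, a := range actions {
		if a.Name == "velero.io/csi-pvc-backupper" {
			created = append(created, fmt.Sprintf("DataUpload for %s via %s", pvc, a.Source))
		}
	}
	return created
}

func main() {
	inTree := itemAction{Source: "in-tree CSI plugin", Name: "velero.io/csi-pvc-backupper"}
	outOfTree := itemAction{Source: "velero-plugin-for-csi:v0.1.1", Name: "velero.io/csi-pvc-backupper"}

	// Both plugins registered: the PVC is handled twice.
	for _, line := range runActions([]itemAction{inTree, outOfTree}, "ticketnfs/nfs-pvc") {
		fmt.Println(line)
	}

	// Out-of-tree plugin removed: the PVC is handled once.
	fmt.Println(len(runActions([]itemAction{inTree}, "ticketnfs/nfs-pvc")))
}
```

With both registrations present, each one fires for the PVC, so two upload requests appear; dropping the out-of-tree entry leaves a single request, which is why removing that plugin resolves the duplicate.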
Meanwhile, we need to enhance the plugin code to add a firewall that blocks this kind of plugin registration. Specifically, Velero supports multiple versions (v1, v2, etc.) of the same plugin, and each plugin has a unique name that is the same across all versions. Then:
This means the current checking mechanism is not enough. As a solution, we can do the following:
The above enhancement is based on the judgement that, for the same plugin, the name should be unique across all versions. For example:
The name is velero.io/csi-pvc-backupper for both V1 and V2.
@sseago @reasonerjt Let me know if you agree on the above judgement and solution. cc @blackpiglet
IMO, it's relatively safe to consider it a conflict if the name collides.
@Lyndon-Li, @reasonerjt, thank you for your response. Following your suggestion, I tried the Velero installation in several scenarios. Below are the related outputs:
Case 1) remove "velero/velero-plugin-for-csi:v0.1.1" However, this resulted in the following error: Here is the related bundle file: bundle-2024-08-06-13-10-55.tar.gz
2) remove "velero/velero-plugin-for-aws:latest" However, this resulted in the following error: Here is the related bundle file: bundle-2024-08-06-14-37-43.tar.gz
3) remove both "velero/velero-plugin-for-aws:latest" & "velero/velero-plugin-for-csi:v0.1.1" This did not execute the node agent flag without the plugin flags.
4) add both "velero/velero-plugin-for-aws:latest" & "velero/velero-plugin-for-csi:v0.1.1" This resulted in the same error we discussed earlier: Here is the related bundle file: bundle-2024-08-06-14-45-44.tar.gz
Case 1 is the right approach, but from the log bundle, looks like node-agent is not installed/running, please double check.
@Lyndon-Li Thank you for your response. However, I have now installed the node-agent, and it is running. I attempted to back up, but I'm encountering a different error.
Here is the related bundle file: bundle-2024-08-06-17-38-42.tar.gz
@jayrajmeh0 This is a known issue for 1.14, see #7898. Additionally, if you see problems different from the original one, please open a new GH issue. We need to use the current issue to track the plugin enhancement mentioned above.
I just ran a quick test, and the solution mentioned in https://github.com/vmware-tanzu/velero/issues/8058#issuecomment-2270420551 doesn't work. The BIA and RIA for the same object (e.g., PVC) may also share the same name, in which case a check based on identical names alone would block a legitimate registration. I think the Kind in the plugin framework currently represents both the kind/type itself and the version, so it is difficult to fix this problem from within the plugin framework.
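The false block can be illustrated with a small Go sketch. Everything here is hypothetical and is not the Velero plugin framework: the `registration` type, the `duplicates`, `nameKey`, `familyKey`, and `kindFamily` helpers, the example.io/pvc-action name, and the assumption that a kind string ends in a version suffix such as "V2". It compares a name-only check with a check keyed on the kind stripped of its version plus the name, which is one possible direction rather than an agreed fix.

```go
package main

import (
	"fmt"
	"strings"
)

// registration is a hypothetical plugin registration record.
type registration struct {
	Kind string // assumed to embed the version, e.g. "BackupItemActionV2"
	Name string
}

// kindFamily strips a trailing version suffix such as "V2" from a kind.
func kindFamily(kind string) string {
	trimmed := strings.TrimRight(kind, "0123456789")
	if trimmed != kind && strings.HasSuffix(trimmed, "V") {
		return strings.TrimSuffix(trimmed, "V")
	}
	return kind
}

// nameKey identifies a registration by name only.
func nameKey(r registration) string { return r.Name }

// familyKey identifies a registration by version-less kind plus name.
func familyKey(r registration) string { return kindFamily(r.Kind) + "/" + r.Name }

// duplicates returns, in first-seen order, every key that occurs more than once.
func duplicates(regs []registration, keyFn func(registration) string) []string {
	counts := map[string]int{}
	var order []string
	for _, r := range regs {
		k := keyFn(r)
		if counts[k] == 0 {
			order = append(order, k)
		}
		counts[k]++
	}
	var out []string
	for _, k := range order {
		if counts[k] > 1 {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	regs := []registration{
		{Kind: "BackupItemActionV2", Name: "velero.io/csi-pvc-backupper"}, // in-tree
		{Kind: "BackupItemAction", Name: "velero.io/csi-pvc-backupper"},   // old out-of-tree
		{Kind: "BackupItemActionV2", Name: "example.io/pvc-action"},       // BIA
		{Kind: "RestoreItemActionV2", Name: "example.io/pvc-action"},      // RIA, same name
	}
	fmt.Println("name only:    ", duplicates(regs, nameKey))
	fmt.Println("family + name:", duplicates(regs, familyKey))
}
```

In this model, the name-only check flags the BIA/RIA pair as well as the real CSI duplicate, while the family-plus-name key flags only the CSI duplicate. Whether the framework can derive such a version-less kind cleanly is exactly the difficulty raised above.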
Since there is no quick fix, I will remove this issue from 1.15 and let's discuss how to cope with this kind of problem in the long term. cc @sseago @reasonerjt @blackpiglet
Description I am encountering an issue while attempting to back up and restore pods using Velero with Rook-Ceph as the storage backend. The error occurs when using the --snapshot-move-data flag for CSI snapshot data movement. Detailed steps and error messages are provided below.
Steps to Reproduce
Current Issue The backup and restore processes fail with an error indicating that there is more than one DataUpload found for the given operation ID. This occurs both during the initial backup and subsequent restore attempts.
Expected Outcome
• Successful backup of pods using the --snapshot-move-data flag with Velero and Rook-Ceph.
• Successful restore of the backed-up data without encountering the DataUpload error.
Actual Outcome
• The backup process fails with an error related to multiple DataUploads being found for the same operation ID.
• The restore process also fails with the same error.
Additional Information • Current Environment:
Request for Assistance I would appreciate guidance on the following: