vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Error with --snapshot-move-data Flag in Velero with Rook-Ceph Storage backup time #8092

Open jayrajmeh0 opened 1 month ago

jayrajmeh0 commented 1 month ago

Description: I am encountering an issue while attempting to back up and restore pods using Velero with Rook-Ceph as the storage backend. The error occurs when using the --snapshot-move-data flag for CSI snapshot data movement. Detailed steps and error messages are provided below.
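For context, this is not the reporter's exact command (those details are in the attached screenshots and support bundle), but a typical CSI snapshot data movement backup is driven roughly like this; the backup and namespace names are placeholders:

```sh
# Sketch of a backup that moves CSI snapshot data to object storage
# (requires the CSI support and the node-agent to be installed)
velero backup create <backup-name> \
  --include-namespaces <app-namespace> \
  --snapshot-move-data \
  --wait

# Inspect the result, including per-volume data movement status
velero backup describe <backup-name> --details
```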

Steps to Reproduce

Current Issue: The backup process fails and ends with the status PartiallyFailed.

Expected Outcome:
• Successful backup of pods using the --snapshot-move-data flag with Velero and Rook-Ceph.
• Successful restore of the backed-up data without encountering the DataUpload error.

Actual Outcome:
• The backup process fails and is marked PartiallyFailed.

Additional Information:
• Current Environment:

Request for Assistance: I would appreciate guidance on the following:

  1. Any additional configuration steps or best practices for using the --snapshot-move-data flag with Velero and Rook-Ceph.
  2. Clarification on whether this is a known issue and if there are any patches or updates available to address it.

I have now installed the node-agent, and it is running. I attempted a backup, but I'm encountering an error.
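For reference, a common way to confirm the node-agent and inspect a PartiallyFailed backup looks like the following; the backup name is a placeholder and the label selector may differ depending on how Velero was installed:

```sh
# Confirm the node-agent daemonset and its pods are running
kubectl -n velero get daemonset node-agent
kubectl -n velero get pods -l name=node-agent

# Inspect the failing backup and its logs
velero backup describe <backup-name> --details
velero backup logs <backup-name>
```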

(Screenshots attached: Screenshot_752, Screenshot_749, Screenshot_751, Screenshot_750)

Here is the related bundle file: bundle-2024-08-06-17-38-42.tar.gz

Lyndon-Li commented 1 month ago

Dup with #7898

jayrajmeh0 commented 1 month ago

@Lyndon-Li I don’t understand; there’s nothing related to my current issue. Do you have a simple solution that will work for me and help me move forward with my tasks?

Lyndon-Li commented 1 month ago

Here is the error associated with the partially failed backup:

"message": "found a dataupload velero/media6-fj8fl with expose error: Pod is unschedulable: 0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. mark it as cancel",
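This is not a fix (the maintainer points to #7898 above), but for anyone hitting the same message, the data movement is tracked by DataUpload custom resources, and during the expose phase a temporary pod and PVC are created in the velero namespace; their events usually show why scheduling or binding failed. The DataUpload name below is taken from the error, the PVC name is a placeholder:

```sh
# Inspect the DataUpload that was cancelled
kubectl -n velero get datauploads.velero.io
kubectl -n velero describe dataupload media6-fj8fl

# Look at the temporary expose pod/PVC created in the velero namespace
kubectl -n velero get pods,pvc
kubectl -n velero describe pvc <backup-pvc-name>
```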

jayrajmeh0 commented 1 month ago

@Lyndon-Li That does not explain why this issue occurred:

"message": "found a dataupload velero/media6-fj8fl with expose error: Pod is unschedulable: 0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. mark it as cancel",

The error message does not point to a solution, and issue #7898 lacks any description of a resolution. Is it possible that the problem lies in my installation steps or elsewhere? Any guidance to help resolve this issue would be appreciated.

Lyndon-Li commented 1 month ago

There is no resolution/workaround; you either have to wait for the 1.14.1 release or downgrade to 1.13.x.
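For readers following the downgrade route, an illustrative path is below. The provider, bucket, and image tags are examples for a MinIO-backed setup like the one described later in the thread; plugin versions must be matched to the 1.13 line using the Velero compatibility matrix:

```sh
# Remove the 1.14.0 deployment, then reinstall pinning a 1.13.x server image
velero uninstall

velero install \
  --image velero/velero:v1.13.2 \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.2 \
  --bucket <bucket> \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=http://<minio-host>:9000
```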

jayrajmeh0 commented 1 month ago

@Lyndon-Li This means that the issue

"message": "found a dataupload velero/media6-fj8fl with expose error: Pod is unschedulable: 0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. mark it as cancel",

did not occur in version 1.13.x, and I can run my backup there with the --snapshot-move-data flag, correct?

Lyndon-Li commented 1 month ago

Yes

jayrajmeh0 commented 1 month ago

@Lyndon-Li Thank you! I will check it and keep you updated.

jayrajmeh0 commented 1 month ago

@Lyndon-Li Following your suggestion, I installed version 1.13.2 and attempted a backup with the --snapshot-move-data flag. The backup completed without any errors, but the snapshot data was not moved as expected. I also checked the MinIO S3 bucket, and no Kopia folder was created for the snapshot backup. In short, this does not meet my requirements. Do you have any solutions for this issue?
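A quick way to confirm whether data movement actually ran for a backup is sketched below; the backup name is a placeholder. If no DataUpload objects exist, the CSI/data-mover path was never engaged, which on 1.13 typically means the CSI plugin or the EnableCSI feature flag is missing, as the next comment points out:

```sh
# The backup description lists moved volumes and their status
velero backup describe <backup-name> --details

# Each data movement has a DataUpload CR in the velero namespace
kubectl -n velero get datauploads.velero.io
```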

(Screenshots attached: Screenshot_754, Screenshot_755, Screenshot_756, Screenshot_758, Screenshot_757, Screenshot_759, Screenshot_760, Screenshot_761)

Here is the related bundle file: bundle-2024-08-07-15-23-09.tar.gz

Lyndon-Li commented 1 month ago

For Velero versions prior to 1.14, you need to add the CSI plugin at installation time if you want to use the data mover feature.
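A sketch of a 1.13.x installation with the CSI plugin and data mover enabled is below; the provider, bucket, and plugin tags are examples and should be matched to your Velero release:

```sh
velero install \
  --image velero/velero:v1.13.2 \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.2,velero/velero-plugin-for-csi:v0.7.1 \
  --features=EnableCSI \
  --use-node-agent \
  --bucket <bucket> \
  --secret-file ./credentials-velero \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=http://<minio-host>:9000

# Or add the plugin to an existing deployment instead of reinstalling
velero plugin add velero/velero-plugin-for-csi:v0.7.1
```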

jayrajmeh0 commented 1 month ago

@Lyndon-Li I'm encountering this error related to volume snapshots and classes, even though I've already defined them in my namespace.

(Screenshots attached: Screenshot_762, Screenshot_763, Screenshot_764, Screenshot_765)

Here is the related bundle file: bundle-2024-08-07-17-03-56.tar.gz
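The root cause here turns out to be the plugin version (see the next comment), but for reference, errors about volume snapshot classes usually mean Velero's CSI support cannot find a VolumeSnapshotClass for the PVC's CSI driver that carries the Velero label. A quick check looks like this; the class name is an example from a typical Rook-Ceph setup:

```sh
# VolumeSnapshotClass is cluster-scoped; list the classes and their drivers
kubectl get volumesnapshotclass

# Velero selects the class labeled for it that matches the PVC's CSI driver
# (e.g. rook-ceph.rbd.csi.ceph.com)
kubectl label volumesnapshotclass csi-rbdplugin-snapclass \
  velero.io/csi-volumesnapshot-class=true
```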

sseago commented 1 month ago

@jayrajmeh0 You're using version 0.1.1 of the CSI plugin. For Velero 1.13 you need 0.7.x.
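One way to check which plugin images are deployed and to swap in a 0.7.x CSI plugin is sketched below; the tag is an example, pick the one matching your Velero 1.13.x release:

```sh
# Plugins run as init containers on the velero deployment
kubectl -n velero get deploy velero \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'

# Replace the outdated CSI plugin
velero plugin remove velero-plugin-for-csi
velero plugin add velero/velero-plugin-for-csi:v0.7.1
```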

jayrajmeh0 commented 1 month ago

@Lyndon-Li & @sseago Thank you both for your guidance and assistance with the backup process. I was able to take a snapshot backup using the --snapshot-move-data flag, so that task is now complete.

(Screenshots attached: Screenshot_773, Screenshot_774, Screenshot_775, Screenshot_776, Screenshot_777, Screenshot_783)

The backup completed successfully. However, after I deleted the namespace associated with the backup shown in the screenshots above, I ran into an issue during my next task, restoring with Velero. The details are as follows:

(Screenshots attached: Screenshot_778, Screenshot_780, Screenshot_781)

Here is the related bundle file: bundle-2024-08-08-12-08-27.tar.gz
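For reference, the usual way to pull the details the maintainer quotes below looks like this; the restore name is taken from the log lines in the next comment:

```sh
# Inspect the partially failed restore and its logs
velero restore describe mongo-20240808115503 --details
velero restore logs mongo-20240808115503

# Data movement on the restore side is tracked by DataDownload CRs
kubectl -n velero get datadownloads.velero.io
```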

reasonerjt commented 1 month ago

By checking the logs, I found the following error messages:

time="2024-08-08T06:25:44Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-08-08T06:25:53Z" level=error msg="fail to get DataDownload: fail to list DataDownload: etcdserver: leader changed" Action=PVCRestoreItemAction Namespace=velero OperationID=dd-4703bc72-cbef-4a31-a0bf-8a3ad433d90f.be3cfbe8-fe0a-41d7f03b9 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/restore/pvc_action.go:240" pluginName=velero-plugin-for-csi
time="2024-08-08T06:25:53Z" level=info msg="Marking restore mongo-20240808115503 FinalizingPartiallyFailed" logSource="pkg/controller/restore_operations_controller.go:186" restore=velero/mongo-20240808115503

So it seems to be a transient issue in etcd. If the problem is highly reproducible, you may want to check whether there is a resource or performance issue in your cluster.
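A rough sketch of how one might look for the etcd or resource pressure mentioned above; the label selector assumes a kubeadm-style cluster with static etcd pods, and kubectl top requires metrics-server:

```sh
# etcd health and recent leader changes
kubectl -n kube-system get pods -l component=etcd
kubectl -n kube-system logs -l component=etcd --tail=200 | grep -i leader

# General pressure signals on the nodes (metrics-server required for top)
kubectl top nodes
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```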

jayrajmeh0 commented 1 month ago

@reasonerjt I have checked all the pods, and everything looks good without any errors during the restore process. Can you help me resolve the issue in more detail?

@Lyndon-Li @sseago @reasonerjt Additionally, I have another question related to Velero and Ceph storage. Currently, I have a three-node Kubernetes cluster with Ceph storage. If any of these three nodes were to be destroyed and my database backup is stored in the cloud using Velero, can I restore it to a new Kubernetes environment? What are the prerequisites for this, particularly concerning Ceph storage pools and OSDs? Do I need to set up the same names or configurations for Velero to restore everything correctly?

jayrajmeh0 commented 3 weeks ago

@Lyndon-Li I successfully backed up one cluster and used the --snapshot-move-data flag to restore the backup on a second cluster. The restore process completed without any issues, and all statuses showed as successful. However, when I checked the MongoDB pod in the namespace of the new cluster, it was not functioning properly and encountered some issues.

(Screenshots attached: Screenshot_822, Screenshot_823, Screenshot_827, Screenshot_828, Screenshot_824, Screenshot_825)

Here is the related bundle file: bundle-2024-08-16-16-45-06.tar.gz

reasonerjt commented 3 weeks ago

@jayrajmeh0

As for the latest issue, it seems the infrastructure where you restored the MongoDB does not meet MongoDB's requirements. I suggest you read the links in the error message to understand why it failed. IMO this is beyond the scope of Velero.