vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

timed out waiting for all PodVolumeRestores to complete #4552

Open manish-soni1 opened 2 years ago

manish-soni1 commented 2 years ago

Hi,

I am taking a backup of one of the projects on an OpenShift cluster using Velero, which succeeds. Restoring the same backup on another OpenShift cluster fails with the error below:

error msg="unable to successfully complete restic restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1433"

Velero version: v1.7.0
Kubernetes version: v1.21.1+9807387

I have attached the Velero logs as well: veleroRestoreLogs.txt

Thanks & Regards, Manish Soni

qiuming-best commented 2 years ago

From the log, I found that the whole restore took about 4 hours, so it ended with a timeout error. Is the data you want to restore very large?

manish-soni1 commented 2 years ago

It's only a few KB. All pods and PVs are restored on the second OpenShift cluster, but the data, which is only a few rows, is not restored.

qiuming-best commented 2 years ago

@Manish-Soni1 could you provide the restic pod logs? It seems that restic is timing out while doing the restore.

qiuming-best commented 2 years ago

@Manish-Soni1 Here is the bug report template. You can follow its steps to provide as many logs as possible so we can track this issue.

manish-soni1 commented 2 years ago

@qiuming-best I tried the restore again today and collected fresh logs, which I have attached here: restoreLogs.txt, velero_resticPodslogs.txt, backupDescribe.txt, backupLogs.txt, restoreDescribe.txt.

prad6588 commented 2 years ago

Any response to this ticket? I am getting the same issue.

prad6588 commented 2 years ago

While running a backup, it gets stuck and times out after 4 hours. The error in the log is below:

timed out waiting for all PodVolumeBackups to complete

Error time="2022-05-12T15:51:21Z" level=error msg="Error backing up item" backup=velero/cluster-2022-05-12-11-51-11 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:177" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417"

Backup command used

./velero-v1.8.1-linux-amd64/velero backup create cluster-"${datebkp}" --default-volumes-to-restic

sseago commented 2 years ago

Could you provide the yaml for the PodVolumeRestores? From the restore logs, Velero created the PVRs and waited until the timeout for them to complete, but they weren't done yet. I suspect they weren't actually started for some reason. Also, the restic pod logs text file above is actually the velero pod log. If the PodVolumeRestore was created for your pod+volume, then the restic pod on that node should have picked it up. It's possible that the restic pod on that node isn't running or is somehow unhealthy. Also, there could be an issue with the ResticRepository, so it might be good to provide the yaml for those resources as well.
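The resources requested above could be gathered with something like the following. This is only a sketch, not an official procedure: it assumes Velero is installed in the `velero` namespace and that the restic daemonset pods carry the `name=restic` label, as in a default install of the versions discussed here. The `run` helper and the `DRY_RUN` and `VELERO_NAMESPACE` variables are names made up for this example.

```shell
#!/bin/sh
# Namespace where Velero runs; "velero" is assumed, override with VELERO_NAMESPACE.
NS="${VELERO_NAMESPACE:-velero}"

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

collect_restic_diagnostics() {
  # PodVolumeRestore objects, including their status
  run kubectl -n "$NS" get podvolumerestores -o yaml
  # Restic daemonset pods and the nodes they run on (label is an assumption)
  run kubectl -n "$NS" get pods -l name=restic -o wide
  # ResticRepository objects
  run kubectl -n "$NS" get resticrepositories -o yaml
}
```

Running `collect_restic_diagnostics > diagnostics.txt` would produce a single file to attach to the issue. The logs of an individual restic pod would still need to be fetched separately with `kubectl logs`.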

prad6588 commented 2 years ago

Hello..

Now I am skipping the PV backup and doing a regular restore. While restoring, I see that the Ingress is not getting restored and I get the error below. Can you help, please?

error restoring ingresses.networking.k8s.io: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://devnginx-ingress-nginx/networking/v1/ingresses?timeout=10s": service

fhageman commented 1 year ago


Had the same issue. Before restoring, you need to delete your nginx-ingress ValidatingWebhookConfiguration.
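A rough sketch of that workaround is shown below. The name of the webhook configuration is not given in this thread, so it is passed in as an argument and has to be looked up on the target cluster first. The function names and the `run` helper are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List all validating webhook configurations to find the nginx-ingress one.
list_webhooks() {
  run kubectl get validatingwebhookconfigurations
}

# Print the definition so it can be saved and re-applied after the restore.
show_webhook() {
  run kubectl get validatingwebhookconfiguration "$1" -o yaml
}

# Delete the webhook configuration before starting the restore.
delete_webhook() {
  run kubectl delete validatingwebhookconfiguration "$1"
}
```

For example, `show_webhook NAME > webhook-backup.yaml` followed by `delete_webhook NAME`, where `NAME` is whatever `list_webhooks` reports for the ingress controller. Keeping the saved YAML makes it possible to restore the webhook with `kubectl apply -f` once the Ingress objects are back.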

Lyndon-Li commented 6 months ago

Looks like this is another issue related to the webhook: the network is not ready, so the init container is not executed. As a result, the Velero PodVolumeRestore keeps waiting for the init container to complete.
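One way to check whether the init container ever started is to inspect the init container statuses of the restored pod. This is a sketch under assumptions: the namespace and pod name are placeholders supplied by the caller, and the init container that Velero injects for restic restores is expected to be named `restic-wait` in the versions discussed here. The function name and the `run` helper are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print each init container's name and current state for one pod.
# Usage: show_init_containers NAMESPACE POD
show_init_containers() {
  run kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
}
```

If the output is empty, or the init container is still waiting, the pod never reached the point where the restic restore could begin, which would match the behavior described above.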

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.


blackpiglet commented 2 months ago

unstale
