vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

resource restore error: Timeout #7709

Open ffzzhong opened 2 months ago

ffzzhong commented 2 months ago

What steps did you take and what happened:

This issue appeared suddenly.

  1. I'm working on a cluster migration: Velero FSB (file system backup) runs in cluster A, the backup data is stored in AWS S3, and the restore runs in cluster B.
  2. I'm using Velero 1.11.0 and the AWS provider plugin v1.7.1, with the default uploader type, restic.
  3. I had already successfully migrated 10+ services whose PVs are relatively small (< 1G).
  4. Then I started migrating a big PV (around 200G). The backup took a long time to finish, but its status eventually became Completed.
  5. In cluster B, I ran velero restore create --from-backup the-big-backup. It had still not succeeded after a very long time. At that point I assumed the backup was simply too big, so I didn't check the logs carefully and stopped the restore manually (by running velero restore delete and manually deleting the pod, statefulset, PVC, PV, etc.).
  6. Since then, I have not been able to restore from any backup. Every restore ends as PartiallyFailed with 1 outstanding error: level=error msg="Namespace default, resource restore error: Timeout: request did not complete within requested timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go, and in the restore pod, the initContainer restore-wait always says Not found: /restores/data/.velero/xxxx-xxxx-xxxx-xxxx

what I tried:

So the backup seems OK and the problem is on the restore side, but I'm not sure what the root cause of the request timeout and the Not found: /restores/data message is. I checked my cluster's API server response time and it is acceptable. I also don't understand why restoring a previously successful backup now fails as well.
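An intermittent timeout can hide behind an acceptable average response time, so it may help to look at the slowest calls as well as the typical ones. The following is a minimal sketch, not part of Velero: it assumes kubectl is on PATH and pointed at cluster B, and the helper names (kubectl_readyz, sample_latencies, summarize) are made up for illustration. It only times a read-only request, so it may not reflect the latency of the write requests a restore makes.

```python
import statistics
import subprocess
import time
from typing import Callable, Dict, List


def kubectl_readyz() -> None:
    """Send one lightweight read request to the API server via kubectl.

    Assumption: kubectl is installed and its current context is cluster B.
    """
    subprocess.run(
        ["kubectl", "get", "--raw", "/readyz"],
        check=True,
        capture_output=True,
        timeout=60,
    )


def sample_latencies(probe: Callable[[], None], samples: int = 20) -> List[float]:
    """Call probe `samples` times and return each call's duration in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        probe()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Report the typical (median) and worst (max) latency."""
    return {"median": statistics.median(latencies), "max": max(latencies)}


if __name__ == "__main__":
    print(summarize(sample_latencies(kubectl_readyz)))
```

A large gap between the median and the maximum would suggest occasional slow calls rather than a uniformly slow API server.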

What did you expect to happen: the restore should work, or at least it should work for the small backups (they worked before)

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

Yes. I was initially using Velero 1.11.0, then upgraded to 1.12.2 as an attempted workaround, but still no luck. I'm attaching the debug info from 1.12.2: bundle-2024-04-21-14-35-13.tar.gz

reasonerjt commented 2 months ago

I checked the debug bundle you provided. In the restore venti-airflow-20240420-2-20240421135403, the only error was:

time="2024-04-21T05:55:07Z" level=error msg="error patch for managed fields default/venti-airflow-postgresql-0: Timeout: request did not complete within requested timeout - context deadline exceeded" logSource="pkg/restore/restore.go:1731" restore=velero/venti-airflow-20240420-2-20240421135403

Could you check whether all the resources were restored as expected and the data was populated in this particular restore?

Could you also check whether the other restores that failed to complete show the same error?
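To compare errors across several restores, the error entries can be pulled out of each restore log. The following is a minimal sketch, assuming the logs use the key=value format of the line quoted above; parse_logfmt and error_messages are illustrative helper names, not part of Velero, and the parser relies on shlex, so messages containing unbalanced quotes would need a dedicated logfmt parser.

```python
import shlex
from typing import Dict, Iterable, List

# The error line quoted above, used here as sample input.
SAMPLE_LINE = (
    'time="2024-04-21T05:55:07Z" level=error '
    'msg="error patch for managed fields default/venti-airflow-postgresql-0: '
    "Timeout: request did not complete within requested timeout - "
    'context deadline exceeded" '
    'logSource="pkg/restore/restore.go:1731" '
    "restore=velero/venti-airflow-20240420-2-20240421135403"
)


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split one key=value log line into a dict, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def error_messages(lines: Iterable[str]) -> List[str]:
    """Return the msg field of every line logged at level=error."""
    parsed = (parse_logfmt(line) for line in lines)
    return [f["msg"] for f in parsed if f.get("level") == "error" and "msg" in f]


if __name__ == "__main__":
    for message in error_messages([SAMPLE_LINE]):
        print(message)
```

Feeding it the lines of each restore's log (for example, saved from velero restore logs <restorename>) gives one list of error messages per restore, which makes it easy to see whether they all fail the same way.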

ffzzhong commented 2 months ago

@reasonerjt this issue happens randomly and suddenly. When I raised this issue, restores did not work no matter how small the backup was or how long I waited. The only error I see in the restore log is the one saying timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go:567

But some time after I raised the issue, while I kept retrying the restore, everything suddenly worked normally again, and I was able to restore even the biggest backup (around 200G). Today things went wrong again: same issue, a timeout, for any restore.

To answer your questions:

  1. The resources are created: I can see the svc, statefulset, pod, PVC, and PV. After that I run velero get backup and see the status become PartiallyFailed very soon, and the data in the PV never gets populated. When this issue happens, I always get the same error, even for a very small backup.
  2. I tried creating several restores, and all of them show the same error.

Regarding the error timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go:567: in Velero v1.12.2, the code is https://github.com/vmware-tanzu/velero/blob/v1.12.2/pkg/controller/restore_controller.go#L565-L569. It seems to be trying to collect some info from the namespace and failing? It is logged as an error, not a warning, yet I do see the resources created in the desired namespace. What makes it an error?

ffzzhong commented 2 months ago

Is it because my k8s API server is somehow slow? As I mentioned, our cluster is a hybrid cluster, part of API calls go to the machines on the cloud, but