resource restore error: Timeout

ffzzhong commented 2 months ago

What steps did you take and what happened:

This issue happens kinda suddenly.

I'm working on a cluster migration: in cluster A I run Velero FSB with the backup data stored in AWS S3, and in cluster B I run the restore.
I'm using Velero 1.11.0 and AWS provider plugin v1.7.1, with the default restic uploader type.
I had already successfully migrated 10+ services whose PVs are relatively small (< 1G).
Then I started migrating a big PV (around 200G). The backup took a long time to finish, but its status finally became Completed.
In cluster B, I ran velero restore create --from-backup the-big-backup, but it had still not succeeded after a very long time. At that point I assumed the backup was just too big, so I didn't check the logs carefully and manually stopped it (by running velero restore delete and manually deleting the pod, statefulset, PVC, PV, etc.).
After this, I have never been able to restore successfully from any backup. The restore always ends as PartiallyFailed with 1 outstanding error: level=error msg="Namespace default, resource restore error: Timeout: request did not complete within requested timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go, and in the restore pod the initContainer restore-wait keeps printing Not found: /restores/data/.velero/xxxx-xxxx-xxxx-xxxx
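
For context on the repeated Not found message: the restore-wait init container appears to poll for a per-restore marker file under each mounted volume and only exits once the file exists. If the message never stops, the data restore into the volume probably never finished or never started. Below is a minimal Go sketch of such a wait loop, assuming the path layout seen in the log above. The names markerPath, waitForMarkers and fileExists, the retry limit, and the example UID are hypothetical and are not taken from Velero's source.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// markerPath builds the path that the init container reports as "Not found".
// The layout <root>/<volume>/.velero/<restore-uid> is inferred from the log line.
func markerPath(root, volume, restoreUID string) string {
	return filepath.Join(root, volume, ".velero", restoreUID)
}

// waitForMarkers polls until every marker passes the exists check or maxTries
// rounds have been used. It returns the markers that are still missing.
func waitForMarkers(markers []string, exists func(string) bool, interval time.Duration, maxTries int) []string {
	pending := append([]string(nil), markers...)
	for try := 0; try < maxTries && len(pending) > 0; try++ {
		var missing []string
		for _, m := range pending {
			if !exists(m) {
				fmt.Printf("Not found: %s\n", m)
				missing = append(missing, m)
			}
		}
		pending = missing
		if len(pending) > 0 {
			time.Sleep(interval)
		}
	}
	return pending
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Hypothetical values: the real restore UID is elided in the report.
	marker := markerPath("/restores", "data", "example-restore-uid")
	missing := waitForMarkers([]string{marker}, fileExists, time.Second, 3)
	fmt.Printf("markers still missing: %d\n", len(missing))
}
```

If this picture is right, the init container is only a symptom. The component to inspect is whatever is supposed to write the data and the marker, that is, the node-agent pod on the same node and its logs.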

What I tried:

try to restore the big backup again, restore failed
try to restore a previously successful backup, with very small size, restore failed
try to create a new backup with small size and restore, backup OK but restore failed
try to create a new backup location in both clusters, using Kopia uploader type, backup OK but restore failed
try to set the node-agent CPU and memory requests and leave the limits unbounded, restore failed
try to upgrade Velero to 1.12.2 and upgrade AWS provider plugin to 1.8.2 accordingly, restore failed

So the backup seems to be OK and the restore is where the problem lies, but I'm not sure what the root cause of the request timeout and the Not found: /restores/data message is. I checked my cluster's API server response time and it is acceptable. I also don't understand why restoring a previously successful backup now fails as well.

What did you expect to happen:

The restore should work, or at least it should work for the small backups (they worked before).

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

Yes, I was using Velero 1.11.0 initially, then upgraded to 1.12.2 as an attempted workaround, but still no luck. I am attaching the debug bundle from 1.12.2: bundle-2024-04-21-14-35-13.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add:

Environment:

Velero version (use velero version): tried 1.11.0, 1.12.2
Velero features (use velero client config get features): features:
Kubernetes version (use kubectl version): v1.25.6
Kubernetes installer & version:
Cloud provider or hardware configuration: hybrid cluster, on-cloud control-planes and on-prem nodes, using on-prem iscsi service as storage backend
OS (e.g. from /etc/os-release): ubuntu 22.04

reasonerjt commented 2 months ago

I checked the debug bundle you provided. In the restore venti-airflow-20240420-2-20240421135403, the only error was:

time="2024-04-21T05:55:07Z" level=error msg="error patch for managed fields default/venti-airflow-postgresql-0: Timeout: request did not complete within requested timeout - context deadline exceeded" logSource="pkg/restore/restore.go:1731" restore=velero/venti-airflow-20240420-2-20240421135403

Could you check whether all the resources were restored as expected and the data was populated in this particular restore?

Could you also check whether the other restores that failed to complete show the same error?

ffzzhong commented 2 months ago

@reasonerjt this issue happens randomly and rather suddenly. When I raised this issue, restores indeed didn't work, no matter how small the backup was or how long I waited. The only error I see in the restore log is the one saying timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go:567. But some time after I raised the issue, while I kept retrying, everything suddenly worked normally again, and I was able to restore even the biggest backup (around 200G). Today things went wrong again: same issue, timeout, for any restore.

For the questions you asked:

The resources are actually created: I can see that the svc, statefulset, pod, PVC, and PV all exist. After that I run velero get backup and see the status become PartiallyFailed very soon, and the data in the PV never gets populated. When this issue happens I always get the same error, even for a very small backup.
I tried creating several restores, and I see the same error for all of them.

For the error timeout - context deadline exceeded" logSource="pkg/controller/restore_controller.go:567: in Velero v1.12.2 the code is https://github.com/vmware-tanzu/velero/blob/v1.12.2/pkg/controller/restore_controller.go#L565-L569. It seems to be trying to collect some info from the namespace and failing? It is logged as an error, not a warning, but I do see the resources created in the desired namespace, so what makes it an error?
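
One possible reading of that log line, offered as an assumption rather than a statement about Velero's code: the controller may not be making a new request to the namespace at that point. It may only be re-logging errors that were collected earlier in the restore, grouped by namespace. That would fit the format of the message, and it would fit the debug bundle, where the underlying failure is the managed-fields patch timeout logged from pkg/restore/restore.go. The Go sketch below illustrates that pattern. The result type, its field names, and the first two message formats are hypothetical; only the "Namespace ..., resource restore error" format comes from the log in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// result groups errors collected during a restore by where they occurred.
// The shape and field names are assumptions for illustration.
type result struct {
	Velero     []string
	Cluster    []string
	Namespaces map[string][]string
}

// summarize turns the collected errors into log lines after the restore has run.
// It performs no requests of its own; it only reports what was recorded earlier.
func summarize(r result) []string {
	var lines []string
	for _, msg := range r.Velero {
		lines = append(lines, fmt.Sprintf("Velero restore error: %v", msg))
	}
	for _, msg := range r.Cluster {
		lines = append(lines, fmt.Sprintf("Cluster resource restore error: %v", msg))
	}
	// Sort namespace names so the output order is stable.
	namespaces := make([]string, 0, len(r.Namespaces))
	for ns := range r.Namespaces {
		namespaces = append(namespaces, ns)
	}
	sort.Strings(namespaces)
	for _, ns := range namespaces {
		for _, msg := range r.Namespaces[ns] {
			lines = append(lines, fmt.Sprintf("Namespace %v, resource restore error: %v", ns, msg))
		}
	}
	return lines
}

func main() {
	r := result{Namespaces: map[string][]string{
		"default": {"Timeout: request did not complete within requested timeout - context deadline exceeded"},
	}}
	for _, line := range summarize(r) {
		fmt.Println(line)
	}
}
```

Under this reading, a single failed request on one item (for example a patch that timed out) is enough to mark the restore PartiallyFailed even though every resource object was created.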

ffzzhong commented 2 months ago

Is it because my k8s API server is somehow slow? As I mentioned, our cluster is a hybrid cluster, and part of the API calls go to machines in the cloud, but:

when the issue happens, I check my k8s API performance: the response time is around 150ms and there are no dropped requests
why does everything sometimes work normally and then suddenly stop working entirely (I didn't make any changes)?
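
On the latency question, a typical response time of about 150ms does not rule out individual requests exceeding a deadline, because one slow call is enough to produce "context deadline exceeded". The Go sketch below shows one way to look for that kind of outlier: it runs an operation repeatedly, each time under its own deadline, and reports how many calls timed out and how long the slowest one took. sleepOp is a stand-in for a real API request, and probe and the chosen durations are hypothetical. Whether the deadline in this issue is enforced by the client or by the API server is not established in the thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepOp returns a stand-in for one API request that takes d to complete
// unless its context expires first. It is a placeholder, not a real client call.
func sleepOp(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// probe runs op n times, each under its own deadline, and reports how many
// calls hit the deadline and the slowest observed duration.
func probe(op func(context.Context) error, timeout time.Duration, n int) (timedOut int, slowest time.Duration) {
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		start := time.Now()
		err := op(ctx)
		elapsed := time.Since(start)
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			timedOut++
		}
		if elapsed > slowest {
			slowest = elapsed
		}
	}
	return timedOut, slowest
}

func main() {
	// A 150ms operation fits comfortably inside a 1s deadline.
	timedOut, slowest := probe(sleepOp(150*time.Millisecond), time.Second, 5)
	fmt.Printf("timed out: %d, slowest: %v\n", timedOut, slowest)

	// The same probe with an operation slower than its deadline.
	timedOut, _ = probe(sleepOp(200*time.Millisecond), 50*time.Millisecond, 3)
	fmt.Printf("timed out: %d, error text: %q\n", timedOut, context.DeadlineExceeded.Error())
}
```

To use this against a real cluster, replace sleepOp with the request type that actually fails (here, a patch on the restored object) and compare the slowest call with the timeout, not the average.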

vmware-tanzu / velero

resource restore error: Timeout #7709