vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

restore is stuck on restic restore and showing "New" for all restic volumes and not progressing #8043

Open adityagu0910 opened 1 month ago

adityagu0910 commented 1 month ago

What steps did you take and what happened:
Ran the restore command below after deleting the deployment. It should restore the deployment along with its PVC.

`velero restore create --from-backup`

But it is stuck in "InProgress" status, and I see the following in the describe output. All restic restores are stuck in "New" status:

Restic Restores:
  New:
    cp4i/ibm-nginx-7586547695-swcrn: user-home-mount
    cp4i/management-1e683816-postgres-backrest-shared-repo-7fdd48f7xtnhk: backrestrepo, pgbackrest-conf
    cp4i/management-1e683816-postgres-pgbouncer-dfdd9c8d-v4dxj: tmp
    cp4i/zen-core-api-6fccb8f89b-5wktv: metastore-volume
    cp4i/zen-watcher-ddb6c6fc7-9n6rp: metastore-volume
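A sketch of how a stuck restore like this can be inspected further; `<restore-name>` is a placeholder, and `velero` is assumed to be the install namespace:

```shell
# Show per-volume status in the restore's details section.
velero restore describe <restore-name> --details

# Each restic volume restore is tracked by a PodVolumeRestore custom
# resource; listing them shows which ones never left the "New" phase.
kubectl -n velero get podvolumerestores.velero.io
```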

What did you expect to happen:

restore to complete.



blackpiglet commented 1 month ago

Could you collect the debug bundle of the failed restore by running the CLI command `velero debug`?
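For reference, the debug bundle can be scoped to a single backup or restore; the names below are placeholders:

```shell
# Collect a debug bundle for one restore (and the backup it came from).
velero debug --backup <backup-name> --restore <restore-name>
```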

adityagu0910 commented 1 month ago

The restore is still in "InProgress" status, so it seems `velero debug` is not able to collect logs.

```
2024/07/24 08:25:08 Collecting log and information for backup:
2024/07/24 08:25:28 Collecting log and information for restore:
An error occurred: exec failed: Traceback (most recent call last):
  velero-debug-collector:28:21: in
  velero-debug-collector:14:22: in capture_restore_logs
  : in capture
Error: capture_local: exit status 1
```
blackpiglet commented 1 month ago

OK. Please check whether there are node-agent pods in the namespace where Velero is installed. Also, what is the Velero version in your environment?
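A minimal sketch of these checks, assuming Velero is installed in the `velero` namespace:

```shell
# List all pods in the Velero install namespace.
kubectl -n velero get pods

# In older releases the per-node daemon pods are named "restic-*";
# the daemonset was renamed to "node-agent" starting with Velero v1.10.
kubectl -n velero get daemonset

# Report client and server versions.
velero version
```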

adityagu0910 commented 1 month ago

We have only restic and velero pods running. Below is the version:

Client:
  Version: v1.7.0
  Git commit: 9e52260568430ecb77ac38a677ce74267a8c2176
Server:
  Version: v1.7.0

blackpiglet commented 1 month ago

v1.7.0 was released years ago and is already out of the maintenance scope. Could you please bump the Velero version you are using? I suggest v1.13.2 or v1.14.0.

adityagu0910 commented 1 month ago

Our backups were taken with this version. Will restore work with the latest version of Velero on backups taken with v1.7.0?

blackpiglet commented 1 month ago

In most cases it should work, but of course there have been some breaking changes along the way. What Kubernetes version are you using? That should be considered too. https://github.com/vmware-tanzu/velero/tree/release-1.9?tab=readme-ov-file#velero-compatibility-matrix

adityagu0910 commented 1 month ago

We have the following OCP and Kubernetes versions:

OpenShift version: 4.14.27

Kubernetes version: v1.27.13+048520e

blackpiglet commented 1 month ago

Then the k8s version is not a blocker. I suggest upgrading Velero according to this document: https://velero.io/docs/v1.14/upgrade-to-1.14/
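A hedged sketch of the core steps from that upgrade document; the image tag and the `velero` namespace are assumptions, and upgrading across many minor versions (v1.7 to v1.14) may need intermediate steps described in the docs:

```shell
# Update the Velero custom resource definitions to the new release.
velero install --crds-only --dry-run -o yaml | kubectl apply -f -

# Point the server deployment at the new image.
kubectl set image deployment/velero velero=velero/velero:v1.14.0 --namespace velero

# Note: the "restic" daemonset was renamed to "node-agent" in v1.10,
# so the daemonset needs to be recreated per the upgrade document,
# not just re-imaged.
```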