vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

velero restore create --from-backup test-backup failed - An error occurred: file not found #7663

Closed dove-young closed 3 weeks ago

dove-young commented 3 months ago

What steps did you take and what happened:

I restored a backup from one OCP cluster to another OCP cluster; the restore failed after a few moments. My storage backend is MinIO.

I checked the restore logs, and they only report `An error occurred: file not found`:

[root@api.o1-726938 ]# velero restore logs test-backup-20240411204147
An error occurred: file not found
[root@api.o1-726938 ]# 

What did you expect to happen:

The restore should finish successfully.

The following information will help us better understand what's going on:

bundle-2024-04-11-20-51-32.tar.gz

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
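With the backup and restore names from this issue, the support-bundle command from the template would look like this (a sketch; run it wherever the `velero` CLI is configured against the affected cluster):

```shell
# Generate a support bundle covering both the backup and the failed restore.
# The names below are taken from this issue's output.
velero debug \
  --backup test-backup \
  --restore test-backup-20240411204147

# The resulting bundle (e.g. bundle-<timestamp>.tar.gz) can then be
# attached to the issue.
```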

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

dove-young commented 3 months ago

Here is what I did:

[root@api.o1-726938]# velero backup get test-backup
NAME          STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
test-backup   Completed   0        0          2024-04-11 20:30:39 -0700 PDT   29d       default            <none>
[root@api.o1-726938]#
[root@api.o1-726938]# velero restore describe test-backup-20240411204147
Name:         test-backup-20240411204147
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                       Failed (run 'velero restore logs test-backup-20240411204147' for more information)
Total items to be restored:  7848
Items restored:              400

Started:    2024-04-11 20:41:47 -0700 PDT
Completed:  2024-04-11 20:43:09 -0700 PDT

Backup:  test-backup

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto

[root@api.o1-726938]#

[root@api.o1-726938]# velero backup describe test-backup
Name:         test-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.7+f1b5f6c
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  Completed

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-04-11 20:30:39 -0700 PDT
Completed:  2024-04-11 20:32:05 -0700 PDT

Expiration:  2024-05-11 20:30:39 -0700 PDT

Total items to be backed up:  8983
Items backed up:              8983

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included>

  Pod Volume Backups: <none included>

HooksAttempted:  0
HooksFailed:     0
[root@api.o1-726938]#
dove-young commented 3 months ago

I can see the backup files in my MinIO bucket:

[root@api.o1-726938]# mc ls st2-tunnel-dev1/veleco-hgliu/backups/test-backup/
[2024-04-11 20:32:03 PDT]    29B STANDARD test-backup-csi-volumesnapshotclasses.json.gz
[2024-04-11 20:32:03 PDT]    29B STANDARD test-backup-csi-volumesnapshotcontents.json.gz
[2024-04-11 20:32:04 PDT]    29B STANDARD test-backup-csi-volumesnapshots.json.gz
[2024-04-11 20:32:02 PDT]    27B STANDARD test-backup-itemoperations.json.gz
[2024-04-11 20:31:45 PDT] 234KiB STANDARD test-backup-logs.gz
[2024-04-11 20:32:02 PDT]    29B STANDARD test-backup-podvolumebackups.json.gz
[2024-04-11 20:32:02 PDT]  69KiB STANDARD test-backup-resource-list.json.gz
[2024-04-11 20:32:04 PDT]    49B STANDARD test-backup-results.gz
[2024-04-11 20:32:03 PDT] 1.2KiB STANDARD test-backup-volumeinfo.json.gz
[2024-04-11 20:32:02 PDT]    29B STANDARD test-backup-volumesnapshots.json.gz
[2024-04-11 20:32:01 PDT]  22MiB STANDARD test-backup.tar.gz
[2024-04-11 20:32:05 PDT] 2.9KiB STANDARD velero-backup.json
[root@api.o1-726938]#
dove-young commented 3 months ago

I created a new backup and tried again; the restore failed with the same error.

Backup

[root@st2-bastion-dev21]# velero backup describe test-backup-2
Name:         test-backup-2
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.7+f1b5f6c
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  Completed

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-04-11 22:20:47 -0700 PDT
Completed:  2024-04-11 22:22:04 -0700 PDT

Expiration:  2024-05-11 22:20:47 -0700 PDT

Total items to be backed up:  9426
Items backed up:              9426

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included>

  Pod Volume Backups: <none included>

HooksAttempted:  0
HooksFailed:     0

Restore

Restore completed with status: Failed. You may check for more information using the commands `velero restore describe test-backup-2-20240411222311` and `velero restore logs test-backup-2-20240411222311`.
[root@api.o1-726938]# velero restore logs test-backup-2-20240411222311
An error occurred: file not found
[root@api.o1-726938]# velero restore describe test-backup-2-20240411222311
Name:         test-backup-2-20240411222311
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:                       Failed (run 'velero restore logs test-backup-2-20240411222311' for more information)
Total items to be restored:  7849
Items restored:              241

Started:    2024-04-11 22:23:11 -0700 PDT
Completed:  2024-04-11 22:24:05 -0700 PDT

Backup:  test-backup-2

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto

[root@api.o1-726938]# mc ls st2-tunnel-dev1/veleco-hgliu/backups/test-backup-2
[2024-04-11 22:22:02 PDT]    29B STANDARD test-backup-2-csi-volumesnapshotclasses.json.gz
[2024-04-11 22:22:02 PDT]    29B STANDARD test-backup-2-csi-volumesnapshotcontents.json.gz
[2024-04-11 22:22:01 PDT]    29B STANDARD test-backup-2-csi-volumesnapshots.json.gz
[2024-04-11 22:22:01 PDT]    27B STANDARD test-backup-2-itemoperations.json.gz
[2024-04-11 22:21:48 PDT] 247KiB STANDARD test-backup-2-logs.gz
[2024-04-11 22:22:03 PDT]    29B STANDARD test-backup-2-podvolumebackups.json.gz
[2024-04-11 22:22:02 PDT]  73KiB STANDARD test-backup-2-resource-list.json.gz
[2024-04-11 22:22:02 PDT]    49B STANDARD test-backup-2-results.gz
[2024-04-11 22:22:03 PDT] 1.1KiB STANDARD test-backup-2-volumeinfo.json.gz
[2024-04-11 22:22:03 PDT]    29B STANDARD test-backup-2-volumesnapshots.json.gz
[2024-04-11 22:22:01 PDT]  22MiB STANDARD test-backup-2.tar.gz
[2024-04-11 22:22:04 PDT] 2.9KiB STANDARD velero-backup.json
[root@api.o1-726938]#
dove-young commented 3 months ago

In the Velero pod log I can see lots of restore records, but near the end there are also some failures like:

Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource

time="2024-04-12T07:08:10Z" level=info msg="Resource 'statefulsets.apps' will be restored into namespace 'instana-clickhouse'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'storages.operator.openshift.io' will be restored at cluster scope" logSource="pkg/restore/restore.go:2266" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'storageversionmigrations.migration.k8s.io' will be restored at cluster scope" logSource="pkg/restore/restore.go:2266" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'strimzipodsets.core.strimzi.io' will be restored into namespace 'instana-kafka'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'templates.template.openshift.io' will be restored into namespace 'openshift'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'tuneds.tuned.openshift.io' will be restored into namespace 'openshift-cluster-node-tuning-operator'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'units.instana.io' will be restored into namespace 'instana-units'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'validatingwebhookconfigurations.admissionregistration.k8s.io' will be restored at cluster scope" logSource="pkg/restore/restore.go:2266" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Skipping restore of resource because the restore spec excludes it" logSource="pkg/restore/restore.go:2202" resource=volumeattachments.storage.k8s.io restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Resource 'zookeeperclusters.zookeeper.pravega.io' will be restored into namespace 'instana-clickhouse'" logSource="pkg/restore/restore.go:2264" restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=clusterbootstraps.run.tanzu.vmware.com restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=clusters.cluster.x-k8s.io restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=clusterresourcesets.addons.cluster.x-k8s.io restore=velero/test-backup-2-20240412000742
time="2024-04-12T07:08:10Z" level=info msg="waiting informer cache sync ..." logSource="pkg/restore/restore.go:604" restore=velero/test-backup-2-20240412000742
E0412 07:08:10.381344       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
E0412 07:08:10.392096       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
I0412 07:08:11.273994       1 request.go:690] Waited for 1.089987061s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0
E0412 07:08:11.289286       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
E0412 07:08:11.897525       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
dove-young commented 3 months ago

I checked the ClusterRoleBinding; it binds the Velero service account to cluster-admin, which looks correct.

[root@api.o1-726938~]# oc get ClusterRoleBinding/velero
NAME     ROLE                        AGE
velero   ClusterRole/cluster-admin   52m
[root@api.o1-726938 ~]# 
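The binding above covers RBAC, but the "Failed to watch" errors in the log can also be checked directly against the API server. A hedged sketch, assuming Velero runs under the `velero` service account in the `velero` namespace (the usual default):

```shell
# Verify the Velero service account is allowed to watch one of the
# resources the reflector complained about (servicemonitors appears
# in the throttling log line above).
oc auth can-i watch servicemonitors.monitoring.coreos.com \
  --as=system:serviceaccount:velero:velero --all-namespaces

# "no" would point at RBAC; "yes" suggests the API server simply does
# not support watch for that resource (common for aggregated APIs).
```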
sseago commented 3 months ago

Looking at the bundle you uploaded, it looks like the Velero pod restarted while your restore was running. On restart, Velero marked the restore as failed, since Velero can't resume an in-progress backup or restore after a crash in the middle; you'd need to rerun the restore. The `velero restore logs` command failed because restore logs are only uploaded when a restore completes, so since this restore never finished, the logs were never uploaded.

Next step would be to figure out why your Velero pod crashed. Does it restart often? Did the restore cause OOM issues? If so, bumping up the memory/CPU resources in your Velero deployment might help.
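One way to follow up on that suggestion (a sketch, not Velero-endorsed steps; the `deploy=velero` label is the install default, and the resource values are placeholders to adjust for your cluster):

```shell
# Check whether the Velero pod has restarted, and why (look for OOMKilled).
kubectl -n velero get pods
kubectl -n velero describe pod -l deploy=velero | grep -A 5 'Last State'

# If the pod was OOMKilled, raise the limits on the Velero deployment.
# This assumes the default install already set resource limits, so the
# JSON-patch "replace" paths exist.
kubectl -n velero patch deployment velero --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"},
  {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/cpu","value":"1"}
]'
```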

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been stalled for 14 days with no activity.