vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Timed out awaiting reconciliation of volumesnapshot #7427

Open ShubhamTatvamasi opened 7 months ago

ShubhamTatvamasi commented 7 months ago

bundle-2024-02-14-16-16-03.tar.gz

➜  ~ velero describe backups ops-6hrly-backup-20240209120010
Name:         ops-6hrly-backup-20240209120010
Namespace:    velero
Labels:       argocd.argoproj.io/instance=k8s-configs-ops
              velero.io/schedule-name=ops-6hrly-backup
              velero.io/storage-location=default
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{},"labels":{"argocd.argoproj.io/instance":"k8s-configs-ops"},"name":"ops-6hrly-backup","namespace":"velero"},"spec":{"schedule":"0 6,12,18 * * *","template":{"hooks":{},"includedNamespaces":["*"],"ttl":"336h0m0s"}}}

  velero.io/resource-timeout=10m0s
  velero.io/source-cluster-k8s-gitversion=v1.27.8
  velero.io/source-cluster-k8s-major-version=1
  velero.io/source-cluster-k8s-minor-version=27

Phase:  PartiallyFailed (run `velero backup logs ops-6hrly-backup-20240209120010` for more information)

Warnings:
  Velero:     <none>
  Cluster:   resource: /persistentvolumes name: /rstudio-data
  Namespaces: <none>

Errors:
  Velero:
             name: /keycloakx-pgsql-0 error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=keycloakx, name=velero-pgdata-keycloakx-pgsql-0-7qtpc): rpc error: code = Unknown desc = timed out waiting for the condition
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  *
  Excluded:  graylog

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  336h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-02-09 17:30:10 +0530 IST
Completed:  2024-02-09 17:40:44 +0530 IST

Expiration:  2024-02-23 17:30:10 +0530 IST

Total items to be backed up:  2488
Items backed up:              2488

Backup Volumes:
  <error getting backup volume info: DownloadRequest.velero.io "ops-6hrly-backup-20240209120010-02b56e67-5c36-46ee-9667-86acd21f038a" is invalid: spec.target.kind: Unsupported value: "BackupVolumeInfos": supported values: "BackupLog", "BackupContents", "BackupVolumeSnapshots", "BackupItemOperations", "BackupResourceList", "BackupResults", "RestoreLog", "RestoreResults", "RestoreResourceList", "RestoreItemOperations", "CSIBackupVolumeSnapshots", "CSIBackupVolumeSnapshotContents">
➜  ~ velero backup logs ops-6hrly-backup-20240209120010 | grep "level=error"
time="2024-02-09T12:10:15Z" level=error msg="Timed out awaiting reconciliation of volumesnapshot keycloakx/velero-pgdata-keycloakx-pgsql-0-7qtpc" backup=velero/ops-6hrly-backup-20240209120010 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:192" pluginName=velero-plugin-for-csi
time="2024-02-09T12:10:15Z" level=error msg="Error backing up item" backup=velero/ops-6hrly-backup-20240209120010 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=keycloakx, name=velero-pgdata-keycloakx-pgsql-0-7qtpc): rpc error: code = Unknown desc = timed out waiting for the condition" logSource="pkg/backup/backup.go:448" name=keycloakx-pgsql-0
➜  ~ velero version
Client:
    Version: v1.13.0
    Git commit: -
Server:
    Version: v1.12.2

ywk253100 commented 7 months ago

Could you check the status of the VolumeSnapshotContent and confirm whether Status.SnapshotHandle is still nil after waiting longer than the timeout (the default is 10 mins)?

If Status.SnapshotHandle is populated only after more than 10 mins, you can increase the timeout when creating the backup with the --csi-snapshot-timeout option.
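
As a rough aid for that check, here is a minimal Python sketch that lists VolumeSnapshotContents whose status.snapshotHandle is not yet set. It assumes you feed it the JSON printed by `kubectl get volumesnapshotcontent -o json`. The function name `pending_snapshot_contents` is made up for illustration and is not part of Velero or the CSI plugin.

```python
import json


def pending_snapshot_contents(kubectl_json: str) -> list[str]:
    """Return names of VolumeSnapshotContents that have no status.snapshotHandle.

    Hypothetical helper: expects the text produced by
    `kubectl get volumesnapshotcontent -o json` (a List object) or the JSON
    of a single VolumeSnapshotContent.
    """
    doc = json.loads(kubectl_json)
    # A List object carries its entries under "items"; a single object does not.
    items = doc.get("items", [doc])
    pending = []
    for item in items:
        # "status" may be missing entirely while the CSI driver is still working.
        status = item.get("status") or {}
        if not status.get("snapshotHandle"):
            pending.append(item["metadata"]["name"])
    return pending
```

If the name bound to the failing VolumeSnapshot still shows up in the result well after the backup has failed, the snapshot is probably not completing on the storage side, and a longer timeout alone may not help.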

alievrouw commented 7 months ago

I ran into the same problem, and it turned out to be a mismatch between the CLI version and the server version. The v1.13.0 release notes mention breaking changes to the backup describe command:

https://github.com/vmware-tanzu/velero/releases/tag/v1.13.0

I'm running server version v1.11.1, and downgrading from CLI v1.13.0 to v1.11.1 made this error go away. I hope this helps.
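
To spot this kind of mismatch quickly, something like the following Python sketch could compare the client and server versions in the text printed by `velero version`. It assumes the output layout shown earlier in this issue (a "Client:" and a "Server:" section, each with a "Version: vX.Y.Z" line). The helper names are hypothetical, and treating any difference in major.minor as a mismatch is my own simplification, not an official compatibility rule.

```python
import re


def parse_velero_versions(output: str) -> dict[str, tuple[int, ...]]:
    """Extract {"client": (major, minor, patch), "server": (...)} from `velero version` text."""
    versions = {}
    section = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped in ("Client:", "Server:"):
            # Remember which section the following "Version:" line belongs to.
            section = stripped[:-1].lower()
            continue
        match = re.match(r"Version:\s*v(\d+)\.(\d+)\.(\d+)", stripped)
        if match and section:
            versions[section] = tuple(int(part) for part in match.groups())
    return versions


def minor_mismatch(output: str) -> bool:
    """Return True when client and server differ in major.minor (assumed heuristic)."""
    versions = parse_velero_versions(output)
    if "client" not in versions or "server" not in versions:
        return False  # not enough information to compare
    return versions["client"][:2] != versions["server"][:2]
```

Run against the output in the original report (client v1.13.0, server v1.12.2), `minor_mismatch` returns True.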

Lyndon-Li commented 7 months ago

Here we see two problems:

  1. Backup describe fails with the error "error getting backup volume info"
  2. CSI snapshot timeout

For 1, please follow @alievrouw's suggestion; for 2, please follow the solution @ywk253100 mentioned.