[Restore] Velero timed out waiting for all PodVolumeRestores to complete

leandreArturia commented 2 months ago

What steps did you take and what happened:

Installed Velero 1.13.0 with CSI snapshot via Helm (vmware-tanzu/velero --version 6.0.0) . Here is my configuration :

configuration:
    backupStorageLocation:
    - name: default
      provider: aws
      bucket: xxx
      caCert: xxx
      config:
        s3Url: minio_underCA
        pulicUrl: minio_underCA
        region: minio
        s3ForcePathStyle: true
    volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: minio
        s3ForcePathStyle: true
    features: EnableCSI
  snapshotsEnabled: true
  credentials:
    existingSecret: velero-credential
  initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.2
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.7.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

I have a minIO deployed with self-signed certificate.

I have a backup with some resources and a pvc backed up with volumesnapshots (a pretty large PVC : 160GB with ~117GB used) :

apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.26.9+rke2r1
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "26"
  creationTimestamp: "2024-04-17T12:59:35Z"
  generation: 7
  labels:
    velero.io/storage-location: default
  name: jenkins-rd-backup-manual
  namespace: velero
  resourceVersion: "244736415"
  uid: e9c0e140-3806-4963-bbda-05bc1585d94e
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  hooks: {}
  includedNamespaces:
  - jenkins-rd
  itemOperationTimeout: 4h0m0s
  metadata: {}
  snapshotMoveData: false
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations:
  - default
status:
  backupItemOperationsAttempted: 2
  backupItemOperationsCompleted: 2
  completionTimestamp: "2024-04-17T13:36:41Z"
  csiVolumeSnapshotsAttempted: 1
  csiVolumeSnapshotsCompleted: 1
  expiration: "2024-05-17T13:01:42Z"
  formatVersion: 1.1.0
  hookStatus: {}
  phase: Completed
  progress:
    itemsBackedUp: 47
    totalItems: 47
  startTimestamp: "2024-04-17T13:01:42Z"
  version: 1

The podvolumebackup :

apiVersion: velero.io/v1
kind: PodVolumeBackup
metadata:
  annotations:
    velero.io/pvc-name: jenkins
  creationTimestamp: "2024-04-17T13:01:53Z"
  generateName: jenkins-rd-backup-manual-
  generation: 212
  labels:
    velero.io/backup-name: jenkins-rd-backup-manual
    velero.io/backup-uid: e9c0e140-3806-4963-bbda-05bc1585d94e
    velero.io/pvc-uid: 8cfb458c-be43-43f2-be73-a10fd80727c3
  name: jenkins-rd-backup-manual-wdzns
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: jenkins-rd-backup-manual
    uid: e9c0e140-3806-4963-bbda-05bc1585d94e
  resourceVersion: "244736381"
  uid: 9e36e859-afb9-461c-88bf-2492af7d5345
spec:
  backupStorageLocation: default
  node: master2
  pod:
    kind: Pod
    name: jenkins-0
    namespace: jenkins-rd
    uid: 37c1022c-c132-447f-b990-30eadcd8e833
  repoIdentifier: ""
  tags:
    backup: jenkins-rd-backup-manual
    backup-uid: e9c0e140-3806-4963-bbda-05bc1585d94e
    ns: jenkins-rd
    pod: jenkins-0
    pod-uid: 37c1022c-c132-447f-b990-30eadcd8e833
    pvc-uid: 8cfb458c-be43-43f2-be73-a10fd80727c3
    volume: jenkins-home
  uploaderType: kopia
  volume: jenkins-home
status:
  completionTimestamp: "2024-04-17T13:36:39Z"
  path: /host_pods/37c1022c-c132-447f-b990-30eadcd8e833/volumes/kubernetes.io~csi/pvc-8cfb458c-be43-43f2-be73-a10fd80727c3/mount
  phase: Completed
  progress:
    bytesDone: 117731913034
    totalBytes: 117731913034
  snapshotID: c734042f625a2e6bf8cd79cb7958d709
  startTimestamp: "2024-04-17T13:01:53Z"

When I try to restore this backup, I get a timeout after ~4 hours.

Output of kubectl -n velero get podvolumerestores -l velero.io/restore-name=jenkins-rd-backup-manual-20240422144640 -o yaml:

apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: "2024-04-22T12:46:45Z"
    generateName: jenkins-rd-backup-manual-20240422144640-
    generation: 1
    labels:
      velero.io/pod-uid: 9d9b1fa0-b4e4-4a9e-ac8d-280e5ac4e72d
      velero.io/pvc-uid: c3afe3c3-e88a-4302-82f5-0cf30f16c56c
      velero.io/restore-name: jenkins-rd-backup-manual-20240422144640
      velero.io/restore-uid: 1c4295f3-f63b-4cce-9939-40f7d628e20e
    name: jenkins-rd-backup-manual-20240422144640-w6m78
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: jenkins-rd-backup-manual-20240422144640
      uid: 1c4295f3-f63b-4cce-9939-40f7d628e20e
    resourceVersion: "250182337"
    uid: e5e0350d-b014-469e-ac0e-03790a818d79
  spec:
    backupStorageLocation: default
    pod:
      kind: Pod
      name: jenkins-0
      namespace: jenkins-rd
      uid: 9d9b1fa0-b4e4-4a9e-ac8d-280e5ac4e72d
    repoIdentifier: ""
    snapshotID: c734042f625a2e6bf8cd79cb7958d709
    sourceNamespace: jenkins-rd
    uploaderType: kopia
    volume: jenkins-home
  status:
    progress: {}
kind: List
metadata:
  resourceVersion: ""

The progress is never updated.

And I get a timeout error in the logs :

time="2024-04-22T16:46:41Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1891" restore=velero/jenkins-rd-backup-manual-20240422144640
time="2024-04-22T16:46:41Z" level=error msg="Velero restore error: timed out waiting for all PodVolumeRestores to complete" logSource="pkg/controller/restore_controller.go:573" restore=velero/jenkins-rd-backup-manual-20240422144640

The volume is created but empty.

Logs of the restore and velero debug --restore jenkins-rd-backup-manual-20240422144640 :

bundle-2024-04-24-15-22-16.tar.gz

velero_restore_logs.txt

What did you expect to happen:

To get a full restore.

The following information will help us better understand what's going on:

Environment:

Velero version (use velero version): Server 1.13.0 Client 1.9.1
Velero features (use velero client config get features): None
Kubernetes version (use kubectl version): Server Version: v1.26.9+rke2r1
Kubernetes installer & version: rke2r1
Cloud provider or hardware configuration: bare metal
OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now"

allenxu404 commented 2 months ago

Based on the status of the PodVolumeRestore, it appears that the PodVolumeRestore was not processed by the node agent:

 status:
    progress: {}

To troubleshoot this issue, please check the overall state of your cluster. Specifically, you may want to verify:

The node agent has been successfully deployed on the associated node.
The associated pod has been restored and is running as expected.

Reviewing these cluster conditions should help you identify the root cause and resolve the issue.

leandreArturia commented 2 months ago

The node agent is correctly installed and running as daemonset on every node of the cluster.

The pod is restored but in pending state with the error :

0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

And my PVC is in a pending state also with the error (don't check the uid of the pvc, it's a test to restore a little application) :

failed to provision volume with StorageClass "longhorn": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-b2e25434-d8fc-45da-b709-2b2c6055d235: failed to verify data source: volume.longhorn.io "pvc-ce514251-9e56-4999-9e2d-bffae4ceed16" not found, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]

After a little search, I did find an issue on longhorn that is relevant in my case, because the PVC has its datasource set to the volumesnapshot for the restore : https://github.com/longhorn/longhorn/issues/4083

I think my PVC (that is linked to the pod to restore) is created but not restored correctly :

I use longhorn 1.4.0 as storage class, and it has some limitation when restore PVC from CSI VolumeSnapshot Associated with Longhorn Snapshot : volume must be in the attached state to create/restore snapshot using CSI VolumeSnapshot mechanism.

Don't hesitate if you see that I'm on the wrong track.

I will update longhorn and update this issue when it's done.

blackpiglet commented 1 month ago

I'm confused about this issue.

First, the Velero Helm Chart values file doesn't specify the deployNodeAgent, so I suppose the node-agent DaemonSet should not be installed with the Helm Chart.
Second, say the node-agent was installed later, and the PodVolumeBackup was created during backup. That should not related to the VolumeSnapshot issue.

Then please check whether the backed-up volume's mounting pod is annotated as backup.velero.io/backup-volumes=<volume-name>. And please check the uploader of the filesystem backup is Restic or Kopia. The PVB says it uses Kopia, but the restore says it uses Restic. Another thing worth notice is the Velero client and server version mismatch. Please align them.

Client:
    Version: v1.9.1
    Git commit: e4c84b7b3d603ba646364d5571c69a6443719bf2
Server:
    Version: v1.13.0

And please use the v1.13.2 version of Velero. It has some bug fixes.

vmware-tanzu / velero

[Restore] Velero timed out waiting for all PodVolumeRestores to complete #7735