vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Azure Files error restoring restic volume: invalid id "": no matching ID found #2007

Closed. reddog335 closed this issue 4 years ago.

reddog335 commented 4 years ago

What steps did you take and what happened:

We are using Azure Files in AKS as our persistent storage solution. I have a deployment that uses Azure Files for its persistent volume claim and persistent volume, and I have installed Velero with restic. Backups appear to back up the PVC and PV just fine; however, when I perform the restore it creates a new PV with a different name, plus a second Azure File share in the storage account with no files in it.

Before I removed the ops-itg namespace containing the PV and PVC:

```
kubectl get pv,pvc
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS         REASON   AGE
persistentvolume/pvc-e495741c-f995-11e9-ae16-260815deaa94   5Gi        RWX            Retain           Bound    ops-itg/ops-itg-azure-file   azure-file-std-grs            8m14s

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
persistentvolumeclaim/ops-itg-azure-file   Bound    pvc-e495741c-f995-11e9-ae16-260815deaa94   5Gi        RWX            azure-file-std-grs   8m14s
```

Restore commands:

```
velero get backups
NAME                                  STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
aks-daily-backup-itg-20191028152145   Completed   2019-10-28 10:21:45 -0500 CDT   29d       default
```

```
velero restore create --from-backup aks-daily-backup-itg-20191028152145
```

After the restore completed:

```
kubectl get pv,pvc
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS         REASON   AGE
persistentvolume/pvc-27106be3-f999-11e9-ae16-260815deaa94   5Gi        RWX            Retain           Bound    ops-itg/ops-itg-azure-file   azure-file-std-grs            15m

NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
persistentvolumeclaim/ops-itg-azure-file   Bound    pvc-27106be3-f999-11e9-ae16-260815deaa94   5Gi        RWX            azure-file-std-grs   15m
```

What did you expect to happen:

I expected the PV to be recreated with the same name and the file in the Azure File share to be available to the pods in the deployment.

The output of the following commands will help us better understand what's going on: https://gist.github.com/reddog335/6865b38fbdfa4e6caa71cc7f83b8d10b

Environment:

skriss commented 4 years ago

Can you provide the output of `kubectl -n velero get podvolumebackups -l velero.io/backup-name=aks-daily-backup-itg-20191028152145 -o yaml`?

FYI, the restic integration relies on dynamic provisioning to restore volumes -- so during a restore, it's expected behavior to get a new, dynamically provisioned PV that the backed-up data should be restored into.
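
As an aside for anyone debugging the restore side: the per-volume restore results can be inspected the same way as the backups, via the PodVolumeRestore objects (a sketch; substitute your actual restore name):

```bash
# Each PodVolumeRestore's status shows the phase and any error for one pod volume.
kubectl -n velero get podvolumerestores -l velero.io/restore-name=<restore-name> -o yaml
```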

reddog335 commented 4 years ago

Thanks for looking at this @skriss !

```
apiVersion: v1
items:
```

reddog335 commented 4 years ago

This issue is still tagged as 'Waiting for info'... do you need any further information from me?

reddog335 commented 4 years ago

Is there anything I need to do on my end to progress this issue? (Apologies, this is my first submitted issue and I'm not sure of the protocol.)


skriss commented 4 years ago

@reddog335 apologies for the delayed response, we're working on finalizing v1.2 at the moment :)

The YAML you sent me indicates that velero/restic didn't find any files in any of the volumes (`message: volume was empty so no snapshot was taken`). Can you confirm that there was in fact data in each of those volumes at the time of backup?
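
For a quicker summary than the full YAML, the relevant status fields can be pulled out with jsonpath (a sketch, relying on the `status.phase` and `status.message` fields that appear in the PodVolumeBackup YAML later in this thread):

```bash
# One line per PodVolumeBackup: name, phase, message.
kubectl -n velero get podvolumebackups \
  -l velero.io/backup-name=aks-daily-backup-itg-20191028152145 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.message}{"\n"}{end}'
```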

reddog335 commented 4 years ago

@skriss No worries, I appreciate the help. There was definitely data in the volumes. It's a text file that all three replicas write to simultaneously. After the backup I removed the text file from the Azure File share to test the restore. I can run another test if you'd like.

skriss commented 4 years ago

Hmm, not sure why restic wouldn't be finding the file.

The way the restic backups work is that the velero/restic daemonset uses a `hostPath` mount of `/var/lib/kubelet/pods`, which is the directory on each node in the cluster where pod volumes are mounted. If you look in the YAML above, you'll see a backup path of e.g. `/host_pods/7ba88096-f996-11e9-ae16-260815deaa94/volumes/kubernetes.io~azure-file/pvc-e495741c-f995-11e9-ae16-260815deaa94` (`/host_pods` is the location in the daemonset pod where the `/var/lib/kubelet/pods` directory is mounted).

It'd be really helpful if you could run another test, and prior to backup, do the following (a command-level sketch follows the list):

  1. find the velero/restic daemonset pod running on the same node as one of your workload pods
  2. exec into that velero/restic daemonset pod and do an `ls -la` on the volume directory: `/host_pods/<your-workload-pod-uid>/volumes/kubernetes.io~azure-file/<pv-name>`
  3. if possible, also SSH into the node itself, and do a similar `ls -la` from there on `/var/lib/kubelet/pods/<your-workload-pod-uid>/volumes/kubernetes.io~azure-file/<pv-name>`
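
A hedged sketch of steps 1-3 in commands (pod, node, and path names are placeholders to fill in from your cluster; the `grep restic` filter assumes the daemonset pods are named `restic-*`, as in this thread):

```bash
# 1. Node and UID of the workload pod whose volume we want to inspect.
kubectl -n <namespace> get pod <workload-pod> -o jsonpath='{.spec.nodeName}{"\t"}{.metadata.uid}{"\n"}'

# 2. Find the velero/restic daemonset pod on that node, then list the volume from inside it.
kubectl -n velero get pods -o wide | grep restic
kubectl -n velero exec -it <restic-pod> -- \
  ls -la /host_pods/<workload-pod-uid>/volumes/kubernetes.io~azure-file/<pv-name>

# 3. Compare with the node's own view of the same directory (after SSHing to the node).
ls -la /var/lib/kubelet/pods/<workload-pod-uid>/volumes/kubernetes.io~azure-file/<pv-name>
```
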
reddog335 commented 4 years ago

@skriss I received the same errors with this test and the newly created Azure File Share was empty after the restore. Below are the details:

```
kubectl exec -it restic-mj29r -n velero -- bash

root@restic-mj29r:/# ls -rlt /host_pods/edb00905-fe4c-11e9-9d9f-2ef301709b43/volumes/kubernetes.io~azure-file/pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43
total 9
-rwxr-xr-x 1 1000 1000 9080 Nov 3 15:11 gpfs_test.txt

root@restic-mj29r:/# tail -10 /host_pods/edb00905-fe4c-11e9-9d9f-2ef301709b43/volumes/kubernetes.io~azure-file/pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43/gpfs_test.txt
azure-files-test-itg-6bfb6cbd5d-n8kt7:76
azure-files-test-itg-6bfb6cbd5d-jl566:75
azure-files-test-itg-6bfb6cbd5d-vgnpr:74
azure-files-test-itg-6bfb6cbd5d-n8kt7:77
azure-files-test-itg-6bfb6cbd5d-jl566:76

root@aks-nodepool1-26580627-vmss000002:~# tail -5 /var/lib/kubelet/pods/edb00905-fe4c-11e9-9d9f-2ef301709b43/volumes/kubernetes.io~azure-file/pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43/gpfs_test.txt
azure-files-test-itg-6bfb6cbd5d-n8kt7:76
azure-files-test-itg-6bfb6cbd5d-jl566:75
azure-files-test-itg-6bfb6cbd5d-vgnpr:74
azure-files-test-itg-6bfb6cbd5d-n8kt7:77
azure-files-test-itg-6bfb6cbd5d-jl566:76
```

```
velero backup create manual-backup --exclude-namespaces velero,default --snapshot-volumes=true
```
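
Note that with Velero v1.1, restic only backs up volumes explicitly opted in via the `backup.velero.io/backup-volumes` pod annotation (visible in the pod description further down). A quick sanity check before backing up, as a sketch (the namespace is taken from this test):

```bash
# Print each pod and the value of its backup-volumes annotation.
kubectl -n vcs-itg get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.backup\.velero\.io/backup-volumes}{"\n"}{end}'
```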

```
velero get backup
NAME                                  STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
aks-daily-backup-itg-20191103150531   Completed   2019-11-03 09:05:31 -0600 CST   29d       default
aks-daily-backup-itg-20191101020026   Completed   2019-10-31 21:00:26 -0500 CDT   27d       default
aks-daily-backup-itg-20191031020025   Completed   2019-10-30 21:00:25 -0500 CDT   26d       default
aks-daily-backup-itg-20191030020025   Completed   2019-10-29 21:00:25 -0500 CDT   25d       default
aks-daily-backup-itg-20191029020025   Completed   2019-10-28 21:00:25 -0500 CDT   24d       default
aks-daily-backup-itg-20191028152145   Completed   2019-10-28 10:21:45 -0500 CDT   23d       default
manual-backup                         Completed   2019-11-03 09:33:28 -0600 CST   29d       default
```

```
kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
vcs-itg-azure-file   Bound    pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43   5Gi        RWX            azure-file-std-grs   30m

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS         REASON   AGE
pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43   5Gi        RWX            Retain           Bound    vcs-itg/vcs-itg-azure-file   azure-file-std-grs            30m
```

```
kubectl delete ns vcs-itg
namespace "vcs-itg" deleted

kubectl get pvc
No resources found.

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                        STORAGECLASS         REASON   AGE
pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43   5Gi        RWX            Retain           Released   vcs-itg/vcs-itg-azure-file   azure-file-std-grs            32m
```

```
velero restore create --from-backup manual-backup
Restore request "manual-backup-20191103094041" submitted successfully.
Run `velero restore describe manual-backup-20191103094041` or `velero restore logs manual-backup-20191103094041` for more details.
```

https://gist.github.com/reddog335/f41e2c7eaff51f236041a2637a38cea2

```
kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
vcs-itg-azure-file   Bound    pvc-4c582f0e-fe50-11e9-9d9f-2ef301709b43   5Gi        RWX            azure-file-std-grs   81s

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                        STORAGECLASS         REASON   AGE
pvc-4c582f0e-fe50-11e9-9d9f-2ef301709b43   5Gi        RWX            Retain           Bound      vcs-itg/vcs-itg-azure-file   azure-file-std-grs            84s
pvc-a092dd45-fe4b-11e9-9d9f-2ef301709b43   5Gi        RWX            Retain           Released   vcs-itg/vcs-itg-azure-file   azure-file-std-grs            34m
```

```
kubectl get pod
NAME                                               READY   STATUS     RESTARTS   AGE
aks-helloworld-context-path-itg-59b59b89b4-l5rsx   1/1     Running    0          87s
aks-helloworld-context-path-itg-59b59b89b4-s7k7m   1/1     Running    0          87s
aks-helloworld-context-path-itg-59b59b89b4-sdl5m   1/1     Running    0          86s
aks-helloworld-itg-79fc55b998-67tss                1/1     Running    0          86s
aks-helloworld-itg-79fc55b998-gn2vn                1/1     Running    0          86s
aks-helloworld-itg-79fc55b998-qvn8s                1/1     Running    0          86s
azure-files-test-itg-65bbd9f965-55gt5              0/1     Init:0/1   0          85s
azure-files-test-itg-65bbd9f965-cdjdk              0/1     Init:0/1   0          85s
azure-files-test-itg-65bbd9f965-s8l48              0/1     Init:0/1   0          85s
```

```
kubectl describe pod azure-files-test-itg-65bbd9f965-55gt5
Name:               azure-files-test-itg-65bbd9f965-55gt5
Namespace:          vcs-itg
Priority:           0
PriorityClassName:  <none>
Node:               aks-nodepool1-26580627-vmss000002/10.193.8.66
Start Time:         Sun, 03 Nov 2019 09:40:57 -0600
Labels:             app=azure-files-test-itg
                    pod-template-hash=65bbd9f965
                    velero.io/backup-name=manual-backup
                    velero.io/restore-name=manual-backup-20191103094041
Annotations:        backup.velero.io/backup-volumes: vcs-itg-azure-file
Status:             Pending
IP:                 10.193.8.68
Controlled By:      ReplicaSet/azure-files-test-itg-65bbd9f965
Init Containers:
  restic-wait:
    Container ID:  docker://cd7ce5740aa2d1b2eb38784020452027a95b3912d69236efb6805d4d7429de74
    Image:         gcr.io/heptio-images/velero-restic-restore-helper:v1.1.0
    Image ID:      docker-pullable://gcr.io/heptio-images/velero-restic-restore-helper@sha256:e65015be7de40d47e8df7b3b923f7ce7bfa4a3243c00be50c08b3adc9e69e8af
    Port:          <none>
    Host Port:     <none>
    Args:
      4a9a546f-fe50-11e9-9d9f-2ef301709b43
    State:          Running
      Started:      Sun, 03 Nov 2019 09:40:59 -0600
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:
      POD_NAMESPACE:  vcs-itg (v1:metadata.namespace)
      POD_NAME:       azure-files-test-itg-65bbd9f965-55gt5 (v1:metadata.name)
    Mounts:
      /restores/vcs-itg-azure-file from vcs-itg-azure-file (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9q7q2 (ro)
Containers:
  azure-files-test-itg:
    Container ID:
    Image:       repo.mutualofomaha.com:5003/com.mutualofomaha.img/alpine:3.10-latest
    Image ID:
    Port:        <none>
    Host Port:   <none>
    Command:
      tail
      -f
      /dev/null
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:  <none>
    Mounts:
      /data from vcs-itg-azure-file (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9q7q2 (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  vcs-itg-azure-file:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  vcs-itg-azure-file
    ReadOnly:   false
  default-token-9q7q2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9q7q2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                                        Message
  ----    ------     ----   ----                                        -------
  Normal  Scheduled  2m20s  default-scheduler                           Successfully assigned vcs-itg/azure-files-test-itg-65bbd9f965-55gt5 to aks-nodepool1-26580627-vmss000002
  Normal  Pulled     2m18s  kubelet, aks-nodepool1-26580627-vmss000002  Container image "gcr.io/heptio-images/velero-restic-restore-helper:v1.1.0" already present on machine
  Normal  Created    2m18s  kubelet, aks-nodepool1-26580627-vmss000002  Created container
  Normal  Started    2m18s  kubelet, aks-nodepool1-26580627-vmss000002  Started container
```

```
kubectl delete pod azure-files-test-itg-65bbd9f965-55gt5
pod "azure-files-test-itg-65bbd9f965-55gt5" deleted

kubectl get pod
NAME                                               READY   STATUS     RESTARTS   AGE
aks-helloworld-context-path-itg-59b59b89b4-l5rsx   1/1     Running    0          4m26s
aks-helloworld-context-path-itg-59b59b89b4-s7k7m   1/1     Running    0          4m26s
aks-helloworld-context-path-itg-59b59b89b4-sdl5m   1/1     Running    0          4m25s
aks-helloworld-itg-79fc55b998-67tss                1/1     Running    0          4m25s
aks-helloworld-itg-79fc55b998-gn2vn                1/1     Running    0          4m25s
aks-helloworld-itg-79fc55b998-qvn8s                1/1     Running    0          4m25s
azure-files-test-itg-65bbd9f965-9cjnc              1/1     Running    0          21s
azure-files-test-itg-65bbd9f965-cdjdk              0/1     Init:0/1   0          4m24s
azure-files-test-itg-65bbd9f965-s8l48              0/1     Init:0/1   0          4m24s
```

```
kubectl exec -it azure-files-test-itg-65bbd9f965-9cjnc -- ls -lrt /data
total 0
```

```
kubectl exec -it restic-mj29r -n velero -- bash

root@restic-mj29r:/# ls -rlt /host_pods
total 28
drwxr-x--- 5 root root 4096 Oct 17 12:32 148bc1bf-f0da-11e9-ae16-260815deaa94
drwxr-x--- 5 root root 4096 Oct 25 11:25 156deeb7-f71a-11e9-ae16-260815deaa94
drwxr-x--- 5 root root 4096 Oct 28 15:14 9b39efad-f995-11e9-ae16-260815deaa94
drwxr-xr-x 5 root root 4096 Nov 3 15:04 4ae9d4bb-fe4b-11e9-9d9f-2ef301709b43
drwxr-x--- 5 root root 4096 Nov 3 15:40 532f5f7d-fe50-11e9-9d9f-2ef301709b43
drwxr-x--- 5 root root 4096 Nov 3 15:40 538acc95-fe50-11e9-9d9f-2ef301709b43
drwxr-x--- 5 root root 4096 Nov 3 15:45 e52d0ba8-fe50-11e9-9d9f-2ef301709b43

root@restic-mj29r:/# ls -rlt /host_pods/532f5f7d-fe50-11e9-9d9f-2ef301709b43/volumes
total 4
drwxr-xr-x 3 root root 4096 Nov 3 15:40 kubernetes.io~secret

root@restic-mj29r:/# ls -rlt /host_pods/538acc95-fe50-11e9-9d9f-2ef301709b43/volumes
total 4
drwxr-xr-x 3 root root 4096 Nov 3 15:40 kubernetes.io~secret

root@restic-mj29r:/# ls -rlt /host_pods/e52d0ba8-fe50-11e9-9d9f-2ef301709b43/volumes
total 8
drwxr-xr-x 3 root root 4096 Nov 3 15:45 kubernetes.io~secret
drwx------ 3 root root 4096 Nov 3 15:45 kubernetes.io~azure-file

root@restic-mj29r:/# ls -rlt /host_pods/e52d0ba8-fe50-11e9-9d9f-2ef301709b43/volumes/kubernetes.io~azure-file
total 0
drwxr-xr-x 2 1000 1000 0 Nov 3 15:40 pvc-4c582f0e-fe50-11e9-9d9f-2ef301709b43

root@restic-mj29r:/# ls -rlt /host_pods/e52d0ba8-fe50-11e9-9d9f-2ef301709b43/volumes/kubernetes.io~azure-file/pvc-4c582f0e-fe50-11e9-9d9f-2ef301709b43
total 0
```

reddog335 commented 4 years ago

(Screenshot: the newly created Azure File share is empty.)

skriss commented 4 years ago

Can you additionally provide `kubectl -n velero get podvolumebackups -l velero.io/backup-name=manual-backup -o yaml`?

It looks like the backups are coming back empty again -- I'm really not sure why that's the case, given the files are clearly visible via the restic pod.
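
Another place to look when backups come back empty is the backup log itself, which records restic's per-volume activity (a sketch; the grep filter is just illustrative):

```bash
# Fetch the backup's log from object storage and filter for restic lines.
velero backup logs manual-backup | grep -i restic
```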

reddog335 commented 4 years ago

@skriss Below is the output of the command:

```
kubectl -n velero get podvolumebackups -l velero.io/backup-name=manual-backup -o yaml
```

```
apiVersion: v1
items:
```

skriss commented 4 years ago

@reddog335 I came across this old issue: https://github.com/vmware-tanzu/velero/issues/887, which I think may be coming into play here. Could you first show the YAML for your Azure file storage class, then make the change described there, then try again?

You can take a look at the documentation I put up here: https://github.com/vmware-tanzu/velero/pull/2054/files for details.
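
For readers hitting the same symptom later: the change referenced above concerns the Azure File storage class. My reading of issue #887 and docs PR #2054 is that adding `nouser_xattr` to the class's `mountOptions` is what lets restic see the files, but treat the following as a sketch and defer to the linked docs; the `skuName` is guessed from the class name used in this thread:

```bash
# Sketch: recreate the Azure File storage class with the mount option from the
# linked docs (assumption: nouser_xattr, per velero PR #2054 / issue #887).
# Most StorageClass spec fields are immutable, so delete and recreate the class.
kubectl delete storageclass azure-file-std-grs
cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azure-file-std-grs          # class name taken from this thread
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_GRS             # guessed from the class name; verify first
mountOptions:
  - nouser_xattr                    # the documented fix, per the links above
EOF
```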

reddog335 commented 4 years ago

BOOM!!! That worked like a champ. @skriss , you sir, are a steely-eyed missile man! Thank you so much for the help, I truly appreciate it!!!

nanmor commented 2 years ago

Hi @skriss, I'm hitting the same error, but our storage is not AWS; it's an NFS server running on ppc64le.

```
completionTimestamp: '2022-02-24T09:52:19Z'
message: volume was empty so no snapshot was taken
path: >-
  /host_pods/1950f55b-1e43-48c8-b049-f8292ed4cb2d/volumes/kubernetes.io~nfs/pvc-e0075e9e-22ed-43dd-bb48-11ad24252f8f
phase: Completed
progress: {}
startTimestamp: '2022-02-24T09:52:18Z'
```

I checked the restic daemonset pod; it has this volume, and the volume is not empty:

```
sh-4.4# ls -lar
total 4
drwxrwxrwx. 3 nobody nobody 4096 Feb 24 08:42 pvc-e0075e9e-22ed-43dd-bb48-11ad24252f8f
drwxr-x---. 6 root   root    121 Feb 24 09:26 ..
drwxr-x---. 3 root   root     54 Feb 24 09:26 .
```

What can I do next? Thanks a lot.