benjaminxie opened this issue 2 years ago
@benjaminxie Thanks for digging into this issue. Velero v1.7.0 integrated with Restic v0.12, so it should already include the fix you mentioned. That issue seems related to restore with a selector, which Velero doesn't support in pod volume restore, so it may not be the same issue. Could you help collect the debug bundle of the backup used by this restore? And is there anything noteworthy about your use case? We haven't heard this error reported before.
@blackpiglet Thanks for your response!
Could you help to collect the backup used by the restore's debug bundle?
I'm not sure what you mean by this. Are you asking me to upload the corresponding backup that's in my Azure storage account? If so, I think they may have been deleted.
So here's a different debug bundle, bundle-2022-06-14-16-53-51.tar.gz, and here is the corresponding backup, backup.zip.
And is there any noteworthy difference of your use case?
I don't think I'm doing anything out of the ordinary but I think I can provide access to the environment where this bug is happening if that helps.
I also found some more things today with this bug. I can avoid the bug if I cut enough resources out of my Helm chart; however, if I so much as add back the load balancer resource, the bug reappears.
The Helm chart with the working restore had a deployment, a persistent volume claim, a job that downloads a JDBC jar (this is the file that's missing after restore) and puts it into the volume, a storage class (file.csi.azure.com), and a secret (which I think can be taken out, but I didn't check).
@benjaminxie Oh, I found that the backup-related information and logs are also included in the previously uploaded bundle file. I thought that bundle only included restore-related information.
I found something noteworthy: there are two PodVolumeRestores pointing to the same volume.
PodVolumeRestore: jpcluster-hisfiles-20220606-1-20220613161022-28md4
"spec": {
"backupStorageLocation": "default",
"pod": {
"kind": "Pod",
"name": "ecosys-6fb88b5666-cz6lk",
"namespace": "ecosys",
"uid": "3975513b-8c25-4f4f-8c56-8ef20067f16d"
},
"repoIdentifier": "azure:bxie-backup:/restic/ecosys",
"snapshotID": "824215f9",
"volume": "ecosys-pvc"
}
PodVolumeRestore: jpcluster-hisfiles-20220606-1-20220614115206-7dzh7
"spec": {
"backupStorageLocation": "default",
"pod": {
"kind": "Pod",
"name": "ecosys-6fb88b5666-cz6lk",
"namespace": "ecosys",
"uid": "d312b804-2e67-4e69-a963-91b4a739ce9e"
},
"repoIdentifier": "azure:bxie-backup:/restic/ecosys",
"snapshotID": "824215f9",
"volume": "ecosys-pvc"
}
Looks like the two pods are managed by the same deployment, and they both mount the same PVC. That causes two PodVolumeRestores to both run the Restic command to restore data into the same directory, which may be the cause of the problem. I think you can try skipping the volume backup and restore for one of the pods to see whether that fixes your error. Please use this link as a reference for how to apply the opt-out approach with Restic (a sketch is shown below): https://velero.io/docs/v1.9/restic/#using-the-opt-out-approach
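For reference, here is a minimal sketch of the opt-out annotation from that doc, using the pod, namespace, and volume names from the PodVolumeRestore specs above (substitute whichever pod you want to skip):

# Annotate one of the two pods so Velero's Restic integration skips this volume
# on the next backup. Pod, namespace, and volume names are taken from the
# PodVolumeRestore specs above; adjust them to your environment.
kubectl -n ecosys annotate pod/ecosys-6fb88b5666-cz6lk \
    backup.velero.io/backup-volumes-excludes=ecosys-pvc

Since the pods are managed by a deployment, the annotation would normally go on the deployment's pod template so it survives pod recreation; also note that the opt-out approach applies when backups run with --default-volumes-to-restic.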
@blackpiglet I tried your suggestion, but the problem still comes up. Here's the debug bundle if you want it: bundle-2022-06-15-19-13-55.tar.gz
@benjaminxie I checked the log. It looks like we need to read some Restic and MinIO code to dig deeper. I tried to trace it down, but it seems the storage backend is just complaining that the object is supposed to be a file while it turns out to be a directory instead.
Posting the error log here for other colleagues who are more familiar with Restic to start from.
level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=azure:ecosys-dr-backup:/restic/ecosys --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic a40a5e5a --target=., stdout=restoring <Snapshot a40a5e5a of [/host_pods/c92930f8-4172-4e34-9c59-cd8644e0a711/volumes/kubernetes.io~csi/pvc-1dfe3635-856e-4ce3-8f08-de3ddda993d1/mount] at 2022-06-15 18:08:15.270383686 +0000 UTC by root@velero> to .\n, stderr=ignoring error for /lib/jdbc.jar: open /host_pods/6be8bd89-5ad8-4f5b-9e93-6136be41336f/volumes/kubernetes.io~csi/pvc-737dc392-4206-4cb6-892d-fc8d2e8af129/mount/lib/jdbc.jar: is a directory\nignoring error for /lib/jdbc.jar: open /host_pods/6be8bd89-5ad8-4f5b-9e93-6136be41336f/volumes/kubernetes.io~csi/pvc-737dc392-4206-4cb6-892d-fc8d2e8af129/mount/lib/jdbc.jar: is a directory\nFatal: There were 2 errors\n\n: exit status 1" logSource="pkg/restore/restore.go:1433" restore=velero/exclude-volume-backup-on-one-pod-2-20220615181112
@benjaminxie Could you try using Restic to restore the problematic backup manually and see what happens? If you see the same problem, please share the restic restore log with us.
You can refer to https://restic.readthedocs.io/en/latest/050_restore.html for how to restore a backup through Restic.
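For example, a rough sketch of a manual restore against the repository from the error log above (the repository string and snapshot ID come from that log; the environment variables and target directory are placeholders you would fill in for your own Azure storage account):

# Restic reads the Azure credentials and the repository password from the environment.
export AZURE_ACCOUNT_NAME=<storage account name>
export AZURE_ACCOUNT_KEY=<storage account key>
export RESTIC_PASSWORD=<repository password used by Velero>

# Repository and snapshot ID as reported in the error log above.
restic -r azure:ecosys-dr-backup:/restic/ecosys restore a40a5e5a --target /tmp/restore-test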
@Lyndon-Li Thanks so much for taking on this issue!
I manually ran the restic command that was in the output of velero restore describe, something like restic -r <repo string> restore <snapshotID> --target ., and got the same error. I'm not sure what you meant by the restic restore logs, but I think I found the root cause.
After running the restic restore command, I expect the working directory to contain a file ~/lib/jdbc.jar. However, before running the command, there already exists a directory ~/lib/jdbc.jar. In other words, the lib folder contains another folder called jdbc.jar. So, as we've already noted, restic sees this directory in the place where it wants to put the jdbc.jar file, and that's why the error happens.
But where does this directory come from? Originally the jdbc.jar file got there by being mounted from a persistent volume, and it got into the PV via a job that mounts the PV and downloads the file into it. The job is deleted after it completes via Helm hooks, so it is not captured by the Velero backup. Because the job does not exist after the Velero restore, there is no jdbc.jar file in the volume to mount into my deployment's pod. However, since the YAML still dictates mounting this non-existent file, I instead end up with the directory ~/lib/jdbc.jar.
Perhaps Velero needs a different way of using restic to restore files to the PV in order to fix this issue?
@benjaminxie Could you share the YAML snippet that mounts the jar file?
@Lyndon-Li Here you go
volumeMounts:
  - mountPath: /opt/ecosys/ESFM_HOME/lib/jdbc.jar
    subPath: ./lib/jdbc.jar
    name: ecosys-pvc
@benjaminxie Kubernetes subPath is implemented by bind mount. Under current Kubernetes behavior, if the mount source doesn't exist, Kubernetes creates a directory for it. Kubernetes actually does a two-phase mount: first from the PV's host path to the pod container's host path, then from the pod container's host path to the container mount path by Docker. This is why you see a jdbc.jar directory at the mount source.
In conclusion, the problem comes down to this: creating the pod relies not only on the existence of the PV, but also on the existence of the PV's data. However, in Velero's current Restic restore workflow, the pod is created first, then PVs are provisioned by Kubernetes, and finally Restic runs inside the pod to restore the PV's data.
Therefore, I am afraid this is a generic problem for pods with subPath.
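To make the failure mode concrete, here is a minimal sketch of the pattern described above; only the PVC name and the jar path come from this thread, while the container name and claim wiring are illustrative:

containers:
  - name: ecosys                     # illustrative container name
    volumeMounts:
      - name: ecosys-pvc
        mountPath: /opt/ecosys/ESFM_HOME/lib/jdbc.jar
        # If lib/jdbc.jar does not yet exist in the PV when the pod starts,
        # kubelet creates it as an empty directory, and the Restic restore
        # that runs afterwards then fails with "is a directory".
        subPath: ./lib/jdbc.jar
volumes:
  - name: ecosys-pvc
    persistentVolumeClaim:
      claimName: ecosys-pvc          # illustrative claim name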
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.
What steps did you take and what happened: I am restoring my cluster and, after exec-ing into a pod, finding that a mounted file is not properly restored. Instead of finding my file at the expected path, I find a directory with the same name. Moreover, in the output of velero restore describe, the error message includes open <file path>: is a directory.
What did you expect to happen: After restoring the cluster, my pod's container should still have the file mounted at the expected path.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
bundle-2022-06-13-18-54-57.tar.gz
Anything else you would like to add: This behavior only happens when I restore using Velero v1.7.0 or v1.8.1, not Velero v1.6.3. Thus, I thought this was related to this restic issue; however, Velero v1.6.3 is already past the problematic restic version specified in the linked issue.
Environment:
Using AKS.