Restic restores fails with error lchown: operation not permitted

amareshgreat commented 2 years ago

What steps did you take and what happened: [A clear and concise description of what the bug is, and what commands you ran.]

I have installed Velero 1.9.1 with the restic integration. I have EFS-backed volumes on an AWS cluster. Backups of volumes and manifests work fine, but when I restore from a backup, the restore fails with the error message below:

time="2022-09-27T09:28:21Z" level=info msg="Waiting for all restic restores to complete" logSource="pkg/restore/restore.go:551" restore=velero/sept12-20220927092909
time="2022-09-27T09:28:31Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error running restic restore, cmd=restic restore --repo=s3:s3-us-east-1.amazonaws.com/otfi-main-use1-dryrun-838605173453-velero-backups/restic/test-nginx --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic 5398f2d3 --target=., stdout=restoring <Snapshot 5398f2d3 of [/host_pods/5f8c7713-0037-41b9-af23-e68f8c9b1fb8/volumes/kubernetes.io~csi/pvc-fb43dca1-92c0-484b-89f2-3304758b11fa/mount] at 2022-09-27 09:26:45.099291784 +0000 UTC by root@velero> to .\n, stderr=ignoring error for /testefs.txt: Lchown: lchown /host_pods/2b5078f0-3669-45b4-bba1-d06282782887/volumes/kubernetes.io~csi/pvc-92aea1b9-3fb9-4cf2-adda-240191b38cd4/mount/testefs.txt: operation not permitted\nFatal: There were 1 errors\n\n: exit status 1" logSource="pkg/restore/restore.go:1560" restore=velero/sept12-20220927092909

What did you expect to happen:

The restore should have completed successfully.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help

bundle-2022-09-27-10-01-29.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

Velero version (use velero version): 1.9.1
Velero features (use velero client config get features): NOT SET
Kubernetes version (use kubectl version): 1.21
Kubernetes installer & version: AWS EKS
Cloud provider or hardware configuration: AWS and EFS as persistent volume
OS (e.g. from /etc/os-release):

sseago commented 2 years ago

I believe that this is a known issue with EFS volumes and restic. The problem is that the uid/gid are different in the restored volume. See the below comment on a similar issue: https://github.com/vmware-tanzu/velero/issues/2958

" This is bugging me as well. I may have figured out, what the issue here is.

If you use efs-csi dynamic provisioning of volumes, every pvc gets a unique access-point (with a unique uid/gid) for the efs volume. If you then restore a volume, restic tries to change the ownership back to the old uid/gid, which is not possible.

To solve this, velero-restic-helper would need an option to ignore the old uid/gid while restoring.

If you use a "static" EFS PV/PVC, the uid/gid won't change, so the restore works as expected. One big issue with this solution is that you have to manually delete and recreate the PV before starting the restore."
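
To illustrate the failure mode described in the quoted comment, here is a minimal Python sketch. It is not restic's or Velero's actual code. It assumes a hypothetical `access_point_chown` stand-in that rejects every ownership change, the way the dynamically provisioned EFS volume is described as behaving above; `restore_ownership` is likewise a hypothetical name. It collects per-file errors and reports a count, loosely mirroring the `Fatal: There were 1 errors` line in the log.

```python
import errno
import os


def access_point_chown(path, uid, gid):
    """Hypothetical stand-in for lchown on a volume that refuses ownership changes."""
    raise PermissionError(errno.EPERM, "operation not permitted", path)


def restore_ownership(entries, chown=os.lchown):
    """Try to reapply the recorded uid/gid to each restored path.

    entries is an iterable of (path, uid, gid) tuples taken from the backup.
    Returns a list of error messages, one per path whose ownership could
    not be restored.
    """
    errors = []
    for path, uid, gid in entries:
        try:
            chown(path, uid, gid)
        except PermissionError as exc:
            errors.append(f"ignoring error for {path}: lchown: {exc.strerror}")
    return errors


if __name__ == "__main__":
    # Assumed example entry: one restored file with the old uid/gid 1000/1000.
    errs = restore_ownership(
        [("/mount/testefs.txt", 1000, 1000)], chown=access_point_chown
    )
    for line in errs:
        print(line)
    if errs:
        print(f"Fatal: There were {len(errs)} errors")
```

The file contents are restored either way; only the attempt to reapply the old owner fails, which is why an option to skip or tolerate that step would be enough to let the restore finish.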

sseago commented 2 years ago

It looks like you may be able to work around this with static provisioning. We have not yet tried to find a long-term fix for the problem in velero itself, but it may be that the suggestion for velero-restic-restore-helper would work.

sseago commented 2 years ago

@Lyndon-Li I'm guessing that kopia would fail in a similar way here, but it's worth investigating whether there's an easier fix for the problem with kopia than with restic.

amareshgreat commented 2 years ago

@sseago - We have a hard requirement for dynamic provisioning of PVCs. Is there any workaround to restore a dynamically provisioned EFS-based PVC? Is there an option to ignore the old uid/gid using velero-restic-helper/velero-restic-restore-helper?

sseago commented 2 years ago

@amareshgreat At this point we don't have a fix, so if the workaround isn't possible in your environment, the only other option would be to wait until a fix can be developed and put into a release. I've seen at least one other person hit this bug recently, so it may be time to prioritize getting a fix in place here.

Lyndon-Li commented 2 years ago

@sseago Kopia could solve this problem by specifying either of the two options below:

--skip-owners: if specified, Kopia restore will skip restoring the uid/gid
--ignore-permission-errors: if specified, Kopia restore will ignore the error if it is a permission error

Therefore, what Velero needs to do is expose similar flags for PVR and then pass the same options to Kopia.

However, we don't plan any user experience changes for PVB/PVR in Velero v1.10, so we will add this in the next release together with some other new flags.
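
As a rough illustration of how the two options differ, the sketch below models them in Python. It is not Kopia's implementation: `restore_owner`, `deny_chown`, and the returned status strings are hypothetical, and only the two option names come from the comment above.

```python
import errno
import os


def deny_chown(path, uid, gid):
    """Hypothetical chown that always fails, standing in for the EFS volume."""
    raise PermissionError(errno.EPERM, "operation not permitted", path)


def restore_owner(path, uid, gid, *, skip_owners=False,
                  ignore_permission_errors=False, chown=os.lchown):
    """Model the two restore options for a single path.

    skip_owners: never attempt to change ownership.
    ignore_permission_errors: attempt it, but swallow a permission error.
    Returns "skipped", "ignored", or "restored"; re-raises the permission
    error when neither option covers the failure.
    """
    if skip_owners:
        return "skipped"
    try:
        chown(path, uid, gid)
    except PermissionError:
        if ignore_permission_errors:
            return "ignored"
        raise
    return "restored"


if __name__ == "__main__":
    print(restore_owner("testefs.txt", 1000, 1000,
                        skip_owners=True, chown=deny_chown))
    print(restore_owner("testefs.txt", 1000, 1000,
                        ignore_permission_errors=True, chown=deny_chown))
```

In this model, skipping owners avoids the call entirely, so the restored file keeps whatever owner the new volume assigns, while ignoring permission errors still tries to restore the owner and only tolerates the failure.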

reasonerjt commented 2 years ago

Thanks @Lyndon-Li. So this issue will remain a problem in Velero v1.10, but we may fix it in the Kopia path in a future release.

navilg commented 2 years ago

@Lyndon-Li Are we moving away from restic in future Velero releases, or will restic and Kopia exist in parallel, with users able to choose between them?

sseago commented 2 years ago

@navilg I believe the plan is that we will eventually drop restic support, but they will exist in parallel for some time before then. I don't know that we've made a firm decision as to which release will drop restic.

Lyndon-Li commented 1 year ago

In v1.10, Kopia's IgnorePermissionErrors flag is set to true. This means that when the Kopia uploader encounters the same problem, it will ignore it.

It means this problem has been fixed under Kopia path in v1.10.

Exposing IgnorePermissionErrors in Velero's CLI does not seem to be a priority: ignoring permission errors by default is not a bad thing, and we don't see a situation in which permission errors must not be ignored.

Lyndon-Li commented 1 year ago

Let's verify this in v1.11 for the Kopia path. For the Restic path, since there is no way to fix it, we will leave it as is.

Lyndon-Li commented 1 year ago

@navilg v1.10 kopia path is confirmed to support IgnorePermissionErrors. Therefore, 1.10 kopia path should work with the current scenario. Please try it.

Lyndon-Li commented 1 year ago

Closing this issue as it has been fixed in Kopia path and we have no plan or solution to fix it in Restic path.

vmware-tanzu / velero

Restic restores fails with error lchown: operation not permitted #5403