Closed jawnsy closed 2 years ago
Hi @jawnsy, I was able to repro easily with:
$ docker run --runtime=sysbox-runc -it --rm nestybox/ubuntu-focal-systemd-docker
# Inside the container:
root@85d93eb89f98:~# docker pull gcr.io/cloud-foundation-cicd/cft/developer-tools:1
1: Pulling from cloud-foundation-cicd/cft/developer-tools
9d48c3bd43c5: Pull complete
9ce9598067e7: Pull complete
278f4c997324: Pull complete
...
failed to register layer: ApplyLayer exit status 1 stdout: stderr: lchown /build/terraform-validator: invalid argument
Will take a look to see what's going on ...
FYI: possible duplicate of issue #187, but will investigate further to confirm.
Update: I straced the docker pull that fails, and it fails here:
5078 fchownat(AT_FDCWD, "/build/terraform-validator", 806984, 89939, AT_SYMLINK_NOFOLLOW <unfinished ...>
5078 <... fchownat resumed>) = -1 EINVAL (Invalid argument)
In contrast, when the docker pull is done outside a Sysbox container, that same instruction works:
8621410:203602 fchownat(AT_FDCWD, "/build/terraform-validator", 806984, 89939, AT_SYMLINK_NOFOLLOW <unfinished ...>
8621421:203602 <... fchownat resumed>) = 0
I don't see mknod (or the lack of it) as causing the problem, so the error looks different from issue #187.
I see the problem: in the fchownat syscall:
fchownat(AT_FDCWD, "/build/terraform-validator", 806984, 89939, AT_SYMLINK_NOFOLLOW <unfinished ...>
The 3rd and 4th params are the uid:gid. These look totally incorrect (they should probably have been set to 0:0 instead).
When running inside a Sysbox container, the user-IDs have a range of 65536, so I suspect the chown to a uid:gid outside this range is causing the kernel to return EINVAL.
For this same reason Podman + rootless also fails:
Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 806984:89939 for /build/terraform-validator): Check /etc/subuid and /etc/subgid: lchown /build/terraform-validator: invalid argument
At host level, the user-IDs have a range of 2^32, so this is not a problem.
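To make the range check concrete, here is a minimal Python sketch of how an ID can be tested against a user-namespace mapping in the format of /proc/self/uid_map (inside ID, outside ID, length). The sample mappings are assumptions for illustration; in particular the host offset 165536 is hypothetical and not taken from this container. Only the 65536-wide range and the 806984:89939 pair come from the output above.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def is_mapped(ranges, ident):
    """Return True if ident falls inside any mapped range of the namespace."""
    return any(start <= ident < start + length for start, _outside, length in ranges)


def read_current_uid_map(path="/proc/self/uid_map"):
    """Read the mapping of the calling process (Linux only)."""
    with open(path) as handle:
        return parse_id_map(handle.read())


# Hypothetical mapping for a container with a 65536-wide ID range.
container_map = parse_id_map("0 165536 65536")
# Mapping of the initial (host) user namespace.
host_map = parse_id_map("0 0 4294967295")

for ident in (806984, 89939):
    print(ident, "container:", is_mapped(container_map, ident),
          "host:", is_mapped(host_map, ident))
```

Under these assumptions both IDs are unmapped in the container and mapped on the host, which is consistent with fchownat returning EINVAL only inside the container.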
I think we can fix this in Sysbox, by ensuring that chowns that exceed 65535 are capped at the 65536 user ID (i.e., nobody).
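As a rough illustration of the capping idea, the Python sketch below clamps any out-of-range ID to a fallback. This is not Sysbox code; `clamp_id` and its parameters are hypothetical names. It assumes the fallback is 65534, the kernel's default overflow ID (see /proc/sys/kernel/overflowuid) and the usual ID of nobody on many distributions, chosen here because it still falls inside a 0..65535 range.

```python
OVERFLOW_ID = 65534  # assumption: kernel default overflow ID, commonly "nobody"


def clamp_id(ident, id_range=65536, fallback=OVERFLOW_ID):
    """Return ident if it is inside [0, id_range), otherwise the fallback.

    -1 is passed through unchanged because chown-family syscalls treat it
    as "leave this ID as it is".
    """
    if ident == -1:
        return -1
    if 0 <= ident < id_range:
        return ident
    return fallback


# The uid:gid pair from the failing fchownat call.
print(clamp_id(806984), clamp_id(89939))
```

A real fix would have to apply this inside a trapped chown syscall, which is where the cost discussed later in the thread comes from.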
@ctalledo Do you have any idea why this fails even before the container is created, at pull time? Is this because the new image was built with a new version of docker?
The image in question doesn't seem to be doing anything particularly special, it's just extracting the binary. But it's possible that images built with an older docker version work fine, and images built with a new one result in this chown happening at pull time, and thus failing?
Is this a bug in docker or runc somewhere? I imagine the former, since it happens at pull time?
Hi @jawnsy,
> Do you have any idea why this fails even before the container is created, at pull time?
During the pull, Docker extracts the layers that make up the image. It is during that extraction that we see the fchownat() syscall with the weird uid:gid (i.e., 806984:89939).
I don't know where these weird uid:gid values come from; they certainly look incorrect. I don't know if they come from the image layers themselves or if it's a bug in Docker's image extraction code. I suspect it's the former.
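One way to check whether the IDs come from the layers themselves would be to scan a layer tarball for entries whose recorded uid or gid falls outside the container's range. The Python sketch below does that with the standard tarfile module. The function name is hypothetical, and the demo archive is built in memory purely for illustration; it is not the real image layer.

```python
import io
import tarfile


def out_of_range_entries(tar_bytes, id_range=65536):
    """List (name, uid, gid) for tar members whose owner IDs are >= id_range."""
    found = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if member.uid >= id_range or member.gid >= id_range:
                found.append((member.name, member.uid, member.gid))
    return found


def build_demo_layer():
    """Build a tiny stand-in layer with one entry owned by 806984:89939."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        bad = tarfile.TarInfo("build/terraform-validator")
        bad.uid, bad.gid = 806984, 89939
        archive.addfile(bad)
        archive.addfile(tarfile.TarInfo("etc/hostname"))  # owned by 0:0
    return buffer.getvalue()


print(out_of_range_entries(build_demo_layer()))
```

Running the same function over the bytes of an actual layer (for example, one extracted from a `docker save` archive on the host) would show whether the tar headers carry these IDs.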
> Is this because the new image was built with a new version of docker?
Don't know.
> Is this a bug in docker or runc somewhere? I imagine the former, since it happens at pull time?
runc is not involved at this stage, so likely it's a problem in the image itself or in the Docker extraction code as I mentioned above.
> I think we can fix this in Sysbox, by ensuring that chowns that exceed 65535 are capped at the 65536 user ID (i.e., nobody).
It's possible to fix this in Sysbox, but it requires trapping the chown syscall (which Sysbox currently does for a different reason), and we've learned that this can result in bad performance when programs inside the Sysbox container do lots of chowns.
In addition, given that this problem appears to be specific to pulling the gcr.io/cloud-foundation-cicd/cft/developer-tools:1 image inside the Sysbox container (we have no other reports of such an error), I am wondering if it's worth fixing ...
I don't know whether the problem is with the image itself or with the tools used to build it, and if the latter, then other images might be affected too. I agree that it may not make sense to fix this, unless it turns out to be a more pervasive problem than just this single image. I'm fine with closing this issue out.
Thanks for taking a look!
Thanks @jawnsy. Let's keep the issue open in case someone else hits it, and in case we find a way to fix it without impacting performance. We won't attempt to fix it now though, unless, as you mentioned, it turns out to be a more pervasive problem (there is no evidence of that currently).
Closing as there is no action item to fix this (and in fact the problem is specific to a particular image).
I'm seeing this error, unsure whether the problem is in sysbox or something else. Sharing details here to triage.
The Google Project Factory Terraform module has a lint step, which runs a public container image to generate docs and do code formatting. This errors out, and I'm not too sure why.
Here's my docker system info output; note that I'm running docker under sysbox in Coder:
Available tags
Here are some of the visible tags:
Notes
docker run --rm -it gcr.io/cloud-foundation-cicd/cft/developer-tools:1.2
podman in userspace mode on my laptop:
Error: writing blob: adding layer with blob "sha256:cd161d4c1a089eaebfd0f869672c4d18d849997b8f1ce20887250ad61820844e": Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 806984:89939 for /build/terraform-validator): Check /etc/subuid and /etc/subgid: lchown /build/terraform-validator: invalid argument
So this may just be a general problem with this specific image when running in user namespace mode.