Thanks for filing the issue @Sid-Sun.
I haven't been able to reproduce it so far, but will try again on a different node with a similar kernel version -- I'm running 5.11 in my current setup:
rmolina@dev-vm1:$ docker run --runtime=sysbox-runc -it --rm --name test-issue-451 --hostname test-issue-451 issue-451
admin@test-issue-451:~$ docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:23 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:23:02 2019
OS/Arch: linux/amd64
Experimental: false
admin@test-issue-451:~$
admin@test-issue-451:~$ docker pull rrdockerhub/ros-base-melodic-amd64
Using default tag: latest
latest: Pulling from rrdockerhub/ros-base-melodic-amd64
5b7339215d1d: Pull complete
14ca88e9f672: Pull complete
a31c3b1caad4: Pull complete
b054a26005b7: Pull complete
2d951c0d117b: Pull complete
7a080df8aa29: Pull complete
af7dec1d4371: Pull complete
7ae51eab5aaf: Pull complete
0c1c60218d5d: Pull complete
c81272701b33: Pull complete
04c6d168e31d: Pull complete
51f578a54dfd: Pull complete
6323fa587f09: Pull complete
b4a3394e02f5: Pull complete
Digest: sha256:418c1c5d28c4829a22ace56f3e3b120fa57088bda99e678cb46afb7d598853ad
Status: Downloaded newer image for rrdockerhub/ros-base-melodic-amd64:latest
admin@test-issue-451:~$
Hi @Sid-Sun,
root@bionic-test:~# strace -o strace.txt docker pull rrdockerhub/ros-base-melodic-amd64
Please repeat this with the -f flag (i.e., "follow forks"):
strace -f -o strace.txt docker pull rrdockerhub/ros-base-melodic-amd64
then attach the resulting strace output.
Thanks!
@ctalledo Hey! Thanks for the quick response, here is the strace with forks followed.
@rodnymolina That's interesting... I will try to update the kernel on one of the nodes and try to reproduce it. Thanks :D
Thanks @Sid-Sun, that was helpful, though I still don't see the root cause.
Here is what I see: an epoll_ctl syscall failing on file-descriptor 5:
1424 openat(AT_FDCWD, "/lib/terminfo/x/xterm-256color", O_RDONLY|O_CLOEXEC) = 5
1424 epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=154258992, u64=140295260982832}} <unfinished ...>
1424 <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
1424 epoll_ctl(4, EPOLL_CTL_DEL, 5, 0xc42019331c <unfinished ...>
1424 <... epoll_ctl resumed> ) = -1 EPERM (Operation not permitted)
There are no other EPERM errors, so this must be the cause.
However, it's weird: the EPERM occurs very early in the strace (e.g., at lines 1058 and 1063), and the strace continues for much longer before the docker pull error shows up in line 25261: "failed to register layer: Error processing tar file...".
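(As a quick way to double-check that, here is a minimal Go sketch -- mine, not from the original report -- that prints every syscall line in the capture that failed with an errno; the strace.txt filename is the one from the command above.)

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Output file from "strace -f -o strace.txt docker pull ..."
	f, err := os.Open("strace.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // strace lines can get long
	for n := 1; sc.Scan(); n++ {
		// strace renders failed syscalls as "... = -1 EPERM (Operation not permitted)"
		if strings.Contains(sc.Text(), "= -1 E") {
			fmt.Printf("%d: %s\n", n, sc.Text())
		}
	}
}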
Why that epoll_ctl() returns EPERM is not clear to me. Per the man page for epoll_ctl(2), EPERM is only returned when the target file does not support polling (e.g., when it's a regular file or directory rather than a socket or pipe). That would be the file /lib/terminfo/x/xterm-256color inside the image, which is a regular file (as it is on my Ubuntu host).
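(To sanity-check that man-page behavior, here is a minimal Go sketch -- again mine, not from the thread -- that epoll-adds a regular file and gets the same EPERM; /etc/hostname is just a stand-in regular file. Linux-only, like the syscalls involved.)

package main

import (
	"fmt"
	"syscall"
)

func main() {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(epfd)

	// Any regular file reproduces it; the trace hit /lib/terminfo/x/xterm-256color.
	fd, err := syscall.Open("/etc/hostname", syscall.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(fd)}
	err = syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fd, &ev)
	fmt.Println(err) // prints "operation not permitted" (EPERM) for regular files
}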
With all of this, I suspect this is a Docker error (i.e., it's epoll-ing a regular file), and the fact that you don't see it with later versions of Docker would seem to add weight to this. I certainly can't see how it could be related to Sysbox.
Do you know if this occurs outside of a Sysbox container (e.g., when using the same version of Docker on a regular host)?
Also, it may be worthwhile getting the same strace with Docker 19.03 pulling the same image, so we can compare.
Finally, on a quick web search, it seems it may be related to this Stack Overflow post, tracked by this Moby issue. That would explain it since Sysbox containers are always rootless.
Hi @ctalledo,
I've run a few more tests to get some more data points; on both native (VMs) and sysbox environments.
What I have observed is:
- Docker 18 (and likely below) on native environments: the xterm polling error is present but the image pull succeeds
- Docker 18 in a sysbox environment with kernel 5.4: the xterm polling error is present and the image pull fails
- Docker 18 in a sysbox environment with kernel 5.11: the xterm polling error is present but the image pull succeeds
- Docker 19 and above: the image pull succeeds in all environments
It appears there is some sort of incompatibility between Docker 18, Sysbox, and kernel 5.4 - swapping out any one of these seems to resolve the incompatibility.
Note: Docker 18 is not available for focal out of the box; I had to use Docker's bionic archive to install it.
You can find the strace of the tests I ran here
Hi @Sid-Sun, thanks for the excellent data points.
Based on the fact that with Docker 18 the "polling xterm" error is always present, yet sometimes the pull fails and sometimes it doesn't, it seems to me that error is a red herring. This also gels with my observation that the error occurs much earlier in the strace than the failure point.
However, the weird thing is that there is no other EPERM error in the strace that would give us a better hint as to why the pull fails.
I think I'll need to repro on my side to dig down further.
Let me ask: how important is it that you use Docker 18? (Wondering how highly I need to prioritize this.)
Thanks again.
Hi, we've moved to a newer Docker version. I guess it would be best to note the potential incompatibility in the documentation until this is nailed down.
Thanks @Sid-Sun; I think this issue is probably the best way to document the problem (until we are able to better understand the underlying root cause).
Closing for now, please re-open if this occurs again.
When running docker 18.09.7 on Ubuntu Bionic, certain image pulls fail with:
failed to register layer: Error processing tar file(exit status 1): operation not permitted
The issue does not occur with Docker versions 19.03 or higher on Bionic. The issue also cannot be reproduced with all Docker images; only certain layers in some images seem to fail. We have been able to consistently reproduce it with rrdockerhub/ros-base-melodic-amd64:latest.
Might be similar to #254.
Environment Details:
Kernel: 5.4.0-1053-gke
Complete output:
strace: Posted here
When we initially identified the issue, strace had:
However, it might have been due to something else; we were using a different image at the time. Using a cleaner image directly from nestybox and only setting the specific Docker version gets rid of this.
Dockerfile used to create the image (same as nestybox's, modified to specify the Docker version):
Pod spec:
Thanks :D