The exit status 6 is a pretty ugly hack I added that allows us to figure out where inside this file your code is failing. An error from process_linux.go with "exit status 6" means that the 6th bail in that file was executed (in the version of runC you're running).
To cut a long story short, this is the code that is failing:
/*
* We must fork to actually enter the PID namespace, and use
* CLONE_PARENT so that the child init can have the right parent
* (the bootstrap process). Also so we don't need to forward the
* child's exit code or resend its death signal.
*/
childpid = clone_parent(env, config->cloneflags);
if (childpid < 0)
bail("unable to fork"); /* this is where exit status 6 comes from */
So, the big question is -- does your system support all of the namespaces that you're trying to use? What is the output of ls -la /proc/self/ns?
Ah, that helps explain the exit status, cheers.

What's odd here is that the failure was not consistent; sometimes the docker run command would work fine when we ran it manually, even though it failed under systemd. Later it seemed to fail with this symptom consistently.

The node degraded further and won't even let me ssh in now, so it's unfortunately hard to get more diagnostics from it. Another node which should be identically configured gives the following output:
# ls -la /proc/self/ns
total 0
dr-x--x--x. 2 root root 0 Oct 20 09:13 .
dr-xr-xr-x. 9 root root 0 Oct 20 09:13 ..
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 net -> net:[4026532028]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Oct 20 09:13 uts -> uts:[4026531838]
But that node does not seem to hit the same issue as the first one; all services seem to have their containers start up fine.

I'll attach the info from /proc/self/ns from a node with this issue if it pops up again. Feel free to close this bug or leave it open for others to chip in if they also hit the same symptom (I couldn't find anything on Google by searching for the symptoms myself); your call.
@hkjn Actually, the best thing would be for you to attach an strace -f of runc when the issue occurs. Though, since you're using Docker, this might prove difficult (and it will have very large performance effects that aren't favourable). If you can reproduce having a node like that again, please try running any runC container set up without Docker on that machine with strace -f runc run ... to see what breaks. Thanks.
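For example, something like this (a sketch; the bundle setup is shown further down in this thread, and the container name and output path are arbitrary):
% strace -f -o /tmp/runc.strace runc run -b bundle test-container
% grep -E 'clone|unshare|setns' /tmp/runc.strace | tail
The last failing namespace-related syscall and its errno should tell us exactly which bail was hit and why.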
@cyphar When I run nested runc (runc inside runc), I'm getting the below error:
nsenter: unable to fork: Operation not permitted
container_linux.go:247: starting container process caused "process_linux.go:245: running exec setns process for init caused \"exit status 6\""
This may not be the right use case; I just thought I'd test it out.
@rajasec That's because you're trying to unshare namespaces you don't have the right to unshare. You'll have to take a look at the kernel code to figure out precisely what's happening (if you're trying to run runc from inside a chroot it's not going to work, for example).
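A quick way to check what your environment allows (a general sketch; the config path assumes your distro ships /boot/config-*, and the last sysctl is Debian/Ubuntu-specific and may not exist on your kernel):
% ls /proc/self/ns
% grep -E 'CONFIG_(USER|PID|NET|IPC|UTS)_NS' /boot/config-$(uname -r)
% cat /proc/sys/kernel/unprivileged_userns_clone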
+1, I have this error and don't use runC for anything directly (though it might be used inside Mono). It also happens intermittently, but mostly when the machine is tight on resources / overloaded.

Any other tips for debugging the root cause if I'm not using runC?
I have this error with docker (I assume docker-runc?). Not sure how I would debug it. Give me something to type and I'll type it?
Some information that would be useful from anyone else who comments on this issue: does the issue happen with runc by itself -- outside of Docker? Read the README for information on how to start up a simple container.

No user namespaces. SELinux is enabled & permissive. Don't have "runc"; I have "docker-runc" which says it's 1.0.0-rc2. Is that runc? CentOS 7.2: 3.10.0-327.36.2.el7.x86_64 #1 SMP Mon Oct 10 23:08:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
I'll have to tool around with it. I don't get a container when following the runc README. Doing something daft, I expect.
@jamiethermo docker-runc is just what Docker calls its packaged version of runc.
You can create a container like this:
% mkdir -p bundle/rootfs
% docker create --name a_new_rootfs alpine:latest true
% docker export a_new_rootfs | tar xvfC - bundle/rootfs
% runc spec -b bundle
% runc run -b bundle container
/ # # This is inside the container now.
Does that help?
Ok. That works.
Alright, it would help to know what config.json the container is being started with (under Docker). Unfortunately, Docker won't save the config.json if the container creation fails. You could try doing something like this:
% cat >/tmp/dodgy-runtime.sh <<EOF
#!/bin/sh
cat config.json >>/tmp/dodgy-runtime.log
exit 1
EOF
% chmod +x /tmp/dodgy-runtime.sh
% docker daemon --add-runtime="dodgy=/tmp/dodgy-runtime.sh" --default-runtime=dodgy
Then try to start a container. It will fail, but you should be able to get the config.json from /tmp/dodgy-runtime.log. You can then modify it so that the rootfs entry is equal to the string "rootfs", and then replace bundle/config.json in my previous comment with the old file.

Then runC should fail to start. Paste the config you got here.
Ok. Can't do that right now. But since it seems arbitrary what runs and what fails (the same docker image will run one minute and not the next), here's a config file that did get created. Don't know if that'll help. Will try the hack above tomorrow. Thanks! config.json.zip
For people who get "exit status x", you can check out the runc code you are using, then:
# cd libcontainer/nsenter
# gcc -E nsexec.c -o nsexec.i
Then you can find out which bail you hit from nsexec.i.
It's ugly though, we should improve it someday.
@hqhq Or you can count from the start of the file (which is what I do). Vim even has a shortcut for it. But yes, the bail(...) code was a hack to get around the fact that we aren't writing our errors to the error pipe in nsexec -- the only information we get is the return code. :P
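For example (a sketch, assuming a runc source checkout matching the version of your binary; the grep -v filters out the bail macro definition itself), the Nth match corresponds to "exit status N":
% grep -n 'bail(' libcontainer/nsenter/nsexec.c | grep -v define | sed -n '6p'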
@cyphar Could I replace docker-runc with a bash script that saves off the config.json somewhere if it crashes? Could we make runc do that by default?
> Could I replace docker-runc with a bash script that saves off the config.json somewhere if it crashes?
You could try that. By the way, if you haven't created an upstream bug report (in Docker) please do so.
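A rough sketch of such a wrapper (hypothetical paths and names; like the dodgy-runtime hack above, it assumes the runtime is invoked from the bundle directory, and it saves the config on every invocation rather than only on crashes):
% cat >/usr/local/bin/logging-runc <<'EOF'
#!/bin/sh
# save the bundle's config.json, then hand off to the real runtime
cp config.json "/tmp/runc-config-$$.json" 2>/dev/null
exec docker-runc "$@"
EOF
% chmod +x /usr/local/bin/logging-runc
% docker daemon --add-runtime="logging=/usr/local/bin/logging-runc" --default-runtime=logging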
> Could we make runc do that by default?
I don't want to, mainly because it'd only be helpful for debugging things in certain cases under Docker. And runC is not just used inside Docker.
ECS team thinks this issue is causing their agent to disconnect at times. Referenced https://github.com/aws/amazon-ecs-agent/issues/658#issuecomment-271752302
I "fixed" by upgrading from Ubuntu 15.04 -> 16.04. It might be a bug in an old version that is no longer maintained.
Hm, might have to try that.
@cyphar Is there a workaround for this, besides upgrading to Ubuntu 16?
@jamesongithub It's likely that issues of this form are kernel issues (and since Ubuntu has interesting kernel policies, upgrading might be your only option), unless you have some very odd configurations. As I mentioned above, the error only tells us what line inside libcontainer/nsenter/nsexec.c failed (and unshare can fail for a wide variety of reasons).
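If you're on an SELinux system, it's also worth checking for denials or kernel messages around the time of the failure (a general debugging sketch; ausearch requires auditd to be running):
% dmesg | tail -20
% ausearch -m avc -ts recent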
I've been having this issue with RHEL 7.3 too:
SELINUX=enforcing
SELINUXTYPE=targeted
Besides being inexperienced with stuff like namespaces and runc, I'm struggling to figure out what's going on, because it's intermittent, as mentioned by @jamesongithub.
ls -la /proc/self/ns shows the same results as @hkjn.
@cyphar @rhatdan Same issue on RHEL 7.4, but the exit status is 40. User namespaces are enabled as per this doc: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/getting_started_with_containers/get_started_with_docker_formatted_container_images#user_namespaces_options. On the latest available kernel.
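One thing worth checking on RHEL (a sketch; the sysctl was backported to the RHEL 7 kernel and defaults to 0 there, which disables user namespaces) is whether user namespaces are actually permitted:
% cat /proc/sys/user/max_user_namespaces
If it prints 0, raise it, e.g. (the value is just an example):
% echo "user.max_user_namespaces=15000" >> /etc/sysctl.conf && sysctl -p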
For anyone having issues with RHEL: only enable namespace.unpriv_enable=1, and not user_namespace.enable=1. Having both on the kernel command line causes issues:
[ec2-user@ip-10-16-1-55 mycontainer]$ cat /proc/cmdline | grep "namespace.unpriv_enable=1"
BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.11.6.el7.x86_64 root=UUID=de4def96-ff72-4eb9-ad5e-0847257d1866 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 crashkernel=auto LANG=en_US.UTF-8 namespace.unpriv_enable=1
[ec2-user@ip-10-16-1-55 mycontainer]$ runc --root /tmp/runc run --no-pivot --no-new-keyring mycontainerid
/ #
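For reference, adding and removing those kernel arguments could look like this (a sketch, assuming a RHEL 7 host whose boot entries are managed with grubby):
% sudo grubby --update-kernel=ALL --args="namespace.unpriv_enable=1" --remove-args="user_namespace.enable=1"
% sudo reboot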
I came here from Google for a similar error. Turns out I was trying to use the VOLUME directive in my Dockerfile like this:
VOLUME . /src
thinking I could mount the current directory from the host as a volume like that, but that's not how it works. You have to, instead, do this:
VOLUME /src
followed by:
docker run -v /absolute/path/to/directory/on/host:/src <rest of your docker run command>
Note also (and somewhat unrelated) that I was getting similar errors on Fedora simply related to SELinux. And while I don't recommend doing the following for security reasons (see: http://stopdisablingselinux.com/), it did work for me:
sudo setenforce 0
sudo systemctl restart docker
docker build -t image .
docker run image
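If the underlying failure is SELinux blocking access to bind-mounted content, a less invasive alternative is to keep SELinux enforcing and let Docker relabel the volume with the :z (shared) or :Z (private) mount option:
% docker run -v /path/on/host:/src:Z image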
I hit the same problem when I build and start an image.
Sending build context to Docker daemon 220 MB
Step 1 : FROM warpdrive:tos-release-1-5
---> 769306738d96
Step 2 : COPY . /go/src/github.com/transwarp/warpdrive/
---> 07c99697b16e
Removing intermediate container 127c0e71a84b
Successfully built 07c99697b16e
/usr/bin/docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n".
FATA[0301] exit status 125
make: *** [build] Error 1
Then I cleaned up a lot of images and containers and freed the caches, and the problem disappeared. But I don't think it is a cache problem, because the change in cache usage was tiny.
Seems related to: https://forums.docker.com/t/centos7-docker-hello-world-fails/68941/3
It is a bug in the kernel (3.10.0-327); try updating your kernel version.
Hi OCI folks,

We are seeing a failure to start Docker containers through runc, seemingly from this line:

This might well be a config or system issue (we're on somewhat old kernel versions because CentOS..), but the logs don't give much to go on here.

The man pages for setns define the error codes it should return:

But if the following page can be trusted, exit status 6 should be ENXIO, which is not mentioned in the man pages:

Any suggestions for how to debug further or what to check would be appreciated, thanks in advance!
Logs
System info