cyphar opened this issue 8 years ago
Alright, after staring at strace logs and /proc/<pid>/task/*/status for a couple of hours, I think I know what's going on. In the strace logs, the following call is what we stall on (in the runc init process):
[pid 3753] execve("/usr/bin/ls", ["ls", "-la", "/proc/self/fd/"], [/* 3 vars */] <unfinished ...>
Now, the semantics of execve are "fun" for multithreaded processes. Basically, only a single-threaded process can actually go through an execve, so when one thread calls execve, the kernel is meant to kill every other task in the thread group. But that's not happening here. Why? Because for some reason all of the other tasks in this thread group appear to be zombies, and if zombie movies have taught us anything, you can't kill a zombie. This only happens sometimes and I'm not really sure why:
% grep -i state /proc/<pid>/task/*/status
State: D (disk sleep)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
But this only happens with the console changes, so clearly something in the process management changed to cause this. I'll have to look into it some more.
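As an aside, here is a small C sketch (my own illustration, nothing from runc; the file name threads-across-exec.c is just a placeholder) of the normal semantics: a child spawns a few extra threads and then calls execve, and the parent watches /proc/<pid>/task collapse from several tasks down to one, because the kernel reaps every sibling thread before the exec can complete.

/* threads-across-exec.c: observe a thread group collapsing to a single task
 * across execve (illustrative sketch only).
 * Build: cc -pthread -o threads-across-exec threads-across-exec.c
 */
#include <dirent.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int count_tasks(pid_t pid)
{
	char path[64];
	struct dirent *ent;
	DIR *dir;
	int n = 0;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL)
		if (ent->d_name[0] != '.')
			n++;
	closedir(dir);
	return n;
}

static void *idle_thread(void *arg)
{
	for (;;)
		pause();
	return NULL;
}

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		/* Child: spawn extra threads, then exec. The kernel must kill
		 * every sibling task before the execve can complete. */
		pthread_t tid;
		int i;

		for (i = 0; i < 4; i++)
			pthread_create(&tid, NULL, idle_thread, NULL);
		sleep(1);
		execlp("sleep", "sleep", "5", (char *)NULL);
		_exit(1);
	}

	/* Parent: count the child's tasks before and after its execve. */
	usleep(500 * 1000);
	printf("tasks before exec: %d\n", count_tasks(child));
	sleep(1);
	printf("tasks after exec:  %d\n", count_tasks(child));

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}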
Okay, this is a bit worrying. It also happens with 0.1.1. Here's my config.json:
{
  "ociVersion": "1.0.0-rc1",
  "platform": {
    "os": "linux",
    "arch": "amd64"
  },
  "process": {
    "terminal": true,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "ls",
      "-la",
      "/proc/self/fd/"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": [
      "CAP_AUDIT_WRITE",
      "CAP_KILL",
      "CAP_NET_BIND_SERVICE"
    ],
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 1024,
        "soft": 1024
      }
    ],
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "runc",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "none",
      "source": "/sys",
      "options": [
        "rbind",
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    }
  ],
  "hooks": {},
  "linux": {
    "uidMappings": [
      {
        "hostID": 1000,
        "containerID": 0,
        "size": 1
      }
    ],
    "gidMappings": [
      {
        "hostID": 100,
        "containerID": 0,
        "size": 1
      }
    ],
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      }
    ],
    "maskedPaths": [
      "/proc/kcore",
      "/proc/latency_stats",
      "/proc/timer_stats",
      "/proc/sched_debug"
    ],
    "readonlyPaths": [
      "/proc/asound",
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  },
  "solaris": {
    "cappedCPU": {},
    "cappedMemory": {}
  }
}
This is actually a kernel issue. A kernel hacker at SUSE has come up with the following patch. Assuming upstream says okay, this will be backported to stable kernels. I'm testing this right now to see if it fixes the runC issue (the test case I gave him was a regular Go program):
From fe82d463fd2ef1585d2c37bf9fa6a1761e6ee0e5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 22 Sep 2016 10:09:34 +0200
Subject: [PATCH] signal: always signal tracer from the ptraced task
Aleksa has reported the following lockup when stracing the following go
program
% cat exec.go
package main

import (
	"os"
	"syscall"
)

func main() {
	syscall.Exec("/bin/echo", []string{"/bin/echo", "Hello"}, os.Environ())
}
$ go version
go version go1.6.3 linux/amd64
$ go build -o exec exec.go
$ strace -f ./exec
[snip]
[pid 10349] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 10346] <... select resumed> ) = 0 (Timeout)
[pid 10346] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10345] execve("/bin/echo", ["/bin/echo", "Hello"], [/* 95 vars */] <unfinished ...>
[pid 10346] <... select resumed> ) = 0 (Timeout)
[pid 10349] <... select resumed> ) = 0 (Timeout)
execve will never finish unless the strace process gets killed with
SIGKILL. The reason is the following deadlock:

tracer                                  thread_A                         thread_$N
SyS_process_vm_readv                    SyS_execve                       do_exit
                                        do_execveat_common               exit_notify
                                        prepare_bprm_creds               do_notify_parent
                                        mutex_lock(cred_guard_mutex)     __group_send_sig_info
                                        search_binary_handler            send_signal
                                        load_elf_binary                  prepare_signal -> fails, SIGCHLD is SIG_DFL
                                        flush_old_exec
                                        # wait for sig->notify_count
process_vm_rw
process_vm_rw_core
mm_access
mutex_lock_killable(cred_guard_mutex)
So there seem to be two issues here. The first is that the exiting
threads (exiting because of the ongoing exec) do not send the SIGCHLD
signal to the tracer; instead they rely on the tracer to reap them and call
release_task->__exit_signal, which in turn would wake up thread_A and
let the exec move on. The other part of the story is that the tracer
is not sitting in do_wait but rather calls process_vm_readv (presumably to
get the arguments of the syscall), and it waits for the lock in a killable
rather than an interruptible sleep.
The fix is therefore twofold. We want to teach mm_access to sleep in an
interruptible sleep, and we want to make sure that the traced child
sends the signal to the parent even when it is ignored or SIG_DFL.
sig_ignored already seems to be doing something along those lines, except
it doesn't handle the case where a traced child sends a signal to its tracer.
Fix this by checking the current ptrace status and whether the target
task is the tracer.
Reported-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
kernel/fork.c | 2 +-
kernel/signal.c | 4 ++++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
- err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
if (err)
return ERR_PTR(err);
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..5c8b84b76f0b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
if (!sig_task_ignored(t, sig, force))
return 0;
+ /* Do not ignore signals sent from child to the parent */
+ if (current->ptrace && current->parent == t)
+ return 0;
+
/*
* Tracers may want to know about even ignored signals.
*/
--
2.9.3
EDIT: I've tested the above kernel patch and it fixes it for me. Does this also work for you @mikebrow?
It turns out the above patch actually "breaks" strace and only happens to work by accident, and the maintainer appears to believe that fixing this properly is non-trivial. Personally, I'm going to keep applying the patch locally, since it does fix the problem for me.
@cyphar have you thought about opening an issue in the Golang tracker to see if they can do something on their end to fix this?
@marcosnils It's a kernel issue, not a Golang issue. You can reproduce it with pthreads -- it happens with any multithreaded program that calls execve under strace -f. Apparently this is a known issue upstream, but the solution is currently non-trivial (it requires a bunch of rewrites of some security hooks). I was looking into it, but to be honest it's all a bit over my head (the signalling semantics are super weird).
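For reference, a minimal sketch of such a pthreads reproducer (my own illustration, not taken from the kernel report; the file name exec-stall.c is just a placeholder) could look like the following. On an unpatched kernel, running it under strace -f should hang at the execve the same way the Go program does.

/* exec-stall.c: multithreaded execve under strace -f (illustrative sketch).
 * Build: cc -pthread -o exec-stall exec-stall.c
 * Run:   strace -f ./exec-stall
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *spin(void *arg)
{
	/* Keep a sibling thread alive so the execve below has to tear down
	 * a multithreaded thread group. */
	for (;;)
		pause();
	return NULL;
}

int main(void)
{
	pthread_t tid;

	if (pthread_create(&tid, NULL, spin, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}
	sleep(1);	/* let the thread start before we exec */

	execlp("echo", "echo", "Hello", (char *)NULL);
	perror("execlp");
	return 1;
}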
Believe it or not, this was apparently fixed on the kernel side. Using sudo strace -f ./runc run 123 no longer stalls (using a simple config.json which calls echo 123). I haven't checked whether the strace log is complete or has some omissions, but it no longer stalls.
If you run runc under strace, it will stall randomly. The stall always happens at the execve of the runc init process. This was discovered while doing #1018; I will copy-paste my discussion from there here.
As far as I can tell, this happens on both master and cc29e3dded8e27ba8f65738f40d251c885030a28. It appears to be related to the exec fifo, but I'll test it out further now.
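For anyone unfamiliar with the exec fifo: as I understand it, runc init blocks on a fifo until the start path releases it, and only then execs the user process (the multithreading that triggers the stall comes from the Go runtime). Below is a very loose, single-threaded C sketch of that block-on-fifo-then-exec pattern, purely for illustration -- it is not runc's actual code, and the path /tmp/exec.fifo is made up.

/* fifo-then-exec.c: rough sketch of the "block on a fifo, then exec" pattern
 * (illustration only -- NOT runc's actual init code).
 * In one terminal:  ./fifo-then-exec
 * In another:       cat /tmp/exec.fifo   (releases the blocked process)
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	const char *fifo = "/tmp/exec.fifo";	/* made-up path */
	int fd;

	/* Errors such as EEXIST are ignored for the purposes of this sketch. */
	mkfifo(fifo, 0600);

	/* Opening a fifo write-only blocks until someone opens the read side:
	 * the "wait until released" part of the pattern. */
	fd = open(fifo, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	write(fd, "0", 1);
	close(fd);

	/* Only once released do we exec the payload. In the real runc init it
	 * is the analogous execve where the stall shows up; this sketch is
	 * single-threaded and will not stall on its own. */
	execlp("echo", "echo", "released and exec'd", (char *)NULL);
	perror("execlp");
	return 1;
}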