cyphar opened this issue 8 years ago
Alright, after staring at strace logs and /proc/<pid>/task/*/status for a couple of hours, I think I know what's going on. In the strace logs, the following call is what we stall on (in the runc init process):
[pid 3753] execve("/usr/bin/ls", ["ls", "-la", "/proc/self/fd/"], [/* 3 vars */] <unfinished ...>
Now, the semantics of execve are "fun" for multithreaded processes. Basically, only a single-threaded process can actually go through an execve, so when one thread calls execve, the kernel is meant to kill every other task in the thread group. But that's not happening here. Why? Because for some reason all of the other tasks in this thread group appear to be zombies, and if zombie movies have taught us anything, you can't kill a zombie. This only happens sometimes and I'm not really sure why:
% grep -i state /proc/<pid>/task/*/status
State: D (disk sleep)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
State: Z (zombie)
But this only happens with the console changes, so clearly something in the process management changed to cause this. I'll have to look into it some more.
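As an aside, here is a small C sketch (my own illustration, nothing from runc; the file name threads-across-exec.c is just a placeholder) of the normal semantics: a child spawns a few extra threads and then calls execve, and the parent watches /proc/<pid>/task collapse from several tasks down to one, because the kernel reaps every sibling thread before the exec can complete.

/* threads-across-exec.c: observe a thread group collapsing to a single task
 * across execve (illustrative sketch only).
 * Build: cc -pthread -o threads-across-exec threads-across-exec.c
 */
#include <dirent.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int count_tasks(pid_t pid)
{
	char path[64];
	struct dirent *ent;
	DIR *dir;
	int n = 0;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL)
		if (ent->d_name[0] != '.')
			n++;
	closedir(dir);
	return n;
}

static void *idle_thread(void *arg)
{
	for (;;)
		pause();
	return NULL;
}

int main(void)
{
	pid_t child = fork();

	if (child == 0) {
		/* Child: spawn extra threads, then exec. The kernel must kill
		 * every sibling task before the execve can complete. */
		pthread_t tid;
		int i;

		for (i = 0; i < 4; i++)
			pthread_create(&tid, NULL, idle_thread, NULL);
		sleep(1);
		execlp("sleep", "sleep", "5", (char *)NULL);
		_exit(1);
	}

	/* Parent: count the child's tasks before and after its execve. */
	usleep(500 * 1000);
	printf("tasks before exec: %d\n", count_tasks(child));
	sleep(1);
	printf("tasks after exec:  %d\n", count_tasks(child));

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}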
Okay, this is a bit worrying. It also happens with 0.1.1. Here's my config.json:
{
  "ociVersion": "1.0.0-rc1",
  "platform": {
    "os": "linux",
    "arch": "amd64"
  },
  "process": {
    "terminal": true,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "ls",
      "-la",
      "/proc/self/fd/"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": [
      "CAP_AUDIT_WRITE",
      "CAP_KILL",
      "CAP_NET_BIND_SERVICE"
    ],
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 1024,
        "soft": 1024
      }
    ],
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": true
  },
  "hostname": "runc",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "none",
      "source": "/sys",
      "options": [
        "rbind",
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    }
  ],
  "hooks": {},
  "linux": {
    "uidMappings": [
      {
        "hostID": 1000,
        "containerID": 0,
        "size": 1
      }
    ],
    "gidMappings": [
      {
        "hostID": 100,
        "containerID": 0,
        "size": 1
      }
    ],
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      }
    ],
    "maskedPaths": [
      "/proc/kcore",
      "/proc/latency_stats",
      "/proc/timer_stats",
      "/proc/sched_debug"
    ],
    "readonlyPaths": [
      "/proc/asound",
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  },
  "solaris": {
    "cappedCPU": {},
    "cappedMemory": {}
  }
}
This is actually a kernel issue. A kernel hacker at SUSE has come up with the following patch. Assuming upstream says okay, this will be backported to stable kernels. I'm testing this right now to see if it fixes the runC issue (the test case I gave him was a regular Go program):
From fe82d463fd2ef1585d2c37bf9fa6a1761e6ee0e5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 22 Sep 2016 10:09:34 +0200
Subject: [PATCH] signal: always signal tracer from the ptraced task
Aleksa has reported the following lockup when stracing the following go
program
% cat exec.go
package main

import (
	"os"
	"syscall"
)

func main() {
	syscall.Exec("/bin/echo", []string{"/bin/echo", "Hello"}, os.Environ())
}
$ go version
go version go1.6.3 linux/amd64
$ go build -o exec exec.go
$ strace -f ./exec
[snip]
[pid 10349] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 10346] <... select resumed> ) = 0 (Timeout)
[pid 10346] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10345] execve("/bin/echo", ["/bin/echo", "Hello"], [/* 95 vars */] <unfinished ...>
[pid 10346] <... select resumed> ) = 0 (Timeout)
[pid 10349] <... select resumed> ) = 0 (Timeout)
execve will never finish unless the strace process gets killed with
SIGKILL. The reason is the following deadlock:

tracer                                  thread_A                         thread_$N
SyS_process_vm_readv                    SyS_execve                       do_exit
                                        do_execveat_common               exit_notify
                                        prepare_bprm_creds               do_notify_parent
                                        mutex_lock(cred_guard_mutex)     __group_send_sig_info
                                        search_binary_handler            send_signal
                                        load_elf_binary                  prepare_signal -> fails, SIGCHLD is SIG_DFL
                                        flush_old_exec
                                        # wait for sig->notify_count
process_vm_rw
process_vm_rw_core
mm_access
mutex_lock_killable(cred_guard_mutex)
So there seem to be two issues here. The first is that the exiting
threads (exiting because of the ongoing exec) do not send the SIGCHLD
signal to the tracer; instead they rely on the tracer to reap them and call
release_task->__exit_signal, which in turn would wake up thread_A and
let the exec move on. The other part of the story is that the tracer
is not sitting in do_wait but rather calls process_vm_readv (presumably to
get the arguments of the syscall), and it waits for the lock in a killable
rather than an interruptible sleep.
The fix is therefore twofold. We want to teach mm_access to sleep in an
interruptible sleep, and we want to make sure that the traced child
sends the signal to the parent even when it is ignored or SIG_DFL.
sig_ignored already seems to be doing something along those lines, except
it doesn't handle the case where a traced child sends a signal to its tracer.
Fix this by checking the current ptrace status and whether the target
task is the tracer.
Reported-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
kernel/fork.c | 2 +-
kernel/signal.c | 4 ++++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
- err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
if (err)
return ERR_PTR(err);
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..5c8b84b76f0b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
if (!sig_task_ignored(t, sig, force))
return 0;
+ /* Do not ignore signals sent from child to the parent */
+ if (current->ptrace && current->parent == t)
+ return 0;
+
/*
* Tracers may want to know about even ignored signals.
*/
--
2.9.3
EDIT: I've tested the above kernel patch and it fixes it for me. Does this also work for you @mikebrow?
It turns out the above patch actually "breaks" strace and only happens to work by accident, and the maintainer appears to believe that fixing this properly is non-trivial. Personally, I'm going to keep applying the patch locally, since it does fix the problem for me.
@cyphar have you thought about opening an issue in the Golang tracker to see if they can do something on their end to fix this?
@marcosnils It's a kernel issue, not a Golang issue. You can reproduce it with pthreads -- it happens with any multithreaded program that calls execve under strace -f. Apparently this is a known issue upstream, but the solution is currently non-trivial (it requires a bunch of rewrites of some security hooks). I was looking into it, but to be honest it's all a bit over my head (the signalling semantics are super weird).
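For reference, a minimal sketch of such a pthreads reproducer (my own illustration, not taken from the kernel report; the file name exec-stall.c is just a placeholder) could look like the following. On an unpatched kernel, running it under strace -f should hang at the execve the same way the Go program does.

/* exec-stall.c: multithreaded execve under strace -f (illustrative sketch).
 * Build: cc -pthread -o exec-stall exec-stall.c
 * Run:   strace -f ./exec-stall
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *spin(void *arg)
{
	/* Keep a sibling thread alive so the execve below has to tear down
	 * a multithreaded thread group. */
	for (;;)
		pause();
	return NULL;
}

int main(void)
{
	pthread_t tid;

	if (pthread_create(&tid, NULL, spin, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}
	sleep(1);	/* let the thread start before we exec */

	execlp("echo", "echo", "Hello", (char *)NULL);
	perror("execlp");
	return 1;
}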
Believe it or not, this was apparently fixed on the kernel side. Using sudo strace -f ./runc run 123 no longer stalls (using a simple config.json which calls echo 123). I haven't checked whether the strace log is complete or has some omissions, but it no longer stalls.
If you run runc under strace, it will stall randomly. The stall always happens at the execve of the runc init process. This was discovered while doing #1018; I will copy-paste my discussion from there here.
As far as I can tell, this happens on both master and cc29e3dded8e27ba8f65738f40d251c885030a28. It appears to be related to the exec fifo, but I'll test it out further now.
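For anyone unfamiliar with the exec fifo: as I understand it, runc init blocks on a fifo until the start path releases it, and only then execs the user process (the multithreading that triggers the stall comes from the Go runtime). Below is a very loose, single-threaded C sketch of that block-on-fifo-then-exec pattern, purely for illustration -- it is not runc's actual code, and the path /tmp/exec.fifo is made up.

/* fifo-then-exec.c: rough sketch of the "block on a fifo, then exec" pattern
 * (illustration only -- NOT runc's actual init code).
 * In one terminal:  ./fifo-then-exec
 * In another:       cat /tmp/exec.fifo   (releases the blocked process)
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	const char *fifo = "/tmp/exec.fifo";	/* made-up path */
	int fd;

	/* Errors such as EEXIST are ignored for the purposes of this sketch. */
	mkfifo(fifo, 0600);

	/* Opening a fifo write-only blocks until someone opens the read side:
	 * the "wait until released" part of the pattern. */
	fd = open(fifo, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	write(fd, "0", 1);
	close(fd);

	/* Only once released do we exec the payload. In the real runc init it
	 * is the analogous execve where the stall shows up; this sketch is
	 * single-threaded and will not stall on its own. */
	execlp("echo", "echo", "released and exec'd", (char *)NULL);
	perror("execlp");
	return 1;
}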