In a runv containerd setup we noticed that docker run sometimes fails with:
docker: Error response from daemon: StartPod fail
The runv error message is:
qemu_process.go:142 waitid: no child processes
The problem is that, although docker has been told that the run command failed, it actually succeeded. The QEMU process has started successfully and keeps running, but it is not visible to docker.
The root cause seems to be the global signal handler for SIGCHLD: https://github.com/hyperhq/runv/blob/master/containerd/containerd.go#L143
In the error case this handler catches the SIGCHLD from the QEMU process before cmd.Wait() runs in qemu_process.go. The signal handler calls osutils.Reap(), which then calls Wait4 for the QEMU process. After that, there is no child left to wait for. Later cmd.Wait() fails with "waitid: no child processes".
That error message is correct: there is no child left to wait for. The failing call is here: https://github.com/hyperhq/runv/blob/master/hypervisor/qemu/qemu_process.go#L142
cmd.Run() is actually just a wrapper that calls cmd.Start() followed by cmd.Wait(): https://golang.org/src/os/exec/exec.go#L266
The error happens rarely, but you can add some debug code to trigger it on every run by splitting cmd.Run() into cmd.Start() and cmd.Wait() with a short delay in between.
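
The race can also be shown outside runv with a small standalone program. The following is only a sketch under assumptions, not the runv code: reapAll, startThenWait and reproduce are hypothetical names, reapAll merely stands in for whatever osutils.Reap() does, and the short-lived "true" command stands in for the QEMU launcher. It assumes a Unix-like system.

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

// reapAll stands in for a global reaper (hypothetical, not runv's code):
// it collects every exited child of this process without blocking.
func reapAll() {
	for {
		var ws syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
		if err != nil || pid <= 0 {
			return // no more exited children (or none at all)
		}
	}
}

// startThenWait is cmd.Run() split into cmd.Start() and cmd.Wait(), with a
// delay in between so that the SIGCHLD handler gets to run first.
func startThenWait(cmd *exec.Cmd, delay time.Duration) error {
	if err := cmd.Start(); err != nil {
		return err
	}
	time.Sleep(delay)
	return cmd.Wait()
}

// reproduce installs a process-wide SIGCHLD handler, runs a short-lived
// child and returns whatever cmd.Wait() reports.
func reproduce() error {
	sigs := make(chan os.Signal, 16)
	signal.Notify(sigs, syscall.SIGCHLD)
	defer func() {
		signal.Stop(sigs) // no more deliveries after this returns
		close(sigs)       // lets the reaper goroutine exit
	}()

	go func() {
		for range sigs {
			reapAll()
		}
	}()

	// "true" exits immediately; it is an assumption standing in for QEMU.
	return startThenWait(exec.Command("true"), 200*time.Millisecond)
}
```

Calling reproduce() from a main function and printing the result should show cmd.Wait() returning an error, because the handler has already collected the child's exit status by the time cmd.Wait() asks for it. Without the delay the outcome depends on which side wins the race, which matches how rarely the failure shows up in practice.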