syscall: Go can leak a forked process if main thread exits before spawn finished

jacobvosmaer commented 5 years ago

Please answer these questions before submitting your issue. Thanks!

What did you do?

If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.

https://play.golang.org/p/1HPscvoAiwj

What did you expect to see?

If the main thread exits Go should not leave behind a forked process.

What did you see instead?

A forked process, sometimes using 100% CPU

Does this issue reproduce with the latest release (go1.12.7)?

Yes.

System details

go version go1.12.6 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/jacobvosmaer/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/jacobvosmaer/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
GOROOT/bin/go version: go version go1.12.6 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.12.6
uname -v: Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64
ProductName:    Mac OS X
ProductVersion: 10.14.6
BuildVersion:   18G87
lldb --version: lldb-1100.0.25.3
Apple Swift version 5.1 (swiftlang-1100.0.38.29 clang-1100.0.20.14)

jacobvosmaer commented 5 years ago

My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead.

jacobvosmaer commented 5 years ago

There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

agnivade commented 5 years ago

Can't reproduce on linux.

If the parent is gone the child should exit and do nothing.

I don't believe that is necessarily correct. If a parent exits, the child becomes a zombie and attaches to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely.

@randall77 for macOS.

jacobvosmaer commented 5 years ago

I also cannot reproduce on linux.

I understand this issue is about cleaning up the child before the spawn has happened completely.

Yes. In my toy example, I'm trying to spawn /bin/sh which will exit harmlessly on its own accord, even if its parent is gone. The problem is that when this bug strikes, we never get to the exec which creates the /bin/sh process.

I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below.

mystery fork A, 0% CPU

``` * thread #1, stop reason = signal SIGSTOP * frame #0: 0x00007fff7eaee86a libsystem_kernel.dylib`__psynch_cvwait + 10 frame #1: 0x00007fff7ebad56e libsystem_pthread.dylib`_pthread_cond_wait + 722 frame #2: 0x00007fff7d1e6467 libobjc.A.dylib`monitor_tt::wait() + 19 frame #3: 0x00007fff7d1eb462 libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 132 frame #4: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79 frame #5: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79 frame #6: 0x00007fff7d1ec62f libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt&, bool) + 187 frame #7: 0x00007fff7d1db690 libobjc.A.dylib`lookUpImpOrForward + 228 frame #8: 0x00007fff7d1db114 libobjc.A.dylib`_objc_msgSend_uncached + 68 frame #9: 0x00007fff7ebfafa1 libxpc.dylib`xpc_atfork_child + 125 frame #10: 0x00007fff7b99eb18 libSystem.B.dylib`libSystem_atfork_child + 54 frame #11: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40 frame #12: 0x00000000010603ff gitaly`runtime.syscall + 31 frame #13: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112 frame #14: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624 frame #15: 0x000000000104d40e gitaly`syscall.rawSyscall + 78 frame #16: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226 frame #17: 0x00000000010a31a0 gitaly`syscall.forkExec + 864 frame #18: 0x00000000010c4a12 gitaly`os.startProcess + 610 frame #19: 0x00000000010c449c gitaly`os.StartProcess + 124 frame #20: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180 frame #21: 0x00000000016805a3 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435 ```

mystery fork B, 100% CPU

``` * thread #1: tid = 0x43f1b, 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0) (lldb) bt * thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0) * frame #0: 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23 frame #1: 0x00007fff7eba114c libsystem_platform.dylib`_os_once_gate_wait + 196 frame #2: 0x00007fff7eb9ea6f libsystem_platform.dylib`_os_alloc_once + 42 frame #3: 0x00007fff7eb99d2a libsystem_notify.dylib`_notify_fork_child + 211 frame #4: 0x00007fff7b99eb13 libSystem.B.dylib`libSystem_atfork_child + 49 frame #5: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40 frame #6: 0x00000000010603ff gitaly`runtime.syscall + 31 frame #7: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112 frame #8: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624 frame #9: 0x000000000104d40e gitaly`syscall.rawSyscall + 78 frame #10: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226 frame #11: 0x00000000010a31a0 gitaly`syscall.forkExec + 864 frame #12: 0x00000000010c4a12 gitaly`os.startProcess + 610 frame #13: 0x00000000010c449c gitaly`os.StartProcess + 124 frame #14: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180 frame #15: 0x0000000001680563 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435 ```

When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a go build command?

agnivade commented 5 years ago

I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening.

ianlancetaylor commented 5 years ago

The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that?

jacobvosmaer commented 5 years ago

it should be possible to write a similar test case in C. Anybody want to try that?

It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it.

jacobvosmaer commented 5 years ago

OK that was not hard. Just translated my Go example to C.

https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88

As before, leave it running for a while in a restart loop, and you get a weird forked process left behind.

jacobvosmaer commented 5 years ago

The C reproducer highlights how odd this is even more: all it tries to do is fork followed by exit, yet the leaking forked processes are not exiting.

ianlancetaylor commented 5 years ago

Thanks!

Does anybody want to try to report this to Apple?

To fix this in Go I think it would suffice to have exit in runtime/sys_darwin.go acquire a write lock on syscall.ForkLock. Unfortunately that will be a bit of a hassle since the runtime package doesn't depend on the syscall package, and since syscall.ForkLock is exported it can't move.

jacobvosmaer commented 5 years ago

It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6d67ef06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6d73bd52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6d7394cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x00007fff6d5ed2b4 libsystem_c.dylib`__cxa_finalize_ranges + 36
    frame #4: 0x00007fff6d5ed6b3 libsystem_c.dylib`exit + 55
    frame #5: 0x0000000106b09e21 fork-leak`do_fork + 33
    frame #6: 0x00007fff6d73b2eb libsystem_pthread.dylib`_pthread_body + 126
    frame #7: 0x00007fff6d73e249 libsystem_pthread.dylib`_pthread_start + 66
    frame #8: 0x00007fff6d73a40d libsystem_pthread.dylib`thread_start + 13

What is particularly worrying here is that after I remove the leaked processes with kill -9, some part of the system (kernel? userspace kernel helper?) keeps spinning the CPU. It shows in the activity monitor as constant "system" load. I can only get rid of that by rebooting.

Screenshot 2019-08-29 at 09 31 49

I don't feel confident reporting this to Apple for a number of reasons:

I don't have enough experience with C and Unix programming to stand behind my C reproducer
I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

jqmichael commented 4 years ago

I don't feel confident reporting this to Apple for a number of reasons:

I don't have enough experience with C and Unix programming to stand behind my C reproducer

I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

That's certainly true. But we are still making progress even if Apple proved we were doing wrong, isn't it?

Kila2 commented 3 years ago

I tried various methods. Can we reslove?

Process 38893 stopped
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
libsystem_platform.dylib`_os_once_gate_corruption_abort:
->  0x7fff731566b9 <+23>: ud2    

libsystem_platform.dylib`_os_lock_recursive_abort:
    0x7fff731566bb <+0>:  movl   %edi, %eax
    0x7fff731566bd <+2>:  leaq   0x7d1(%rip), %rcx         ; "BUG IN CLIENT OF LIBPLATFORM: Trying to recursively lock an os_lock"
    0x7fff731566c4 <+9>:  movq   %rcx, 0x2651c3ad(%rip)    ; gCRAnnotations + 8
Target 0: (bit) stopped.

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff73152feb libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff73150f89 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff73038c75 libsystem_coreservices.dylib`_libcoreservices_fork_child + 78
    frame #4: 0x00007fff6ff40aa9 libSystem.B.dylib`libSystem_atfork_child + 44
    frame #5: 0x00007fff72fba8ad libsystem_c.dylib`fork + 40
    frame #6: 0x000000000106f73f bit`runtime.syscall + 31
    frame #7: 0x000000000106d5b0 bit`runtime.asmcgocall + 112
    frame #8: 0x000000000103c8c0 bit`runtime.startTheWorldWithSema + 576
    frame #9: 0x000000000106a86c bit`syscall.rawSyscall + 76
    frame #10: 0x00000000010afca6 bit`syscall.forkAndExecInChild + 230
    frame #11: 0x00000000010b0eb8 bit`syscall.forkExec + 1080
    frame #12: 0x00000000010da23b bit`os.startProcess + 667
    frame #13: 0x00000000010d9c3c bit`os.StartProcess + 124
    frame #14: 0x0000000001342505 bit`os/exec.(*Cmd).Start + 1317
    frame #15: 0x0000000001341f6b bit`os/exec.(*Cmd).Run + 43
    frame #16: 0x0000000001343171 bit`os/exec.(*Cmd).CombinedOutput + 145

cherrymui commented 3 years ago

For the C reproducer in https://github.com/golang/go/issues/33565#issuecomment-525877539 , I can reproduce on macOS. I can also reproduce it on Linux, with even a much higher rate of leaking children. It seems the child stuck at exit call:

#0  __lll_lock_wait_private (futex=futex@entry=0x7f0507afaae8 <__exit_funcs_lock>) at ./lowlevellock.c:35
#1  0x00007f05079785f8 in __run_exit_handlers (status=0, listp=0x7f0507af8718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:56
#2  0x00007f050797867a in __GI_exit (status=<optimized out>) at exit.c:139
#3  0x0000561c884091d5 in do_fork ()
#4  0x00007f0507b07ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0507a37def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If we exec instead of exit in child, it doesn't seem to stuck. So it may be a different issue. The original bug seems that it stuck at the fork call instead of exit.

cherrymui commented 1 year ago

This may be similar to #53863 ? (which is a macOS system bug and may be fixed in a future macOS version.)

rsc commented 1 year ago

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

Kila2 commented 1 year ago

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

try refactor code async logic to sync or before syscall wait async code finish. I guess maybe when after syscall forked. they will inherit parent process and parent process is in async state. the child process memory is dirty

rsc commented 1 year ago

I have a fix for the Go stacks that were reported in this issue. I am not as sure about the C reproducer - it may be that it tickles a different fork+exec bug in Apple libc.

gopherbot commented 1 year ago

Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging

gopherbot commented 1 year ago

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

gopherbot commented 1 year ago

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

golang / go