Closed jacobvosmaer closed 1 year ago
My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead.
There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016
Can't reproduce on linux.
> If the parent is gone the child should exit and do nothing.
I don't believe that is necessarily correct. If a parent exits, the child becomes an orphan and is reparented to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely.
@randall77 for macOS.
I also cannot reproduce on linux.
> I understand this issue is about cleaning up the child before the spawn has happened completely.
Yes. In my toy example, I'm trying to spawn /bin/sh, which will exit harmlessly of its own accord, even if its parent is gone. The problem is that when this bug strikes, we never get to the exec that creates the /bin/sh process.
I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016
I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below.
When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a go build command?
I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening.
The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that?
> it should be possible to write a similar test case in C. Anybody want to try that?
It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it.
OK that was not hard. Just translated my Go example to C.
https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88
As before, leave it running for a while in a restart loop, and you get a weird forked process left behind.
The C reproducer highlights how odd this is even more: all it tries to do is fork followed by exit, yet the leaking forked processes are not exiting.
Thanks!
Does anybody want to try to report this to Apple?
To fix this in Go I think it would suffice to have exit in runtime/sys_darwin.go acquire a write lock on syscall.ForkLock. Unfortunately that will be a bit of a hassle since the runtime package doesn't depend on the syscall package, and since syscall.ForkLock is exported it can't move.
It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.
* thread #1, stop reason = signal SIGSTOP
* frame #0: 0x00007fff6d67ef06 libsystem_kernel.dylib`__psynch_mutexwait + 10
frame #1: 0x00007fff6d73bd52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
frame #2: 0x00007fff6d7394cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
frame #3: 0x00007fff6d5ed2b4 libsystem_c.dylib`__cxa_finalize_ranges + 36
frame #4: 0x00007fff6d5ed6b3 libsystem_c.dylib`exit + 55
frame #5: 0x0000000106b09e21 fork-leak`do_fork + 33
frame #6: 0x00007fff6d73b2eb libsystem_pthread.dylib`_pthread_body + 126
frame #7: 0x00007fff6d73e249 libsystem_pthread.dylib`_pthread_start + 66
frame #8: 0x00007fff6d73a40d libsystem_pthread.dylib`thread_start + 13
What is particularly worrying here is that after I remove the leaked processes with kill -9, some part of the system (kernel? userspace kernel helper?) keeps spinning the CPU. It shows in the activity monitor as constant "system" load. I can only get rid of that by rebooting.
I don't feel confident reporting this to Apple for a number of reasons:
- I don't have enough experience with C and Unix programming to stand behind my C reproducer
- I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?
If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.
That's certainly true. But we would still be making progress even if Apple showed that we were doing something wrong, wouldn't we?
I tried various methods. Can we resolve this?
Process 38893 stopped
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
libsystem_platform.dylib`_os_once_gate_corruption_abort:
-> 0x7fff731566b9 <+23>: ud2
libsystem_platform.dylib`_os_lock_recursive_abort:
0x7fff731566bb <+0>: movl %edi, %eax
0x7fff731566bd <+2>: leaq 0x7d1(%rip), %rcx ; "BUG IN CLIENT OF LIBPLATFORM: Trying to recursively lock an os_lock"
0x7fff731566c4 <+9>: movq %rcx, 0x2651c3ad(%rip) ; gCRAnnotations + 8
Target 0: (bit) stopped.
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
* frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
frame #1: 0x00007fff73152feb libsystem_platform.dylib`_os_once_gate_wait + 196
frame #2: 0x00007fff73150f89 libsystem_platform.dylib`_os_alloc_once + 42
frame #3: 0x00007fff73038c75 libsystem_coreservices.dylib`_libcoreservices_fork_child + 78
frame #4: 0x00007fff6ff40aa9 libSystem.B.dylib`libSystem_atfork_child + 44
frame #5: 0x00007fff72fba8ad libsystem_c.dylib`fork + 40
frame #6: 0x000000000106f73f bit`runtime.syscall + 31
frame #7: 0x000000000106d5b0 bit`runtime.asmcgocall + 112
frame #8: 0x000000000103c8c0 bit`runtime.startTheWorldWithSema + 576
frame #9: 0x000000000106a86c bit`syscall.rawSyscall + 76
frame #10: 0x00000000010afca6 bit`syscall.forkAndExecInChild + 230
frame #11: 0x00000000010b0eb8 bit`syscall.forkExec + 1080
frame #12: 0x00000000010da23b bit`os.startProcess + 667
frame #13: 0x00000000010d9c3c bit`os.StartProcess + 124
frame #14: 0x0000000001342505 bit`os/exec.(*Cmd).Start + 1317
frame #15: 0x0000000001341f6b bit`os/exec.(*Cmd).Run + 43
frame #16: 0x0000000001343171 bit`os/exec.(*Cmd).CombinedOutput + 145
For the C reproducer in https://github.com/golang/go/issues/33565#issuecomment-525877539, I can reproduce on macOS. I can also reproduce it on Linux, with an even higher rate of leaking children. It seems the child is stuck in the exit call:
#0 __lll_lock_wait_private (futex=futex@entry=0x7f0507afaae8 <__exit_funcs_lock>) at ./lowlevellock.c:35
#1 0x00007f05079785f8 in __run_exit_handlers (status=0, listp=0x7f0507af8718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:56
#2 0x00007f050797867a in __GI_exit (status=<optimized out>) at exit.c:139
#3 0x0000561c884091d5 in do_fork ()
#4 0x00007f0507b07ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f0507a37def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
If we exec instead of exit in the child, it doesn't seem to get stuck. So it may be a different issue. In the original bug, the child seems to be stuck in the fork call rather than in exit.
This may be similar to #53863, which is a macOS system bug and may be fixed in a future macOS version.
I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
* frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1
Is there anything we can do?
Try refactoring the asynchronous logic to be synchronous, or wait for the asynchronous code to finish before making the syscall. My guess is that the forked child inherits the parent's memory while the parent is in the middle of asynchronous work, so the child's memory is left in an inconsistent state.
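One way to read that suggestion in Go is to make the main goroutine wait for every spawn to finish before it returns. This is a minimal sketch of that pattern, not a fix for the underlying bug; the function name runAll is hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// runAll (hypothetical helper) starts each command in its own goroutine and
// blocks until all of them have finished. Each argv must be non-empty.
// Errors are returned in the same order as the input.
func runAll(cmds [][]string) []error {
	errs := make([]error, len(cmds))
	var wg sync.WaitGroup
	for i, argv := range cmds {
		wg.Add(1)
		go func(i int, argv []string) {
			defer wg.Done()
			errs[i] = exec.Command(argv[0], argv[1:]...).Run()
		}(i, argv)
	}
	wg.Wait() // main cannot exit while a fork+exec is still in progress
	return errs
}

func main() {
	errs := runAll([][]string{
		{"/bin/sh", "-c", "true"},
		{"/bin/sh", "-c", "exit 3"},
	})
	for i, err := range errs {
		fmt.Printf("command %d: %v\n", i, err)
	}
}
```

Waiting like this avoids the race between process exit and an in-flight fork, but it does not help if the program must exit early, for example on a fatal error in another goroutine.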
I have a fix for the Go stacks that were reported in this issue. I am not as sure about the C reproducer - it may be that it tickles a different fork+exec bug in Apple libc.
Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging
Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround
Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin
Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin
Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin
Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3
What did you do?
If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.
https://play.golang.org/p/1HPscvoAiwj
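For readers who cannot open the playground link, the shape of the problem is roughly as follows. This is a sketch written from the description above, not the code behind the link; the helper name spawnInBackground is hypothetical, and a single run is unlikely to hit the race (the report describes running it in a restart loop on macOS).

```go
package main

import "os/exec"

// spawnInBackground (hypothetical helper) starts /bin/sh from a goroutine
// and returns a channel that receives the result of the spawn.
func spawnInBackground() <-chan error {
	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		done <- exec.Command("/bin/sh", "-c", "true").Run()
	}()
	return done
}

func main() {
	// Deliberately ignore the channel: main returns, and the process
	// exits, while the goroutine may still be between fork and exec.
	spawnInBackground()
}
```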
What did you expect to see?
If the main thread exits Go should not leave behind a forked process.
What did you see instead?
A forked process, sometimes using 100% CPU
Does this issue reproduce with the latest release (go1.12.7)?
Yes.
System details