golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.69k stars 17.62k forks source link

os/exec: tests hang on macOS due to Apple libc fork bugs #56784

Closed rsc closed 1 year ago

rsc commented 1 year ago

On my x86 Mac laptop using macOS 12.6.1, all.bash often hangs in the os/exec test. In particular, this never finishes:

% cd go/src/os/exec
% for (i in `{seq 100}) go test -short -count=1

The chance of a hang in any given iteration is something like 50%. It's possible this is related to #33565, but I'm opening a separate bug just in case, and to focus the discussion on the fact that our own os/exec tests don't pass.

If I attach to the hung process in lldb, I was originally seeing backtraces like:

  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240

This specific hang seems to match https://github.com/dart-lang/sdk/issues/29539, and inspection of the Apple libc code shows that the problem is a race with an os_alloc_once that is in progress in the parent when the address space is split, making the same call die in the child. I changed the Go runtime to do an early call to notify_is_valid_token(0) in osinit. That call is a no-op except that it guarantees the os_alloc_once has been done already, so it cannot race with any future forks.

With that fix, I get a different hang:

  * frame #0: 0x00007ff8015e53ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff80161fa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00007ff8014c2aca libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 115
    frame #3: 0x00007ff8014b4194 libobjc.A.dylib`initializeNonMetaClass + 646
    frame #4: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #5: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #6: 0x00007ff8014b3c18 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 232
    frame #7: 0x00007ff8014b3995 libobjc.A.dylib`lookUpImpOrForward + 1087
    frame #8: 0x00007ff8014b2f9b libobjc.A.dylib`_objc_msgSend_uncached + 75
    frame #9: 0x00007ff80137145e libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007ff80c41ac8e libSystem.B.dylib`libSystem_atfork_child + 63
    frame #11: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #12: 0x000000000106d5df exec.test`runtime.syscall.abi0 + 31
    frame #13: 0x000000000106b444 exec.test`runtime.asmcgocall.abi0 + 100
    frame #14: 0x0000000001069589 exec.test`runtime.systemstack.abi0 + 73
    frame #15: 0x00000000010682ab exec.test`syscall.rawSyscall + 139
    frame #16: 0x0000000001077250 exec.test`syscall.forkAndExecInChild + 240

This one seems to match what @jacobvosmaer posted in https://github.com/golang/go/issues/33565#issuecomment-522674590.

I can't find the libobjc source code so I'm not sure what a workaround for xpc_atfork_child might be.

It must be that C programs on macOS do not use fork. I looked into posix_spawn but it looks like we don't have any other ports that use that.

We need to figure something out for Go 1.20 though.

randall77 commented 1 year ago

My Mac Pro (x86) running 13.0.1 has no trouble, with either Go 1.19.3 or tip.

rsc commented 1 year ago

Updating my laptop to 13.0 did not work. It did work - 100 PASS in a row - to add another osinit-time call, this one to xpc_atfork_child. That seems to warm up whatever fast paths it will need in the child to avoid deadlocks. As I write this I've been offered 13.0.1 so I will try that update just to double-check that this is really needed.

gopherbot commented 1 year ago

Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging

rsc commented 1 year ago

13.0.1 didn't help. Same problem on my M1 MacBook Pro with 12.0.1. go.dev/cl/451735 fixes them both.

rsc commented 1 year ago

We have since discovered that what is unique about my laptop compared to others is that I was running my tests from under a program written and linked against CoreFoundation, which put a magic __CF_USER_TEXT_ENCODING=0x552D:0x0:0x0 variable in my environment. (The 0x552D is my uid.) The os/exec test is linked against CoreFoundation too. If I clear that variable or mangle it to have the wrong uid, causing it to be recomputed, then different things happen at startup and seem to bump things around a bit so that these two fork hangs are far less likely.

We'll keep the fix, it's just even more mysterious now.

jacobvosmaer commented 1 year ago

I filed the related issue https://github.com/golang/go/issues/33565. I just checked, and I also have this __CF_USER_TEXT_ENCODING environment variable set in my development environment (plan9port Acme).

This does not tell us anything new but it corroborates what Russ wrote above.

rsc commented 1 year ago

@gopherbot please backport

gopherbot commented 1 year ago

Backport issue(s) opened: #56836 (for 1.18), #56837 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot commented 1 year ago

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

gopherbot commented 1 year ago

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

gopherbot commented 1 year ago

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

dwlnetnl commented 1 year ago

There is source code available for the Objective-C runtime on opensource.apple.com which redirects to https://github.com/apple-oss-distributions/objc4