Thanks for digging into this!
There are a lot of threads tangled up here. Some questions: can you run crash(8) to get a stack trace for the thread waiting in wait6, using the ps and bt/a commands? (See ddb(4) for information about the commands that crash(8) supports.)

Found new dashboard test flakes for:
#!watchflakes
post <- goos == "netbsd" && `^\s*runtime\.gopark`
2022-10-12 22:55 netbsd-386-9_0 go@1a7f08cf
Hmm, that one could plausibly be a deadlock in a Go subprocess, which may or may not be related (the test really ought to send SIGQUIT to help debug). I'll update the watchflakes pattern to try to filter it out.
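(For illustration, a minimal sketch of what "send SIGQUIT to help debug" could look like in such a test; the helper name and deadline here are hypothetical, not the actual test code:)

```go
package exectest

import (
	"os/exec"
	"syscall"
	"testing"
	"time"
)

// runWithQuitDump is a hypothetical helper: it runs cmd and, if the child
// outlives the (illustrative) one-minute deadline, sends SIGQUIT so a Go
// subprocess dumps its goroutine stacks before the test gives up.
func runWithQuitDump(t *testing.T, cmd *exec.Cmd) {
	t.Helper()
	if err := cmd.Start(); err != nil {
		t.Fatal(err)
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		if err != nil {
			t.Error(err)
		}
	case <-time.After(time.Minute):
		cmd.Process.Signal(syscall.SIGQUIT) // request a goroutine dump from the child
		t.Fatalf("subprocess hung; SIGQUIT sent, Wait returned: %v", <-done)
	}
}
```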
The other failures that I think are related to this issue are:
with lots of stacks like:
runtime.gopark(0xc00049cb38?, 0x4222a5?, 0xa0?, 0x54?, 0x4b097e?)
/tmp/workdir/go/src/runtime/proc.go:378 +0xd6 fp=0xc00049caf8 sp=0xc00049cad8 pc=0x439bf6
runtime.goparkunlock(...)
/tmp/workdir/go/src/runtime/proc.go:384
runtime.semacquire1(0xf94eb8, 0x67?, 0x0, 0x0, 0xc0?)
/tmp/workdir/go/src/runtime/sema.go:160 +0x20f fp=0xc00049cb60 sp=0xc00049caf8 pc=0x44ad2f
runtime.semacquire(...)
/tmp/workdir/go/src/runtime/sema.go:111
runtime.gcStart({0x7f7ff7e9a878?, 0x480317?, 0x49cc10?})
/tmp/workdir/go/src/runtime/mgc.go:607 +0xf5 fp=0xc00049cbe8 sp=0xc00049cb60 pc=0x41a675
runtime.mallocgc(0x90, 0xb026c0, 0x1)
/tmp/workdir/go/src/runtime/malloc.go:1145 +0x7f4 fp=0xc00049cc58 sp=0xc00049cbe8 pc=0x40ec34
and the aforementioned local deadlock with stacks like
runtime.gopark(0x9ae040?, 0x425e45?, 0xe6?, 0x79?, 0x2000?)
/tmp/workdir/go/src/runtime/proc.go:378 +0xd6 fp=0xc000231b70 sp=0xc000231b50 pc=0x4392b6
runtime.goparkunlock(...)
/tmp/workdir/go/src/runtime/proc.go:384
runtime.semacquire1(0x9a873c, 0xe7?, 0x0, 0x0, 0x0?)
/tmp/workdir/go/src/runtime/sema.go:160 +0x20f fp=0xc000231bd8 sp=0xc000231b70 pc=0x44a2cf
runtime.semacquire(...)
/tmp/workdir/go/src/runtime/sema.go:111
runtime.gcMarkDone()
/tmp/workdir/go/src/runtime/mgc.go:787 +0x2f fp=0xc000231c38 sp=0xc000231bd8 pc=0x41a56f
runtime.gcAssistAlloc(0xc0004d0ea0)
/tmp/workdir/go/src/runtime/mgcmark.go:483 +0x1fc fp=0xc000231c98 sp=0xc000231c38 pc=0x41e15c
runtime.mallocgc(0x200, 0x0, 0x0)
/tmp/workdir/go/src/runtime/malloc.go:913 +0x11a fp=0xc000231d08 sp=0xc000231c98 pc=0x40deda
In https://storage.googleapis.com/go-build-log/c2fb4188/netbsd-amd64-9_0_06fa9a39.log, the relevant syscalls seem to be:
goroutine 0 [idle]:
runtime.kevent()
/tmp/workdir/go/src/runtime/sys_netbsd_amd64.s:432 +0x23 fp=0x7f7fcbbff338 sp=0x7f7fcbbff330 pc=0x46ba03
runtime.netpoll(0x9847fe16ad?)
/tmp/workdir/go/src/runtime/netpoll_kqueue.go:130 +0x10c fp=0x7f7fcbbffdc8 sp=0x7f7fcbbff338 pc=0x432fac
runtime.findRunnable()
/tmp/workdir/go/src/runtime/proc.go:2952 +0x765 fp=0x7f7fcbbffed0 sp=0x7f7fcbbffdc8 pc=0x43e605
runtime.schedule()
/tmp/workdir/go/src/runtime/proc.go:3341 +0xb1 fp=0x7f7fcbbfff08 sp=0x7f7fcbbffed0 pc=0x43f791
runtime.mstart1()
/tmp/workdir/go/src/runtime/proc.go:1499 +0xcd fp=0x7f7fcbbfff30 sp=0x7f7fcbbfff08 pc=0x43bb8d
runtime.mstart0()
/tmp/workdir/go/src/runtime/proc.go:1449 +0x79 fp=0x7f7fcbbfff60 sp=0x7f7fcbbfff30 pc=0x43ba99
runtime.mstart()
/tmp/workdir/go/src/runtime/asm_amd64.s:390 +0x5 fp=0x7f7fcbbfff68 sp=0x7f7fcbbfff60 pc=0x467945
That's in a call to __kevent50: https://cs.opensource.google/go/go/+/master:src/runtime/sys_netbsd_amd64.s;l=430-432;drc=cb6fc99b32cd5ec80e98f0eaec18bcfffba572ca
goroutine 32 [syscall]:
runtime.notetsleepg(0x419eb1?, 0x9d80e0?)
/tmp/workdir/go/src/runtime/lock_sema.go:294 +0x34 fp=0xc000057bf0 sp=0xc000057ba8 pc=0x40cef4
runtime.gcBgMarkStartWorkers()
/tmp/workdir/go/src/runtime/mgc.go:1192 +0x38 fp=0xc000057c10 sp=0xc000057bf0 pc=0x41b678
runtime.gcStart({0x7f7ff7e995b8?, 0xc00018ec35?, 0x532dd0?})
/tmp/workdir/go/src/runtime/mgc.go:645 +0x21b fp=0xc000057c98 sp=0xc000057c10 pc=0x41a11b
runtime.mallocgc(0x200, 0x0, 0x0)
/tmp/workdir/go/src/runtime/malloc.go:1145 +0x7f4 fp=0xc000057d08 sp=0xc000057c98 pc=0x40e5b4
That's acquiring a semaphore, but the stack trace is truncated: https://cs.opensource.google/go/go/+/master:src/runtime/lock_sema.go;l=294-295;drc=19309779ac5e2f5a2fd3cbb34421dafb2855ac21
It's probably in ___lwp_park60?
goroutine 39 [syscall]:
syscall.Syscall6(0x1e1, 0x1, 0x36e4, 0x0, 0x10020, 0x0, 0x0)
/tmp/workdir/go/src/syscall/asm_unix_amd64.s:39 +0x5 fp=0xc00005cd28 sp=0xc00005cd20 pc=0x478245
os.(*Process).blockUntilWaitable(0xc0001ca000)
/tmp/workdir/go/src/os/wait_wait6.go:33 +0x5a fp=0xc00005cd90 sp=0xc00005cd28 pc=0x49d6da
os.(*Process).wait(0xc0001ca000)
/tmp/workdir/go/src/os/exec_unix.go:22 +0x28 fp=0xc00005cdf0 sp=0xc00005cd90 pc=0x497d08
os.(*Process).Wait(...)
/tmp/workdir/go/src/os/exec.go:132
goroutine 45 [syscall]:
syscall.Syscall6(0x1e1, 0x1, 0x37ca, 0x0, 0x10020, 0x0, 0x0)
/tmp/workdir/go/src/syscall/asm_unix_amd64.s:39 +0x5 fp=0xc000205dd8 sp=0xc000205dd0 pc=0x478245
os.(*Process).blockUntilWaitable(0xc00024c090)
/tmp/workdir/go/src/os/wait_wait6.go:33 +0x5a fp=0xc000205e40 sp=0xc000205dd8 pc=0x49d6da
os.(*Process).wait(0xc00024c090)
/tmp/workdir/go/src/os/exec_unix.go:22 +0x28 fp=0xc000205ea0 sp=0xc000205e40 pc=0x497d08
os.(*Process).Wait(...)
/tmp/workdir/go/src/os/exec.go:132
Those are both wait6 with P_PID, a specific PID, WEXITED|WNOWAIT, and null pointers for everything else.
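(For reference, a hedged Go sketch of that same shape of call. The constants below are inferred from the Syscall6 frames above, not copied from the os package source:)

```go
package netbsdwait

import "syscall"

// Constants inferred from the Syscall6 frames above (NetBSD values):
// 0x1e1 is wait6, 0x1 is P_PID, and 0x10020 is WEXITED|WNOWAIT.
const (
	sysWait6 = 0x1e1   // wait6 syscall number seen in the trace
	pPID     = 0x1     // wait for a specific process ID
	wEXITED  = 0x00020 // report exited children
	wNOWAIT  = 0x10000 // leave the child waitable (do not reap it)
)

// wait6NoReap blocks until pid becomes waitable without reaping it: the same
// shape of call that os.(*Process).blockUntilWaitable makes in those stacks,
// with a specific PID, WEXITED|WNOWAIT, and null pointers for everything else.
func wait6NoReap(pid int) error {
	_, _, errno := syscall.Syscall6(sysWait6,
		pPID, uintptr(pid), 0, // idtype, id, *status (nil)
		wEXITED|wNOWAIT, // options == 0x10020 in the trace
		0, 0) // *wrusage, *siginfo (nil)
	if errno != 0 {
		return errno
	}
	return nil
}
```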
https://storage.googleapis.com/go-build-log/267326a7/netbsd-amd64-9_0_fc356467.log has these:
goroutine 0 [idle]:
runtime.lwp_park()
/tmp/workdir/go/src/runtime/sys_netbsd_amd64.s:100 +0x23 fp=0x7f7fcadefd00 sp=0x7f7fcadefcf8 pc=0x46ce83
runtime.semasleep(0xffffffffffffffff)
/tmp/workdir/go/src/runtime/os_netbsd.go:181 +0x125 fp=0x7f7fcadefd60 sp=0x7f7fcadefd00 pc=0x433e85
runtime.notesleep(0x4460b9?)
/tmp/workdir/go/src/runtime/lock_sema.go:181 +0x71 fp=0x7f7fcadefd98 sp=0x7f7fcadefd60 pc=0x40d191
runtime.mPark(...)
/tmp/workdir/go/src/runtime/proc.go:1524
runtime.stopm()
/tmp/workdir/go/src/runtime/proc.go:2315 +0x8c fp=0x7f7fcadefdc8 sp=0x7f7fcadefd98 pc=0x43da4c
runtime.findRunnable()
/tmp/workdir/go/src/runtime/proc.go:2988 +0xabc fp=0x7f7fcadefed0 sp=0x7f7fcadefdc8 pc=0x43f29c
(___lwp_park60)
goroutine 15 [syscall]:
runtime.notetsleepg(0x41a531?, 0xfc54a0?)
/tmp/workdir/go/src/runtime/lock_sema.go:294 +0x34 fp=0xc000604ce0 sp=0xc000604c98 pc=0x40d574
runtime.gcBgMarkStartWorkers()
/tmp/workdir/go/src/runtime/mgc.go:1192 +0x38 fp=0xc000604d00 sp=0xc000604ce0 pc=0x41bcf8
runtime.gcStart({0x7f7ff7e9ad28?, 0x15?, 0x604dc0?})
/tmp/workdir/go/src/runtime/mgc.go:645 +0x21b fp=0xc000604d88 sp=0xc000604d00 pc=0x41a79b
(also ___lwp_park60, but reached in a different way)
There are no other syscalls in flight that I can see, but I'm not very familiar with runtime debugging (especially on NetBSD).
I will also note that ___lwp_park60 is the same system call observed in the C program in https://github.com/golang/go/issues/50138#issuecomment-1276074607. 🙂
https://storage.googleapis.com/go-build-log/5e8757a0/netbsd-amd64-9_0_78a520df.log (a TryBot on https://go.dev/cl/442575) is another example fitting the pattern.
The findRunnable goroutine is idling in lwp_park, and another goroutine is marked runnable but not running. That test specifically tries to produce a signal during an exec, and spawns several SignalDuringExec goroutines of which only one is still running, so we can infer that the other goroutines must have previously completed their fork, exec, wait6, and wait4 calls.
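To make the shape of that test concrete, here is a hedged sketch of the general pattern; the signal choice, helper command, and counts are illustrative only, not the actual testprognet/signalexec.go source:

```go
package main

import (
	"os"
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// Several goroutines repeatedly fork/exec short-lived subprocesses (each Run
// involves fork, exec, wait6, and wait4 underneath) while another goroutine
// keeps delivering a harmless signal to this process, so that some signals
// land during an exec.
func main() {
	stop := make(chan struct{})
	go func() {
		pid := os.Getpid()
		for {
			select {
			case <-stop:
				return
			default:
				syscall.Kill(pid, syscall.SIGWINCH) // illustrative signal choice
				time.Sleep(time.Millisecond)
			}
		}
	}()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // illustrative goroutine count
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				if err := exec.Command("true").Run(); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
	close(stop)
}
```

The stacks of interest from that log: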
goroutine 0 [idle]:
runtime.lwp_park()
/tmp/workdir/go/src/runtime/sys_netbsd_amd64.s:100 +0x23 fp=0x7f7fcbfffce0 sp=0x7f7fcbfffcd8 pc=0x464a63
runtime.semasleep(0xffffffffffffffff)
/tmp/workdir/go/src/runtime/os_netbsd.go:181 +0x125 fp=0x7f7fcbfffd40 sp=0x7f7fcbfffce0 pc=0x431145
runtime.notesleep(0x442e39?)
/tmp/workdir/go/src/runtime/lock_sema.go:181 +0x71 fp=0x7f7fcbfffd78 sp=0x7f7fcbfffd40 pc=0x40c2f1
runtime.mPark(...)
/tmp/workdir/go/src/runtime/proc.go:1524
runtime.stopm()
/tmp/workdir/go/src/runtime/proc.go:2315 +0x8c fp=0x7f7fcbfffda8 sp=0x7f7fcbfffd78 pc=0x43a8ec
runtime.findRunnable()
/tmp/workdir/go/src/runtime/proc.go:2988 +0xabc fp=0x7f7fcbfffeb0 sp=0x7f7fcbfffda8 pc=0x43c13c
runtime.schedule()
/tmp/workdir/go/src/runtime/proc.go:3341 +0xb1 fp=0x7f7fcbfffee8 sp=0x7f7fcbfffeb0 pc=0x43cf71
goroutine 34 [runnable]:
main.SignalDuringExec.func2()
/tmp/workdir/go/src/runtime/testdata/testprognet/signalexec.go:47 fp=0xc00009d7e0 sp=0xc00009d7d8 pc=0x4d4ec0
runtime.goexit()
/tmp/workdir/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00009d7e8 sp=0xc00009d7e0 pc=0x463261
created by main.SignalDuringExec
/tmp/workdir/go/src/runtime/testdata/testprognet/signalexec.go:47 +0x196
___lwp_park60 is the generic syscall for a thread to sleep waiting until another thread has woken it with _lwp_unpark. It's used by pthread_mutex, pthread_cond, rtld's internal locks, and golang's semasleep/semawakeup -- pretty much all thread synchronization primitives that aren't connected to I/O will use ___lwp_park60 and _lwp_unpark.
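A hedged Go model of that park/unpark handshake (the buffered channel stands in for the ___lwp_park60/_lwp_unpark pair; this illustrates the protocol only, not the runtime's actual semasleep/semawakeup code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sema models a one-sleeper semaphore in the style described above: the
// waiter decrements a counter and parks if it went negative; the waker
// increments it and issues an unpark if a sleeper may exist. The buffered
// channel plays the role of ___lwp_park60/_lwp_unpark: an unpark issued
// before the sleeper actually parks is remembered, so the wakeup is not lost.
type sema struct {
	count  int32
	wakeup chan struct{}
}

func newSema() *sema { return &sema{wakeup: make(chan struct{}, 1)} }

func (s *sema) acquire() {
	if atomic.AddInt32(&s.count, -1) >= 0 {
		return // a wakeup was already banked; no need to park
	}
	<-s.wakeup // "___lwp_park60": sleep until unparked
}

func (s *sema) release() {
	if atomic.AddInt32(&s.count, 1) <= 0 {
		select {
		case s.wakeup <- struct{}{}: // "_lwp_unpark": wake (or pre-wake) the sleeper
		default:
		}
	}
}

func main() {
	s := newSema()
	done := make(chan struct{})
	go func() { s.acquire(); close(done) }()
	s.release() // correct whether this runs before or after the goroutine parks
	<-done
	fmt.Println("woken")
}
```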
In recent memory, all bugs with missed wakeups that manifested as threads stuck in ___lwp_park60 were a result of mistakes in userland -- sometimes very tricky mistakes like the libpthread/rtld interaction in which the actions rtld took to resolve the libc ___lwp_park60 symbol had the effect of consuming a wakeup that should have caused the ___lwp_park60 syscall to return immediately. Not dispositive, but it would be surprising to see a bug in the syscall.
Can you run crash(8) to get a stack trace for the thread waiting in wait6, using the ps and bt/a commands?
Done. I was able to reproduce a hang by running exec.test in a loop overnight, built from a clean checkout of the go repo at commit f2acc607f8a6f2837ed302cedd70676d8ea1f463. The output from crash is in https://gist.github.com/bcmills/c6a33bf8ce39e7b017891d9b8bc176f9, and I also attached to the relevant processes using gdb to dump userland stacks.

(I also have a log from ktrace -i / kdump, but it's >700 MiB so I'm not going to post it preemptively.)
From what I can tell, this reproducer is consistent with the libpthread deadlock you described in https://github.com/golang/go/issues/50138#issuecomment-1276759706. This particular hang has the structure:

- The exec.test process itself (PID 28501) seems healthy.
  - It has one thread blocked on a wait6 call for a go build subprocess (PID 24110) in TestExtraFiles.
- The go build subprocess appears to be deadlocked: all of the threads are blocked on ___lwp_park60.
  - One of its threads is parked in pthread_detach, and thread 8 is parked in pthread_create.

And, the kicker: when I detached the gdb session that I used to dump the thread stacks for the go build subprocess, it unblocked and the exec.test run completed — exactly like what I observed with the C reproducer. 😅
(Assigning to @bsiegert because this should be fixed as part of #54773.)
Interesting! I thought golang used statically linked executables exclusively, so it didn't occur to me that the rtld/libpthread bug could apply here. But the stack traces in https://gist.github.com/bcmills/c6a33bf8ce39e7b017891d9b8bc176f9 show rtld and libpthread, as does readelf -a go.
Pure-Go binaries are usually statically linked, but cgo binaries are often dynamically linked. Both the os/exec test and the go command are cgo binaries (due to e.g. the use of the net package).
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "netbsd" && `^\s*runtime\.semacquire.*\n\s*\t.+\n\s*runtime\.gc(Start|MarkDone)`
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "netbsd" && (`SIGQUIT` || `panic: test timed out`)
Change https://go.dev/cl/450615 mentions this issue: go: use testenv.Command instead of exec.Command in tests
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "netbsd" && (`SIGQUIT` || `panic: test timed out`)
Updated #55166 to omit the netbsd-.*-9_0 builders now that the 9_3 builders are up.
Found new dashboard test flakes for:
#!watchflakes
post <- goos == "netbsd" && (`SIGQUIT` || `panic: test timed out`)
Those more recent x/tools failures are a different root cause (fallout from #47257, which may require the slower builders to set a longer timeout scale for tests that load compiled export data for all of std).

Filed #56967 for the x/tools timeouts.
Found new dashboard test flakes for:
#!watchflakes
default <- goos == "netbsd" && (`SIGQUIT` || `panic: test timed out`)
The above failure is also not symptomatic of this issue, which still appears to have been resolved by the builder upgrade. I'm going to remove it from the “Test Flakes” project to get watchflakes to stop reopening it.
In CL 442478 I fixed (or at least attempted to fix) Go's use of the wait6 system call on NetBSD for the (*os.Process).Wait hang tracked in #50138. In the process of tracking that down, I observed a possible missed-wakeup bug or deadlock in a C-only reproducer program (https://github.com/golang/go/issues/50138#issuecomment-1276074607).

I am now seeing deadlocks in runtime.gopark on NetBSD builders and SlowBots, and I was able to reproduce the failure mode by running go test os/exec -timeout=10s in a loop on the builder (go test os/exec normally completes within about half a second), and that symptom is at least superficially similar to the hang in the C program (both involve hangs in semaphore-based wakeup paths).

This is the goroutine dump I obtained (with GOTRACEBACK=crash):

The signature of this failure appears to be: goroutines blocked in runtime.gopark, goroutines blocked in runtime.goparkunlock, and at least one goroutine blocked in syscall (but not always the wait6 system call specifically).

(attn @golang/netbsd; CC @golang/runtime)