golang / go

The Go programming language
https://go.dev

runtime/pprof: TestProfilerStackDepth/mutex failures #68562

Open

gopherbot opened 1 month ago

gopherbot commented 1 month ago
#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
    running tests:
        TestProfilerStackDepth (2m44s)
        TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
    C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
    C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
    C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

gopherbot commented 1 month ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"
2024-07-23 21:29 gotip-windows-386 go@c18ff292 runtime/pprof.TestProfilerStackDepth/mutex [ABORT] (log)

=== RUN   TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
    running tests:
        TestProfilerStackDepth (2m44s)
        TestProfilerStackDepth/mutex (2m43s)

goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
    C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
    C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
    C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
    C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb

watchflakes

mknyszek commented 1 month ago

I spent some time looking into this failure, and it's clearly some kind of GC deadlock. Notably, the goroutine trying to write out the mutex profile calls into malloc, then into GC assists, and gets stuck somewhere in assists. Unfortunately it's unclear where. Some candidates include:

The lack of any smoking guns in the stack dump suggests to me that sysmon may be involved. Or, it's some kind of memory corruption. I also wonder if this is a platform-specific failure; windows-386 is a bit funky.

A final note is that all the GC mark workers in the stack dump are parked. This kinda makes sense -- the GC is not done, but they can't find any work to do, because in 2 minutes they almost certainly found everything they could aside from wherever that one goroutine is stuck.
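
For reference, here is a minimal hand-written sketch of the write path described above. This is not the test's actual code; it only illustrates how writing the mutex profile at rate 1 allocates, and how that allocation can be pulled into a GC assist. The contention setup and names are invented for the example.

```go
package main

import (
	"bytes"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	// Record every mutex contention event, as the test reportedly does.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	// Produce at least one contention event so the profile has a sample.
	var mu sync.Mutex
	var wg sync.WaitGroup
	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock() // blocks until main unlocks
		mu.Unlock()
	}()
	runtime.Gosched() // best effort: let the other goroutine block first
	mu.Unlock()       // a contended unlock is recorded at rate 1
	wg.Wait()

	// Serializing the profile allocates; under GC pressure that allocation
	// can be charged a GC assist, which is the call chain the stuck
	// goroutine in the stack dump is sitting in.
	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		panic(err)
	}
}
```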

mknyszek commented 1 month ago

Actually, the assist queue lock theory is not possible. We know for a fact that we're in gcAssistAlloc1, since we're on the system stack and the line numbers match up with having entered the system stack there.

So, the likely explanation is difficulty preempting some goroutine (I can't think of any other reason we'd block so long...).

mknyszek commented 1 month ago

I have a new possible lead/theory. This test is somewhat special in that the mutex profile rate is 1. What if there's a place where we're acquiring and/or releasing a runtime lock somewhere deep in gcDrainN, and we end up recording that lock in the mutex profile and blocking somewhere in there? I don't immediately see why that would cause a deadlock, but since this is a new kind of failure, recent changes (such as those from the Go 1.23 cycle) may be relevant.

Though, this doesn't really line up with the evidence, since there should then be some other goroutine waiting on a lock. Maybe it's a self-deadlock? The write-mutex-profile code interacts with the same locks as the runtime unlock path, though I don't immediately see how we could self-deadlock in exactly the place where we're stuck. Even if we were holding a lock (which it doesn't look like we are), that should prevent recording further locks.
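
For context on what the subtest exercises, here is a rough hand-written approximation (not the actual pprof_test.go code; the helper name and depth are invented): set the mutex profile fraction to 1, produce a contention event at the bottom of a deliberately deep call stack, and then look at the stacks the profiler recorded.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

// deepContention recurses depth more times and then produces a single
// mutex contention event at the bottom of the stack, so the recorded
// sample should carry a deep stack trace.
func deepContention(depth int, mu *sync.Mutex) {
	if depth > 0 {
		deepContention(depth-1, mu)
		return
	}
	var wg sync.WaitGroup
	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		mu.Unlock()
	}()
	runtime.Gosched() // best effort: let the other goroutine block first
	mu.Unlock()       // contended unlock, recorded at rate 1
	wg.Wait()
}

func main() {
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	var mu sync.Mutex
	deepContention(128, &mu)

	// The real test decodes the proto-encoded profile and asserts on the
	// recorded frame counts; here we just dump the human-readable form.
	if err := pprof.Lookup("mutex").WriteTo(os.Stdout, 1); err != nil {
		panic(err)
	}
}
```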

mknyszek commented 1 month ago

In the last triage meeting, I said we'd just wait for more failures. So, setting the WaitingForInfo label.