piob-io opened 2 days ago
More notes:
GODEBUG=asyncpreemptoff=1
GOGC=off
CC @golang/runtime
Here's the head of the log file:
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/test test`
fatal error: runtime: mcall called on m->g0 stack
runtime stack:
runtime.throw({0x721602?, 0x44f1ce?})
/usr/local/go/src/runtime/panic.go:1067 +0x48 fp=0x7f5eafdffd48 sp=0x7f5eafdffd18 pc=0x471d28
runtime.badmcall(0xc000182fc0?)
/usr/local/go/src/runtime/proc.go:539 +0x1f fp=0x7f5eafdffd68 sp=0x7f5eafdffd48 pc=0x43cb1f
runtime.badmcall(0x44f755)
<autogenerated>:1 +0x1f fp=0x7f5eafdffd80 sp=0x7f5eafdffd68 pc=0x47c35f
runtime: g 0: unexpected return pc for runtime.badmcall called from 0x7f5eafdffdd0
stack: frame={sp:0x7f5eafdffd68, fp:0x7f5eafdffd80} stack=[0x7f5eaf600250,0x7f5eafdffe50)
0x00007f5eafdffc68: 0x0000000000439237 <runtime.startpanic_m+0x0000000000000157> 0x000000000043baf4 <runtime.printstring+0x0000000000000054>
0x00007f5eafdffc78: 0x0000000000795a40 0x000000c000182e00
0x00007f5eafdffc88: 0x00007f5eafdffcc8 0x0000000000438f2b <runtime.fatalthrow.func1+0x000000000000006b>
0x00007f5eafdffc98: 0x000000c000182e00 0x0000000000471d28 <runtime.throw+0x0000000000000048>
0x00007f5eafdffca8: 0x00007f5eafdffd18 0x00007f5eafdffd18
0x00007f5eafdffcb8: 0x0000000000471d28 <runtime.throw+0x0000000000000048> 0x000000c000182e00
0x00007f5eafdffcc8: 0x00007f5eafdffd08 0x0000000000438e98 <runtime.fatalthrow+0x0000000000000058>
0x00007f5eafdffcd8: 0x00007f5eafdffce8 0x000000c000182e00
0x00007f5eafdffce8: 0x0000000000438ec0 <runtime.fatalthrow.func1+0x0000000000000000> 0x000000c000182e00
0x00007f5eafdffcf8: 0x0000000000471d28 <runtime.throw+0x0000000000000048> 0x00007f5eafdffd18
0x00007f5eafdffd08: 0x00007f5eafdffd38 0x0000000000471d28 <runtime.throw+0x0000000000000048>
0x00007f5eafdffd18: 0x00007f5eafdffd20 0x0000000000438a60 <runtime.throw.func1+0x0000000000000000>
0x00007f5eafdffd28: 0x0000000000721602 0x0000000000000024
0x00007f5eafdffd38: 0x00007f5eafdffd58 0x000000000043cb1f <runtime.badmcall+0x000000000000001f>
0x00007f5eafdffd48: 0x0000000000721602 0x000000000044f1ce <runtime.readyWithTime+0x000000000000006e>
0x00007f5eafdffd58: 0x00007f5eafdffd70 0x000000000047c35f <runtime.badmcall+0x000000000000001f>
0x00007f5eafdffd68: <0x000000c000182fc0 0x00007f5eafdffd78
0x00007f5eafdffd78: 0x00007f5eafdffdd0 >0x000000000044f755 <runtime.semrelease1+0x00000000000001b5>
0x00007f5eafdffd88: 0x0000000000971840 0x0000000000441adf <runtime.execute+0x000000000000017f>
0x00007f5eafdffd98: 0x0000000000000000 0x0000000000000000
0x00007f5eafdffda8: 0x0000000000971840 0x000000c0000b26c0
0x00007f5eafdffdb8: 0x0000000000971840 0x0000000000971850
0x00007f5eafdffdc8: 0x000000000096b138 0x00007f5eafdffe10
0x00007f5eafdffdd8: 0x000000000045c275 <runtime.readTrace0+0x00000000000001d5> 0x000000c0001c7180
0x00007f5eafdffde8: 0x0000000100000000 0x0000000000000000
0x00007f5eafdffdf8: 0x00007f5eb7f40000 0x0000000000000000
0x00007f5eafdffe08: 0x000000000096b138 0x00007f5eafdffe30
0x00007f5eafdffe18: 0x000000000045c045 <runtime.ReadTrace.func1+0x0000000000000025> 0x000000c000094f5f
0x00007f5eafdffe28: 0x000000c000094f60 0x000000c000094f28
0x00007f5eafdffe38: 0x0000000000477e0a <runtime.systemstack+0x000000000000004a> 0x0000000000000000
0x00007f5eafdffe48: 0x01ffffffffffff80
goroutine 65 gp=0xc0001c7180 m=8 mp=0xc00008b108 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:479 +0x8 fp=0xc000094f38 sp=0xc000094f28 pc=0x477da8
runtime.ReadTrace()
/usr/local/go/src/runtime/trace.go:753 +0x73 fp=0xc000094fa0 sp=0xc000094f38 pc=0x45bfd3
runtime/trace.Start.func1()
/usr/local/go/src/runtime/trace/trace.go:130 +0x45 fp=0xc000094fe0 sp=0xc000094fa0 pc=0x6901a5
runtime.goexit({})
/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000094fe8 sp=0xc000094fe0 pc=0x479dc1
created by runtime/trace.Start in goroutine 11
/usr/local/go/src/runtime/trace/trace.go:128 +0xcf
ReadTrace switches to the system stack to run readTrace0, which has two semrelease(&trace.doneSema[gen%2]) calls, with semrelease being a trivial wrapper around semrelease1.
Maybe this is the goyield call in semrelease1. It's right after a readyWithTime call, and gated (only) by if s.ticket == 1 && getg().m.locks == 0 {. goyield then does an mcall.
I'm not sure at the moment if systemstack increments mp.locks. If not, this code path could be the problem, and so semrelease1 should have an additional getg() == getg().m.curg gate before calling goyield.
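For orientation, here's a heavily abbreviated paraphrase of that path (simplified from the go1.23 runtime sources; bodies, signatures, and return values are elided, so this is not exact code, see runtime/trace.go, runtime/sema.go, and runtime/proc.go for the real thing):
// Abbreviated paraphrase of the suspected path; not exact runtime code.
func ReadTrace() []byte {
    // ...
    systemstack(func() {
        // readTrace0 runs here, i.e. on the M's g0 stack
    })
    // ...
}
func readTrace0() {
    // ...
    semrelease(&trace.doneSema[gen%2]) // semrelease(addr) is just semrelease1(addr, false, 0)
    // ...
}
func semrelease1(addr *uint32, handoff bool, skipframes int) {
    // ...
    readyWithTime(s, 5+skipframes)
    if s.ticket == 1 && getg().m.locks == 0 {
        // Direct G handoff: yield so the woken G runs immediately.
        goyield() // goyield does mcall(goyield_m); mcall must not run on the g0 stack
    }
}
That appears consistent with the crash dump above, where semrelease1, readTrace0, ReadTrace.func1, and systemstack all show up on the runtime (g0) stack right before badmcall fires.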
Nice catch @rhysh! I can confirm that the patch below turns what was a reliable repro (~always hit within 1,000 runs) into no failures in 10,000 runs. I'll leave it to the runtime experts to decide whether that's the appropriate fix.
diff --git a/src/runtime/sema.go b/src/runtime/sema.go
index f6b1b84f5f..26c33f578c 100644
--- a/src/runtime/sema.go
+++ b/src/runtime/sema.go
@@ -247,7 +247,7 @@ func semrelease1(addr *uint32, handoff bool, skipframes int) {
s.ticket = 1
}
readyWithTime(s, 5+skipframes)
- if s.ticket == 1 && getg().m.locks == 0 {
+ if s.ticket == 1 && getg().m.locks == 0 && getg() == getg().m.curg {
// Direct G handoff
// readyWithTime has added the waiter G as runnext in the
// current P; we now call the scheduler so that we start running
I don't have an explanation for how s.ticket == 1 would be true in the crash report. It looks like we know a few lines up that it's zero, and only becomes non-zero if handoff is true. But readTrace0's calls to semrelease1 are via the semrelease wrapper, which passes false for handoff.
Looks like @mknyszek is on it, I'll leave it there.
I added a bit of logging; it shows that before readyWithTime is called, s.ticket is 0, and after it is 1. :popcorn:
Go version
go version go1.23.3 linux/amd64
Output of go env in your module/workspace:
What did you do?
Run the attached Go program, which does some computation and exposes profiling endpoints.
Fetch the CPU profile, and while it is being captured, fetch an execution trace. Specifically, to programmatically ensure that CPU profile collection has started, hit the profile endpoint twice, expect one request to fail with a 500 error (only one profile can be captured concurrently), and then immediately start capturing the trace (a sketch of this sequence is at the end of this section).
Repeat many times to reproduce the issue.
Attached are the Go program code and a Rust binary that will repeatedly launch this program and hit the profiling endpoints. I tried writing a Go program to fetch the profiles, but it doesn't seem to repro; I suspect the race condition is subtle. The Rust program has a concurrency target of 32, and the problem seems to repro when all cores are busy.
Unpack the zip (repro.zip) and:
$ go build fib.go
$ cargo run test
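For reference, here is a minimal Go sketch of the fetch sequence described above. The actual driver is the attached Rust binary; the address/port, endpoint paths, and durations below are assumptions, and as noted a Go fetcher did not reproduce the crash for me.
package main

import (
    "fmt"
    "io"
    "net/http"
)

// fetch issues a GET, drains the body, and returns the HTTP status code.
func fetch(url string) (int, error) {
    resp, err := http.Get(url)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    _, _ = io.Copy(io.Discard, resp.Body)
    return resp.StatusCode, nil
}

func main() {
    const base = "http://localhost:6060/debug/pprof" // assumed address and port

    // Hit the CPU profile endpoint twice. Only one CPU profile can be
    // captured at a time, so one request blocks for the full duration and
    // the other fails quickly with a 500; seeing the 500 confirms that a
    // profile is being captured right now.
    statuses := make(chan int, 2)
    for i := 0; i < 2; i++ {
        go func() {
            code, err := fetch(base + "/profile?seconds=5")
            if err != nil {
                fmt.Println("profile fetch error:", err)
            }
            statuses <- code
        }()
    }
    if code := <-statuses; code != http.StatusInternalServerError {
        fmt.Println("expected a 500 from the losing profile request, got", code)
    }

    // Immediately start capturing an execution trace while the CPU profile
    // is still running.
    code, err := fetch(base + "/trace?seconds=5")
    fmt.Println("trace fetch:", code, err)
}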
What did you see happen?
Sometimes the Go program ends up crashing.
Output: err.txt
What did you expect to see?
Successfully produced profiles without crashing.