Closed fischerman closed 10 months ago
Seems to only affect 1.21.x, I was unable to reproduce it in 1.20 or tip. Reproduced on both darwin/arm64 and linux/amd64
This message in the stacktrace caught my attention: traceback: unexpected SPWRITE function runtime.morestack
cc @golang/compiler
@mauri870 CL https://go.dev/cl/531815 fixes the "unexpected SPWRITE" message. But that is just a message, unrelated to the original bad pointer bug.
That sounded like a red herring, thanks for clarifying it.
The program starts failing since https://go-review.googlesource.com/c/go/+/270940, then "fixed" after https://go-review.googlesource.com/c/go/+/517775
Kindly cc @randall77 and @mdempsky to decide what should we do.
This looks like a bad reordering of a nil pointer check and subsequent pointer arithmetic.
9542a0: e8 5b fe ff ff call 954100 <github.com/fischerman/invalid-stack-pointer.f>
9542a5: 48 89 44 24 30 mov %rax,0x30(%rsp)
f
returns nil, result is written to 0x30(SP)
.
9542aa: 48 8d 05 ef 4b 05 00 lea 0x54bef(%rip),%rax # 9a8ea0 <type:*+0x53ea0>
9542b1: e8 0a 14 af ff call 4456c0 <runtime.newobject>
9542b6: 48 89 44 24 40 mov %rax,0x40(%rsp)
9542bb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
9542c0: e8 fb 52 b5 ff call 4a95c0 <runtime.racewrite>
9542c5: 48 8b 4c 24 40 mov 0x40(%rsp),%rcx
9542ca: 48 c7 41 08 21 00 00 movq $0x21,0x8(%rcx)
9542d1: 00
start a write barrier to initialize new object allocated above. (In non-race mode I think this barrier is not needed, as it is writing a constant string pointer to known-zeroed memory. Alas, I think in race mode the zeroed-ness of the allocation is not detectable.)
9542d2: 83 3d 67 2e 47 00 00 cmpl $0x0,0x472e67(%rip) # dc7140 <runtime.writeBarrier>
9542d9: 74 0d je 9542e8 <github.com/fischerman/invalid-stack-pointer.glob..func1.1+0x68>
9542db: 48 8b 11 mov (%rcx),%rdx
9542de: 66 90 xchg %ax,%ax
9542e0: e8 1b 3b b5 ff call 4a7e00 <runtime.gcWriteBarrier1>
9542e5: 49 89 13 mov %rdx,(%r11)
Write barrier is done here, now to do the actual write. But first there are some other instructions!
It's calculating &g.Listeners
for use by a subsequent raceread
call. I have no idea why these instructions appear here - they should not be before the write associated with the barrier. And in any case, they should be after the nil check below. The bad pointer is in 0x38(SP)
and that's where the stack copier finds it and barfs.
9542e8: 48 8b 54 24 30 mov 0x30(%rsp),%rdx
9542ed: 48 8d 5a 10 lea 0x10(%rdx),%rbx
9542f1: 48 89 5c 24 38 mov %rbx,0x38(%rsp)
These 2 instructions are the actual write.
9542f6: 48 8d 35 f8 43 0c 00 lea 0xc43f8(%rip),%rsi # a186f5 <go:string.*+0xf39d>
9542fd: 48 89 31 mov %rsi,(%rcx)
Some uninteresting stuff follows.
954300: 48 8d 05 21 71 19 00 lea 0x197121(%rip),%rax # aeb428 <go:itab.*errors.errorString,error+0x8>
954307: e8 54 52 b5 ff call 4a9560 <runtime.raceread>
95430c: 48 8b 05 15 71 19 00 mov 0x197115(%rip),%rax # aeb428 <go:itab.*errors.errorString,error+0x8>
954313: 48 8b 5c 24 40 mov 0x40(%rsp),%rbx
954318: 31 c9 xor %ecx,%ecx
95431a: 31 ff xor %edi,%edi
95431c: 48 89 fe mov %rdi,%rsi
95431f: 90 nop
954320: e8 7b fa ff ff call 953da0 <github.com/onsi/gomega.Expect>
954325: 48 8b 48 20 mov 0x20(%rax),%rcx
954329: 48 89 d8 mov %rbx,%rax
95432c: 48 8d 1d 05 9c 19 00 lea 0x199c05(%rip),%rbx # aedf38 <go:itab.*github.com/onsi/gomega/matchers.HaveOccurredMatcher,github.com/onsi/gomega/types.GomegaMatcher>
954333: 31 ff xor %edi,%edi
954335: 31 f6 xor %esi,%esi
954337: 49 89 f0 mov %rsi,%r8
95433a: 48 89 ca mov %rcx,%rdx
95433d: 48 8d 0d fc 27 47 00 lea 0x4727fc(%rip),%rcx # dc6b40 <runtime.zerobase>
954344: ff d2 call *%rdx
The nil pointer check of the result of f
has made it all the way down here.
954346: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
95434b: 84 01 test %al,(%rcx)
Then we actually read g.Listeners
. We only compute &g.Listeners
to pass to raceread
, another reason this only happens in -race
mode.
95434d: 48 8b 44 24 38 mov 0x38(%rsp),%rax
954352: e8 09 52 b5 ff call 4a9560 <runtime.raceread>
954357: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
95435c: 48 8b 59 10 mov 0x10(%rcx),%rbx
Reminds me of #42673, which is what CL 270940 was supposed to fix, not cause. Somehow the nil check just isn't getting the priority it needs to come before the address calculation &g.Listener
.
Here's a reproducer. At least, you can see the problem in assembly. (It will need a wrapper to make it into a test.)
package main
type T struct {
a, b int
}
func f() {
x := p()
gb = gi != 0
q()
g(&x.b)
}
func g(p *int)
func q()
func p() *T
var gi int
var gb bool
When compiled (no -race
required), we get this:
0x000e 00014 (tmp2.go:11) CALL main.p(SB)
0x0013 00019 (tmp2.go:11) MOVQ AX, main.x+8(SP)
0x0018 00024 (tmp2.go:14) LEAQ 8(AX), CX
0x001c 00028 (tmp2.go:14) MOVQ CX, main..autotmp_2+16(SP)
0x0021 00033 (tmp2.go:12) CMPQ main.gi(SB), $0
0x0029 00041 (tmp2.go:12) SETNE main.gb(SB)
0x0030 00048 (tmp2.go:13) CALL main.q(SB)
0x0035 00053 (tmp2.go:14) MOVQ main.x+8(SP), AX
0x003a 00058 (tmp2.go:14) TESTB AL, (AX)
0x003c 00060 (tmp2.go:14) MOVQ main..autotmp_2+16(SP), AX
0x0041 00065 (tmp2.go:14) CALL main.g(SB)
Note the LEAQ
happens before the call to q
and its result is spilled to the stack. The nil check doesn't happen until just before the call to g
.
The trick with the gb = gi != 0
is to introduce a low-priority instruction to the scheduler. Flags-generating instructions are low-priority (we want them to issue as late as possible to minimize the result's lifetime). This causes a priority inversion, as the nil check is dependent on that low-priority instruction, so it gets delayed as well. The LEAQ
is standard priority, so it gets to go first.
@gopherbot Please open a backport issue for 1.21.
Backport issue(s) opened: #63743 (for 1.21).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/537775 mentions this issue: cmd/compile: ensure pointer arithmetic happens after the nil check
Change https://go.dev/cl/538595 mentions this issue: cmd/compile: handle constant pointer offsets in dead store elimination
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes, as of this writing "go1.21.3".
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Run
go test -race .
in this go module.I couldn't reproduce it without dependencies, but the code is quiet small. I'm only "using" Ginkgo and Gomega. Of another dependency I'm just using a type which only ever set to
nil
. I've added some comments.The error only occurs in Go 1.21. Go 1.20 works fine. Also the
-race
flag is required.When I remove dead code the panic doesn't occur.
What did you expect to see?
Ginkgo test results.
What did you see instead?