golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.97k stars 17.67k forks source link

runtime: fatal error: releasep: invalid p state #15246

Closed mikioh closed 8 years ago

mikioh commented 8 years ago

On the solaris-amd64-smartos buildbot. See http://build.golang.org/log/3e4aa690c01a4123a508595aa248329b26499b18.

ok      log 0.018s
ok      log/syslog  2.047s
ok      math    0.029s
releasep: m=0xc82044a000 m->p=0xc820022000 p->m=859534827520 m->mcache=0xfffffd7fff131a30 p->mcache=0xfffffd7fff131a30 p->status=3
fatal error: releasep: invalid p state

runtime stack:
runtime.throw(0x634c35, 0x19)
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/panic.go:566 +0x8b
runtime.releasep(0x2000)
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/proc.go:3365 +0x208
runtime.gcstopm()
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/proc.go:1760 +0x42
runtime.schedule()
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/proc.go:2070 +0x105
runtime.park_m(0xc82044d380)
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/proc.go:2169 +0x11a
runtime.mcall(0x0)
    /root/work/solaris-amd64-smartos-73a0185ad390/go/src/runtime/asm_amd64.s:240 +0x64
(snip)

/CC looks like runtime/GC gatekeepers (who?). It expected _Prunning but got _Pgcstop.

mikioh commented 8 years ago

/CC @aclements (from the dying message invalid p state and #10240)

davecheney commented 8 years ago

I just hit this as well

go version devel +a4650a2 Mon Apr 11 21:38:18 2016 +0000 linux/amd64

# github.com/juju/juju/provider/joyent
releasep: m=0xc820062000 m->p=0xc82001f500 p->m=859530731520 m->mcache=0x7ff2acede4b0 p->mcache=0x7ff2acede4b0 p->status=3
fatal error: releasep: invalid p state

runtime stack:
runtime.throw(0x12ce3b2, 0x19)
        /home/dfc/go/src/runtime/panic.go:566 +0x8b
runtime.releasep(0xc00000001)
        /home/dfc/go/src/runtime/proc.go:3365 +0x208
runtime.gcstopm()
        /home/dfc/go/src/runtime/proc.go:1760 +0x42
runtime.schedule()
        /home/dfc/go/src/runtime/proc.go:2070 +0x105
runtime.park_m(0xc82008ea80)
        /home/dfc/go/src/runtime/proc.go:2169 +0x11a
runtime.mcall(0x0)
        /home/dfc/go/src/runtime/asm_amd64.s:240 +0x5b

This feels like a recent regression.

davecheney commented 8 years ago

I suspect a4650a2111b2bb826ca64a13bdad9c96e3095e47, /cc @josharian

josharian commented 8 years ago

Ugh. Will take a look, although today promises to be busy. How reliably can you reproduce? Any hints for doing so? If reliably, can you confirm that it was the append changes?

aclements commented 8 years ago

It's not specific to Solaris. It did start right after a4650a2. It's not immediately obvious to me how that would have caused this, but statistically speaking, it's a likely candidate.

$ greplogs -l -md -dashboard -E 'releasep: invalid p state' 2016-04-11T22:35:22-d1feddb/linux-amd64-clang 2016-04-11T23:15:02-501ddf7/linux-amd64-kube 2016-04-12T01:07:46-73a0185/solaris-amd64-smartos

$ greplogs -l -dashboard -E 'releasep: invalid p state' | findflakes -paths
First observed d1feddb 11 Apr 22:35 2016 (11 commits ago)
Last observed  73a0185 12 Apr 01:07 2016 (6 commits ago)
4.7% chance failure is still happening
40% failure probability (3 of 6 commits)
Likely culprits:
   40% d1feddb cmd/vet: improve documentation for flags, slightly
   24% a4650a2 cmd/compile: avoid write barrier in append fast path
   14% 32efa16 cmd/compile: added stats printing to stackalloc
    9% 7e40627 cmd/compile: zero all three argstorage slots
    5% 7f53391 cmd/compile: fix -N build
No known past failures
josharian commented 8 years ago

I'm puzzled too, although spidey sense says missing VARDEF. When I'm at a computer (or if one of you beats me to it) next, we can temporarily disable the inplace append code with a well-placed && false to get this stable again.

josharian commented 8 years ago

I can't seem to reproduce on darwin, which is going to make this a bear to track down. Mailed CL 21891 to disable in the meantime.

gopherbot commented 8 years ago

CL https://golang.org/cl/21891 mentions this issue.

aclements commented 8 years ago

I'll see if I can reproduce on linux/amd64.

aclements commented 8 years ago

That was surprisingly easy. It looks like it was indeed the append optimization. With the append optimization (811ebb6^) I get 3 failures out of 20 runs in stress -p 1 go test -short std. Without the append optimization (811ebb6), I get 0 failures out of 20 runs. I'll let it keep going, but that's not likely to happen by chance.

aclements commented 8 years ago

0 failures out of 100 runs with the append optimization disabled.

josharian commented 8 years ago

I can't reproduce locally. Maybe my five year old laptop is too slow, or maybe it's darwin.

In any case, I'd appreciated it greatly if someone who can get this to fail reliably for them would try patching in CL 21964. Thanks!

aclements commented 8 years ago

Hmm. 24 "invalid p state" failures out of 1,476 runs with CL 21964. Plus 45 other failures. :(

I'm going to do an overnight run without CL 21964 to see if I can reproduce it without the append optimization in a large number of runs.

josharian commented 8 years ago

Ouch. Thanks so much for doing this, @aclements, much appreciated. I look forward to the results.

davecheney commented 8 years ago

Thanks @aclements I was able to trigger this once, but haven't been able to trigger it again before the CL was rolled back.

davecheney commented 8 years ago

@josharian it is possible that the problem doesn't manifest on osx because of dumb luck and or inadvertent memory barriers because of differences in other runtime interactions with the operating system.

aclements commented 8 years ago

0 "invalid p state" failures out of 2,787 runs without CL 21964 (3f66d8c, the parent of CL 21964). (20 other failures; mostly "bind: address already in use" at dial_test.go:753 in net.)

josharian commented 8 years ago

Thanks, @aclements. I have some new theories, but I clearly need to be able to test on my own now instead of bothering you. :)

I tried to reproduce on GCE with a small machine running jessie; after an overnight run of 218 with CL 21964, I have no failures. At the rate you reported, I should have seen a handful. I'll leave it running for a while longer, but suggestions/ideas for setting up to reproduce myself would be welcome.

josharian commented 8 years ago

I am set up to reproduce, but my reproduction rate is not as high as yours, @aclements. If it is easy (and only if it is easy), would you mind patching in CL 22197 and giving it a stress run as well? If not, no prob, I'll leave mind running overnight anyway.

aclements commented 8 years ago

My machine's busy running benchmarks, but I've queued up a stress test of master on it, which should run tonight.

josharian commented 8 years ago

My stress tests look good, and the CL has just been submitted, so I think you can unqueue it. (Or feel free to leave it.) Thanks again for all the fish.

On Tue, Apr 19, 2016 at 2:31 PM, Austin Clements notifications@github.com wrote:

My machine's busy running benchmarks, but I've queued up a stress test of master on it, which should run tonight.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/golang/go/issues/15246#issuecomment-212138035

aclements commented 8 years ago

Ran overnight on my machine with 0 failures (other than the flaky net failures) out of ~2300 runs.