/CC @aclements (from the dying message "invalid p state" and #10240)
I just hit this as well:
go version devel +a4650a2 Mon Apr 11 21:38:18 2016 +0000 linux/amd64
# github.com/juju/juju/provider/joyent
releasep: m=0xc820062000 m->p=0xc82001f500 p->m=859530731520 m->mcache=0x7ff2acede4b0 p->mcache=0x7ff2acede4b0 p->status=3
fatal error: releasep: invalid p state
runtime stack:
runtime.throw(0x12ce3b2, 0x19)
/home/dfc/go/src/runtime/panic.go:566 +0x8b
runtime.releasep(0xc00000001)
/home/dfc/go/src/runtime/proc.go:3365 +0x208
runtime.gcstopm()
/home/dfc/go/src/runtime/proc.go:1760 +0x42
runtime.schedule()
/home/dfc/go/src/runtime/proc.go:2070 +0x105
runtime.park_m(0xc82008ea80)
/home/dfc/go/src/runtime/proc.go:2169 +0x11a
runtime.mcall(0x0)
/home/dfc/go/src/runtime/asm_amd64.s:240 +0x5b
This feels like a recent regression.
I suspect a4650a2111b2bb826ca64a13bdad9c96e3095e47, /cc @josharian
Ugh. Will take a look, although today promises to be busy. How reliably can you reproduce? Any hints for doing so? If you can reproduce it reliably, can you confirm that it was the append changes?
It's not specific to Solaris. It did start right after a4650a2. It's not immediately obvious to me how that would have caused this, but statistically speaking, it's a likely candidate.
$ greplogs -l -md -dashboard -E 'releasep: invalid p state'
2016-04-11T22:35:22-d1feddb/linux-amd64-clang
2016-04-11T23:15:02-501ddf7/linux-amd64-kube
2016-04-12T01:07:46-73a0185/solaris-amd64-smartos
$ greplogs -l -dashboard -E 'releasep: invalid p state' | findflakes -paths
First observed d1feddb 11 Apr 22:35 2016 (11 commits ago)
Last observed 73a0185 12 Apr 01:07 2016 (6 commits ago)
4.7% chance failure is still happening
40% failure probability (3 of 6 commits)
Likely culprits:
40% d1feddb cmd/vet: improve documentation for flags, slightly
24% a4650a2 cmd/compile: avoid write barrier in append fast path
14% 32efa16 cmd/compile: added stats printing to stackalloc
9% 7e40627 cmd/compile: zero all three argstorage slots
5% 7f53391 cmd/compile: fix -N build
No known past failures
I'm puzzled too, although spidey sense says missing VARDEF. When I'm next at a computer (or if one of you beats me to it), we can temporarily disable the in-place append code with a well-placed && false to get this stable again.
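For reference, here is a minimal standalone sketch of the kind of && false guard meant here (the names and the fast path are made up for illustration; this is not cmd/compile code):

```go
package main

// appendInt has a hand-rolled fast path, standing in for the compiler's
// in-place append optimization. Appending "&& false" to the fast-path
// condition disables it without deleting the code, so it can be re-enabled
// with a one-character diff once the bug is understood.
func appendInt(s []int, v int) []int {
	if len(s) < cap(s) { // add "&& false" here to force the slow path
		s = s[:len(s)+1]
		s[len(s)-1] = v
		return s
	}
	return append(s, v) // slow path: let the built-in append grow the slice
}

func main() {
	s := make([]int, 0, 2)
	for i := 0; i < 5; i++ {
		s = appendInt(s, i)
	}
	println(len(s), cap(s), s[4])
}
```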
I can't seem to reproduce on darwin, which is going to make this a bear to track down. Mailed CL 21891 to disable in the meantime.
CL https://golang.org/cl/21891 mentions this issue.
I'll see if I can reproduce on linux/amd64.
That was surprisingly easy. It looks like it was indeed the append optimization. With the append optimization (811ebb6^), I get 3 failures out of 20 runs in stress -p 1 go test -short std. Without the append optimization (811ebb6), I get 0 failures out of 20 runs. I'll let it keep going, but that's not likely to happen by chance.
0 failures out of 100 runs with the append optimization disabled.
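To put numbers on "not likely to happen by chance" (my back-of-the-envelope arithmetic only, assuming independent runs and taking 3/20 as the per-run failure probability with the optimization enabled):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// If each run fails with probability p ≈ 3/20 when the bug is present,
	// the chance of seeing zero failures in N runs purely by luck is (1-p)^N.
	p := 3.0 / 20.0
	fmt.Printf("chance of 0 failures in 20 runs:  %.3f\n", math.Pow(1-p, 20))  // ≈ 0.039
	fmt.Printf("chance of 0 failures in 100 runs: %.2g\n", math.Pow(1-p, 100)) // ≈ 8.7e-08
}
```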
I can't reproduce locally. Maybe my five-year-old laptop is too slow, or maybe it's darwin.
In any case, I'd appreciate it greatly if someone who can get this to fail reliably would try patching in CL 21964. Thanks!
Hmm. 24 "invalid p state" failures out of 1,476 runs with CL 21964. Plus 45 other failures. :(
I'm going to do an overnight run without CL 21964 to see if I can reproduce it without the append optimization in a large number of runs.
Ouch. Thanks so much for doing this, @aclements, much appreciated. I look forward to the results.
Thanks @aclements. I was able to trigger this once, but wasn't able to trigger it again before the CL was rolled back.
@josharian it is possible that the problem doesn't manifest on OS X because of dumb luck and/or inadvertent memory barriers arising from differences in how the runtime interacts with the operating system.
0 "invalid p state" failures out of 2,787 runs without CL 21964 (3f66d8c, the parent of CL 21964). (20 other failures; mostly "bind: address already in use" at dial_test.go:753 in net.)
Thanks, @aclements. I have some new theories, but I clearly need to be able to test on my own now instead of bothering you. :)
I tried to reproduce on GCE with a small machine running jessie; after an overnight run of 218 iterations with CL 21964 patched in, I have no failures. At the rate you reported, I should have seen a handful. I'll leave it running for a while longer, but suggestions/ideas for setting up to reproduce this myself would be welcome.
I am set up to reproduce, but my reproduction rate is not as high as yours, @aclements. If it is easy (and only if it is easy), would you mind patching in CL 22197 and giving it a stress run as well? If not, no prob, I'll leave mine running overnight anyway.
My machine's busy running benchmarks, but I've queued up a stress test of master on it, which should run tonight.
My stress tests look good, and the CL has just been submitted, so I think you can unqueue it. (Or feel free to leave it.) Thanks again for all the fish.
Ran overnight on my machine with 0 failures (other than the flaky net failures) out of ~2300 runs.
On the solaris-amd64-smartos buildbot. See http://build.golang.org/log/3e4aa690c01a4123a508595aa248329b26499b18.
/CC whoever the runtime/GC gatekeepers are (who?). It expected _Prunning but got _Pgcstop.
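For context on the P states mentioned above, here is a rough, self-contained sketch of the invariant that releasep enforces (illustrative only; the real code lives in runtime/proc.go and these shapes are simplified assumptions). If the usual numbering holds, p->status=3 in the crash output would be _Pgcstop:

```go
package main

import "fmt"

// Toy stand-ins for the runtime's P status values (intended to mirror the
// order used in runtime/proc.go, so _Pgcstop comes out as 3, matching
// p->status=3 in the crash output above).
const (
	_Pidle = iota
	_Prunning
	_Psyscall
	_Pgcstop
)

type p struct{ status uint32 }

// releaseP models the check behind "releasep: invalid p state": the P being
// released must still be _Prunning; finding it already in _Pgcstop means
// something else transitioned it out from under the scheduler.
func releaseP(pp *p) {
	if pp.status != _Prunning {
		panic(fmt.Sprintf("releasep: invalid p state: status=%d", pp.status))
	}
	pp.status = _Pidle
}

func main() {
	pp := &p{status: _Pgcstop} // the bad state seen in this issue
	releaseP(pp)               // panics, analogous to the fatal error above
}
```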