tamird opened 4 years ago
@hyangah
I agree that at face value this looks like a regression of #35538, likely in http://golang.org/cl/221577.
http://golang.org/cl/221577 was merged into the Fuchsia tree in https://fuchsia-review.googlesource.com/c/third_party/go/+/374435 (April 4).
However, https://fxbug.dev/52575 was filed on May 13 just a few hours after https://fuchsia-review.googlesource.com/c/third_party/go/+/389033 (Go 1.14.3 merge), which IMO strongly implicates something in that merge.
@tamird do you know if the May 13 bug was really the first case of this you saw, or was there perhaps some earlier bug that's not attached?
Yes, there were earlier bugs; sorry to have misled you!
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718 was filed on January 24.
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563 was filed on March 3.
Ah, interesting.
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718
I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563
This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.
> Ah, interesting.
>
> https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718
>
> I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.

TestBlockProfile failed with the debug=1 option. That's not the code path that exercises the proto encoding code for debug>0. I'm not sure why the block profile does not include the expected channel close event, but it is not related to the appendLocsForStack bug.

> https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563
>
> This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.
Change https://golang.org/cl/248728 mentions this issue: runtime/pprof: add logging for debugging issue 40823
With the change above applied, here's what we see:
```
=== RUN TestCPUProfileMultithreaded
stk:# 0xde474a80eb runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
# 0xde474b1f83 runtime._System+0x3 /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/proc.go:3911
l.pcs:# 0xde474a80eb runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
# 0xde474a80e4 runtime.nanotime+0x104 /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/time_nofake.go:19
# 0xde474a8050 runtime.suspendG+0x70 /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/preempt.go:252

panic: runtime error: slice bounds out of range [3:2]

goroutine 6 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x1fd1c0b3340, 0x1fd1c130840, 0x0, 0x8, 0x1fd1c216130, 0x2, 0x2, 0x0, 0x6, 0x8)
	/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:415 +0x640
runtime/pprof.(*profileBuilder).build(0x1fd1c0b3340)
	/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:366 +0xf4
runtime/pprof.profileWriter(0xde4768f8c0, 0x1fd1c10a120)
	/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0xd4
created by runtime/pprof.StartCPUProfile
	/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x114
```
Ah, this is very helpful. It looks like this might be a bug in runtime.expandFinalInlineFrame.
From the PCs, I suspect that nanotime1 and nanotime are inlined into suspendG, so expandFinalInlineFrame should have expanded the stack to include nanotime and suspendG, but it seems that it didn't for some reason.
Could you provide a copy of the crashing binary, or tell me how to build it so I can take a closer look?
So I think the problem here is that expandFinalInlineFrame doesn't account for the magic _System, _GC, etc. frames. It should expand the second frame if one of those appears at the bottom. I'll send out a patch for this.
That said, I don't understand why the traceback would be failing here: https://fuchsia.googlesource.com/third_party/go/+/refs/heads/master/src/runtime/proc.go#4069
Regarding why the traceback is failing in the first place:
Since nanotime1 does call into the VDSO, it is possible there is some interaction with the handling of vdsoPC/vdsoSP, or some weirdness with how profiling signals are delivered on Fuchsia (I haven't looked at that much), though I'll want to take a closer look at the binary.
As an aside, you all will want to port http://golang.org/cl/246763 for Fuchsia. I don't think that is the direct cause of these crashes, but is certainly suspicious.
If you can reproduce this issue, I think it would be valuable to see if making nanotime1 reentrant fixes the issue, to rule that out. I wrote https://fuchsia-review.googlesource.com/c/third_party/go/+/419640, which I believe is what you need, though it is untested since I don't have a Fuchsia environment.
I haven't been able to reproduce locally, but the issue only appears on arm64. The previous file I provided was the amd64 binary; here's the arm64: https://gofile.io/d/xPjFh3.
Reproducing only on ARM64 makes me wonder if this is a memory ordering issue. VDSO calls set `m.vdsoPC` followed by `m.vdsoSP`. This order is important, because if `sigprof` observes that `m.vdsoSP` is set, then it assumes `m.vdsoPC` is valid. On Unix, `sigprof` is running on the same thread as the `m` it is profiling, therefore no memory synchronization is necessary. However, on Fuchsia it is running on a different thread. Thus, given that ARM allows stores to be reordered after stores, `sigprof` may actually observe a non-zero `m.vdsoSP` without a valid `m.vdsoPC`.
Before calling `sigprof`, we suspend the target thread with `zx_task_suspend_token`. Does that call include a complete memory barrier to ensure we see fully-consistent values? The docs don't specify.
On AMD64, stores are ordered, so without a memory barrier we will still see consistent values (though perhaps stale!).
I see that the Windows port has a very similar structure, though there is no Windows ARM64 port, and I don't know if `_SuspendThread` includes a memory barrier either.
cc @aclements @mknyszek to see if this sounds right, plus if you know more about Windows.
@prattmic Your theory seems plausible to me, though I'm still tracking down how a bad `m.vdsoPC` precisely translates into this crash. I suppose what this means is that the `gentraceback` based on `mp.vdsoPC` and `mp.vdsoSP` will fail, we'll fall into the next case (the one you mentioned earlier with `funcPC(_System)`), and that's why `stk` gets populated the way it does.
Oh, I see, in an earlier post you posit that `nanotime` and `nanotime1` are inlined into `suspendG` at the PC we encounter, so after `expandFinalInlineFrame` the `_System` is the extra element causing the crash. Have you confirmed yet that they got inlined? If so, I can see this all connecting together. For this inlined case the profiling code implicitly expects not to find a symbol that's marked as inline if it was marked with the pseudo-symbol `_System`, but because we observe a bad `vdsoPC` we end up falling into the wrong case (actually, any ordering where they're not both observed to be set could lead us to drop down into that case, making this somewhat likelier).
If this is the problem then we need either (1) synchronization of `vdsoPC` and `vdsoSP`, or (2) to make the profiling code able to handle this extra `_System` element in the inlined case (which is what you suggested earlier), though it means some stacks might not be attributed "correctly." Does all this align with your thinking, @prattmic?
Windows is the only platform (AFAIK) we officially support that has the "suspend thread at a distance" behavior; everything else uses signals. We had a `windows-arm` port (not `arm64`, as you point out) going at some point, but the builder was routinely broken.
Yes, `nanotime` and `nanotime1` are inlined in the binary from https://github.com/golang/go/issues/40823#issuecomment-678767293. It looks like:
```
:0  0x7b0d8  390083e1  MOVB R1, 32(RSP)
:0  0x7b0dc  b4000284  CBZ R4, 20(PC)
:0  0x7b0e0  f90027e5  MOVD R5, 72(RSP)
:0  0x7b0e4  d503201f  NOOP
:0  0x7b0e8  9400d2de  CALL syscall/zx.Sys_clock_get_monotonic(SB)
:0  0x7b0ec  f94007e0  MOVD 8(RSP), R0
:0  0x7b0f0  f94027e1  MOVD 72(RSP), R1
:0  0x7b0f4  eb01001f  CMP R1, R0
:0  0x7b0f8  540000ca  BGE 6(PC)
:0  0x7b0fc  d2800140  MOVD $10, R0
:0  0x7b100  b9000be0  MOVW R0, 8(RSP)
:0  0x7b104  9400ad4f  CALL runtime.procyield(SB)
:0  0x7b108  f94027e0  MOVD 72(RSP), R0
:0  0x7b10c  17ffffc8  JMP -56(PC)
:0  0x7b110  9400b0dc  CALL runtime.osyield(SB)
:0  0x7b114  d503201f  NOOP
:0  0x7b118  9400d2d2  CALL syscall/zx.Sys_clock_get_monotonic(SB)
:0  0x7b11c  f94007e0  MOVD 8(RSP), R0
:0  0x7b120  d282711b  MOVD $5000, R27
:0  0x7b124  8b1b0000  ADD R27, R0, R0
:0  0x7b128  17ffffc1  JMP -63(PC)
...
```
Note that this isn't the exact same binary that crashed, but these instructions match up almost perfectly with the stack, so perhaps it is identical.
I think we ultimately need (1) as you suggest. (2) should stop the immediate crashing, but stacks will be misattributed, and I wonder if there are other odd cases we'll hit with odd memory ordering.
We could make these vdso fields explicitly atomic, or perhaps the ARM64 assembly should just manually include `DMB`/`DSB` instructions at the right point?
Of course, this is predicated on the assumption that `zx_task_suspend_token` doesn't include an implicit memory barrier. I briefly looked at the Zircon source and didn't notice any, but I'm far from familiar. @tamird do you know, or know who would be a better authority on this?
w.r.t. Windows, I forgot that Windows has no VDSO and thus doesn't use `vdsoSP`/`vdsoPC` at all! That said, it has the very similar `libcallsp`/`libcallpc`, which may have similar issues.
This makes sense. I believe there are implicit memory barriers in `zx_object_wait_one` (due to locking), which is called to wait for the profiled thread to stop after the `zx_task_suspend_token` but before `sigprof` is called. My familiarity with ARM RMO is poor; wouldn't we also need a barrier between the stores to `vdsoPC` and `vdsoSP` to guarantee remote stores are seen in order? I think sequential consistency is easier to reason about; my vote would be to make these atomic, given that those APIs are generally expected to make that guarantee.
Thanks everyone for the investigations, and for the patience. Punting to Go1.17.
We stopped seeing this for a while, but having just upgraded to go1.16rc1, it's back.
EDIT: correction, we stopped seeing it because we started skipping the previous occurrence. It seems that we now have a new occurrence.
Ouch, sorry about that @tamird, and thank you for the report. I shall be reinstating this as a release blocker for Go 1.16.
I don't think this needs to block the Go 1.16 release. Given that Go 1.16 RC 1 is already out and the bar for fixes is very high, it doesn't seem that this can make it to Go 1.16. I'll reset the milestone to Backlog, and it can be updated when the timing of the fix is more clear.
The fence was added, and we're still seeing this come up in the same context.
I was never able to repro this on a 128-core, multi-node Cavium system with innumerable iterations over several days, even with cpusets that should've guaranteed at least half of the threads to run on a different node (and thus increase the likelihood of a memory ordering issue).
I'm not happy blaming this on qemu without actually understanding the problem, but I'm wondering if this is a similar issue to #9605. These ARM tests do run in a modern descendant of the qemu-based emulator that was being used in that issue, but that issue suggested this was flaky on real hardware, too. However, I have a hard time reconciling not being able to reproduce the issue once (in who-knows-how-many thousands of runs) on real hardware with it happening quite regularly (~hundreds or thousands of runs) in a qemuish arm64 env.
The benchmark that was producing this failure has been disabled until yesterday; having enabled it, we're now seeing this even on fuchsia-x64; e.g. this build.
```
stk:# 0x179ab65affb runtime.(*gcBits).bytep+0x13b /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1938
# 0x179ab66cbe0 runtime._ExternalCode+0x0 /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/proc.go:4668
l.pcs:# 0x179ab65affb runtime.(*gcBits).bytep+0x13b /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1938
# 0x179ab65afe3 runtime.(*gcBits).bitp+0x123 /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1944
# 0x179ab65affc runtime.(*mspan).markBitsForIndex+0x13c /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mbitmap.go:257
# 0x179ab65afe2 runtime.wbBufFlush1+0x122 /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mwbbuf.go:266

panic: runtime error: slice bounds out of range [4:2]

goroutine 34 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x5dac058000, {0x5dac102300, 0x0, 0x20}, {0x5dac49a450, 0x2, 0x2})
	/b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:414 +0x893
runtime/pprof.(*profileBuilder).build(0x5dac058000)
	/b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:365 +0x2c5
runtime/pprof.profileWriter({0x179aba3c500, 0x5dac108008})
	/b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0x125
created by runtime/pprof.StartCPUProfile
	/b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x171
```
The problem here is two-fold (as previously partially discussed in https://github.com/golang/go/issues/40823#issuecomment-679311736):

1. `expandFinalInlineFrame` fails to expand because the truncated stack is followed by `_ExternalCode`, which it doesn't handle. `expandFinalInlineFrame` could be adjusted to handle this case.
2. It is not clear why we get `_ExternalCode` here at all. This is clearly in Go code. `_ExternalCode` comes from `sigprofNonGoPC`, which Fuchsia calls when profiling at https://fuchsia.googlesource.com/third_party/go/+/refs/heads/main/src/runtime/os_fuchsia.go#642. AFAICT, the logic here is wrong: `mp.curg` is not always set. Fuchsia should almost certainly more closely match the Windows logic, which handles more cases: https://cs.opensource.google/go/go/+/master:src/runtime/os_windows.go;l=1221-1223. I'll send a CL for this.

I've mailed (untested!) https://fuchsia-review.googlesource.com/c/third_party/go/+/629462, which I hope fixes this.
We've been seeing this issue for a while; filing this bug now that we've upgraded to go1.15. (See https://fxbug.dev/52575).
This happens sometimes when running the runtime/pprof tests:
There's no diff between Fuchsia's fork and go1.15 in src/runtime/pprof. See https://fuchsia.googlesource.com/third_party/go/+diff/go1.15..master.
Possibly a regression of #35538, related to #37446 and its fix in https://golang.org/cl/221577.
@prattmic