golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime/pprof: panic in appendLocsForStack on fuchsia-arm64 #40823

Open tamird opened 4 years ago

tamird commented 4 years ago

We've been seeing this issue for a while; filing this bug now that we've upgraded to go1.15. (See https://fxbug.dev/52575).

This happens sometimes when running the runtime/pprof tests:

panic: runtime error: slice bounds out of range [3:2]

goroutine 6 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x3e93c15c160, 0x3e93c130840, 0x0, 0x8, 0x3e93c0560b0, 0x2, 0x2, 0x0, 0x2, 0x8) 
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:410 +0x5d4
runtime/pprof.(*profileBuilder).build(0x3e93c15c160)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:365 +0xf4
runtime/pprof.profileWriter(0xb4050418c0, 0x3e93c386030)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0xd4
created by runtime/pprof.StartCPUProfile
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x114

There's no diff between Fuchsia's fork and go1.15 in src/runtime/pprof. See https://fuchsia.googlesource.com/third_party/go/+diff/go1.15..master.

Possibly a regression of #35538, related to #37446 and its fix in https://golang.org/cl/221577.

@prattmic

randall77 commented 4 years ago

@hyangah

prattmic commented 4 years ago

I agree that at face value this looks like a regression of #35538, likely in http://golang.org/cl/221577.

http://golang.org/cl/221577 was merged into the Fuchsia tree in https://fuchsia-review.googlesource.com/c/third_party/go/+/374435 (April 4).

However, https://fxbug.dev/52575 was filed on May 13 just a few hours after https://fuchsia-review.googlesource.com/c/third_party/go/+/389033 (Go 1.14.3 merge), which IMO strongly implicates something in that merge.

@tamird do you know if the May 13 bug was really the first case of this you saw, or was there perhaps some earlier bug that's not attached?

tamird commented 4 years ago

Yes, there were earlier bugs; sorry to have misled you!

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718 was filed on January 24.
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563 was filed on March 3.

prattmic commented 4 years ago

Ah, interesting.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718

I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563

This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.

hyangah commented 4 years ago

> Ah, interesting.
>
> https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718
>
> I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.

TestBlockProfile failed with the debug=1 option. That's not the code path that exercises the proto encoding: profiles written with debug>0 use the legacy text format, so appendLocsForStack isn't involved. I'm not sure why the block profile does not include the expected channel close event, but it's not related to the appendLocsForStack bug.
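
For reference, here's a minimal sketch of the two output paths (the blocking work is just an example; WriteTo's debug argument is what selects the format):

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "runtime"
        "runtime/pprof"
        "time"
    )

    func main() {
        runtime.SetBlockProfileRate(1) // record every blocking event

        // Produce one blocking event for the profile to contain.
        c := make(chan int)
        go func() { time.Sleep(time.Millisecond); c <- 1 }()
        <-c

        p := pprof.Lookup("block")

        // debug=1: human-readable text output; this path never builds
        // protobuf locations, so appendLocsForStack is not involved.
        p.WriteTo(os.Stdout, 1)

        // debug=0: gzip-compressed protobuf; this path goes through the
        // profileBuilder (and appendLocsForStack).
        var buf bytes.Buffer
        p.WriteTo(&buf, 0)
        fmt.Println("proto profile bytes:", buf.Len())
    }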

> https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563
>
> This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.

gopherbot commented 4 years ago

Change https://golang.org/cl/248728 mentions this issue: runtime/pprof: add logging for debugging issue 40823

tamird commented 4 years ago

With the change above applied, here's what we see:

=== RUN   TestCPUProfileMultithreaded
stk:#   0xde474a80eb    runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
#       0xde474b1f83    runtime._System+0x3     /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/proc.go:3911
l.pcs:# 0xde474a80eb    runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
#       0xde474a80e4    runtime.nanotime+0x104  /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/time_nofake.go:19
#       0xde474a8050    runtime.suspendG+0x70   /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/preempt.go:252
panic: runtime error: slice bounds out of range [3:2]

goroutine 6 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x1fd1c0b3340, 0x1fd1c130840, 0x0, 0x8, 0x1fd1c216130, 0x2, 0x2, 0x0, 0x6, 0x8) 
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:415 +0x640
runtime/pprof.(*profileBuilder).build(0x1fd1c0b3340)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:366 +0xf4
runtime/pprof.profileWriter(0xde4768f8c0, 0x1fd1c10a120)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0xd4
created by runtime/pprof.StartCPUProfile
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x114

prattmic commented 4 years ago

Ah, this is very helpful. It looks like this might be a bug in runtime.expandFinalInlineFrame.

From the PCs, I suspect that nanotime1 and nanotime are inlined into suspendG, so expandFinalInlineFrame should have expanded the stack to include nanotime and suspendG, but it seems that it didn't for some reason.
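
For what it's worth, the frame counts in that dump line up exactly with the panic bounds. A minimal standalone sketch of the arithmetic (the PC values are copied from the log; this is not the real proto.go code):

    package main

    import "fmt"

    func main() {
        // The two frames recorded for this sample: nanotime1 plus the
        // _System pseudo-frame.
        stk := []uintptr{0xde474a80eb, 0xde474b1f83}

        // The cached location for stk[0], built from an earlier, fully
        // expanded sample: nanotime1, nanotime, suspendG.
        lpcs := []uintptr{0xde474a80eb, 0xde474a80e4, 0xde474a8050}

        // appendLocsForStack consumes a cache hit by slicing off
        // len(l.pcs) entries; with only two entries left this is
        // stk[3:], i.e. "slice bounds out of range [3:2]".
        defer func() { fmt.Println("recovered:", recover()) }()
        _ = stk[len(lpcs):]
    }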

Could you provide a copy of the crashing binary, or tell me how to build it so I can take a closer look?

prattmic commented 4 years ago

So I think the problem here is that expandFinalInlineFrame doesn't account for the magic _System, _GC, etc. frames. It should expand the second frame if one of those appears at the bottom. I'll send out a patch for this.

That said, I don't understand why the traceback would be failing here: https://fuchsia.googlesource.com/third_party/go/+/refs/heads/master/src/runtime/proc.go#4069

prattmic commented 4 years ago

Regarding why the traceback is failing in the first place:

Since nanotime1 does call into the VDSO, it is possible there is some interaction with the handling of vdsoPC/vdsoSP, or some weirdness with how profiling signals are delivered on Fuchsia (I haven't looked at that much), though I'll want to take a closer look at the binary.

As an aside, you all will want to port http://golang.org/cl/246763 for Fuchsia. I don't think that is the direct cause of these crashes, but it is certainly suspicious.

tamird commented 4 years ago

https://gofile.io/d/hlPlJh

prattmic commented 4 years ago

If you can reproduce this issue, I think it would be valuable to see if making nanotime1 reentrant fixes the issue, to rule that out. I wrote https://fuchsia-review.googlesource.com/c/third_party/go/+/419640, which I believe is what you need, though it is untested since I don't have a Fuchsia environment.

tamird commented 4 years ago

I haven't been able to reproduce locally, but the issue only appears on arm64. The previous file I provided was the amd64 binary; here's the arm64: https://gofile.io/d/xPjFh3.

prattmic commented 4 years ago

Reproducing only on ARM64 makes me wonder if this is a memory ordering issue.

VDSO calls set m.vdsoPC followed by m.vdsoSP. This order is important, because if sigprof observes that m.vdsoSP is set, then it assumes m.vdsoPC is valid. On Unix, sigprof is running on the same thread as the m it is profiling, therefore no memory synchronization is necessary.

However, on Fuchsia it is running on a different thread. Thus, given that ARM allows stores to be reordered with other stores, sigprof may actually observe a non-zero m.vdsoSP without a valid m.vdsoPC.

Before calling sigprof, we suspend the target thread with zx_task_suspend_token. Does that call include a complete memory barrier to ensure we see fully-consistent values? The docs don't specify.

On AMD64, stores are ordered, so without a memory barrier we will still see consistent values (though perhaps stale!).

I see that the Windows port has a very similar structure, though there is no Windows ARM64 port, and I don't know if _SuspendThread includes a memory barrier either.

cc @aclements @mknyszek to see if this sounds right, plus if you know more about Windows.

mknyszek commented 4 years ago

@prattmic Your theory seems plausible to me, though I'm still tracking down how "bad m.vdsoPC" precisely translates into this crash. I suppose what this means is that the gentraceback based on mp.vdsoPC and mp.vdsoSP will fail, we'll fall into the next case (the one you mentioned earlier with funcPC(_System)), and that's why stk gets populated the way it does.

Oh, I see: in an earlier post you posit that nanotime and nanotime1 are inlined into suspendG at the PC we encounter, so after expandFinalInlineFrame the _System frame is the extra element causing the crash. Have you confirmed yet that they got inlined? If so, I can see this all connecting together. In this inlined case, the profiling code implicitly expects not to find a symbol that's marked as inlined when the stack was marked with the pseudo-symbol _System, but because we observe a bad vdsoPC we end up falling into the wrong case. (Actually, any ordering in which the two fields aren't both observed to be set could drop us into that case, which makes this somewhat likelier.)

If this is the problem, then we need either (1) synchronization of vdsoPC and vdsoSP, or (2) to make the profiling code able to handle this extra _System element in the inlined case (which is what you suggested earlier), though that means some stacks might not be attributed "correctly." Does all this align with your thinking @prattmic?
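
To be concrete about (2), I mean something along these lines in the cache-hit path (just a standalone sketch of the tolerant behavior, not a real patch):

    package main

    import "fmt"

    // consumeCachedLoc sketches option (2): when a cached location's pcs
    // cover more frames than remain in stk (the truncated-stack case
    // above), consume what is there instead of panicking. The sample may
    // then be attributed to a shallower stack, but the profile still builds.
    func consumeCachedLoc(stk, cachedPCs []uintptr) []uintptr {
        n := len(cachedPCs)
        if n > len(stk) {
            n = len(stk)
        }
        return stk[n:]
    }

    func main() {
        stk := []uintptr{0x1, 0x2}         // truncated: 2 frames left
        cached := []uintptr{0x1, 0x3, 0x4} // cached location spans 3 pcs
        fmt.Println("remaining:", len(consumeCachedLoc(stk, cached))) // 0
    }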

Windows is the only platform (AFAIK) we officially support that has the "suspend thread at a distance" behavior; everything else uses signals. We had a windows-arm port (not arm64, as you point out) going at some point, but the builder was routinely broken.

prattmic commented 4 years ago

Yes, nanotime and nanotime1 are inlined in the binary from https://github.com/golang/go/issues/40823#issuecomment-678767293. It looks like:

  :0                    0x7b0d8                 390083e1                MOVB R1, 32(RSP)                                
  :0                    0x7b0dc                 b4000284                CBZ R4, 20(PC)                                  
  :0                    0x7b0e0                 f90027e5                MOVD R5, 72(RSP)                                
  :0                    0x7b0e4                 d503201f                NOOP                                            
  :0                    0x7b0e8                 9400d2de                CALL syscall/zx.Sys_clock_get_monotonic(SB)     
  :0                    0x7b0ec                 f94007e0                MOVD 8(RSP), R0                                 
  :0                    0x7b0f0                 f94027e1                MOVD 72(RSP), R1                                
  :0                    0x7b0f4                 eb01001f                CMP R1, R0                                      
  :0                    0x7b0f8                 540000ca                BGE 6(PC)                                       
  :0                    0x7b0fc                 d2800140                MOVD $10, R0                                    
  :0                    0x7b100                 b9000be0                MOVW R0, 8(RSP)                                 
  :0                    0x7b104                 9400ad4f                CALL runtime.procyield(SB)                      
  :0                    0x7b108                 f94027e0                MOVD 72(RSP), R0                                
  :0                    0x7b10c                 17ffffc8                JMP -56(PC)                                     
  :0                    0x7b110                 9400b0dc                CALL runtime.osyield(SB)                        
  :0                    0x7b114                 d503201f                NOOP                                            
  :0                    0x7b118                 9400d2d2                CALL syscall/zx.Sys_clock_get_monotonic(SB)     
  :0                    0x7b11c                 f94007e0                MOVD 8(RSP), R0                                 
  :0                    0x7b120                 d282711b                MOVD $5000, R27                                 
  :0                    0x7b124                 8b1b0000                ADD R27, R0, R0                                 
  :0                    0x7b128                 17ffffc1                JMP -63(PC)                                     
...

Note that this isn't the exact same binary that crashed, but these instructions match up almost perfectly with the stack, so perhaps it is identical.

I think we ultimately need (1), as you suggest. (2) should stop the immediate crashing, but stacks will be misattributed, and I wonder what other odd cases we'll hit because of the memory ordering.

We could make these vdso fields explicitly atomic, or perhaps the ARM64 assembly should just manually include DMB/DSB instructions at the right point?
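
To illustrate the atomics option, this is roughly the publication pattern I have in mind, written as a standalone sketch with stand-in variables rather than the runtime's actual fields (Go's sync/atomic operations are at least as strong as the release/acquire ordering needed here):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // Stand-ins for m.vdsoPC and m.vdsoSP.
    var vdsoPC, vdsoSP uintptr

    // enterVDSO publishes PC before SP, so a profiler on another thread
    // that observes a non-zero SP also observes a valid PC (the ordering
    // guarantee that plain stores do not give us on ARM64).
    func enterVDSO(pc, sp uintptr) {
        atomic.StoreUintptr(&vdsoPC, pc)
        atomic.StoreUintptr(&vdsoSP, sp)
    }

    // exitVDSO clears SP first so a sampler never sees SP set alongside
    // a stale PC.
    func exitVDSO() {
        atomic.StoreUintptr(&vdsoSP, 0)
        atomic.StoreUintptr(&vdsoPC, 0)
    }

    // sample is what a remote profiler would do after suspending the
    // target thread: check SP first, and only then trust PC.
    func sample() (pc, sp uintptr, ok bool) {
        if sp = atomic.LoadUintptr(&vdsoSP); sp == 0 {
            return 0, 0, false
        }
        return atomic.LoadUintptr(&vdsoPC), sp, true
    }

    func main() {
        enterVDSO(0x40c0de, 0x7fff0000)
        if pc, sp, ok := sample(); ok {
            fmt.Printf("pc=%#x sp=%#x\n", pc, sp)
        }
        exitVDSO()
    }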

Of course, this is predicated on the assumption that zx_task_suspend_token doesn't include an implicit memory barrier. I briefly looked at the Zircon source and didn't notice any, but I'm far from familiar. @tamird do you know, or know who would be a better authority on this?

prattmic commented 4 years ago

w.r.t. Windows, I forgot that Windows has no VDSO and thus doesn't use vdsoSP/vdsoPC at all! That said, it has very similar libcallsp/libcallpc, which may have similar issues.

dhobsd commented 4 years ago

This makes sense. I believe there are implicit memory barriers in zx_object_wait_one (due to locking), which is called to wait for the profiled thread to stop after the zx_task_suspend_token call but before sigprof is called. My familiarity with ARM RMO is poor; wouldn't we also need a barrier between the stores to vdsoPC and vdsoSP to guarantee that remote stores are seen in order? I think sequential consistency is easier to reason about; my vote would be to make these atomic, given that those APIs are generally expected to make that guarantee.

odeke-em commented 3 years ago

Thanks everyone for the investigations, and for the patience. Punting to Go1.17.

tamird commented 3 years ago

We stopped seeing this for a while, but having just upgraded to go1.16rc1, it's back.

test.log

EDIT: correction, we stopped seeing it because we started skipping the previous occurrence. It seems that we now have a new occurrence.

odeke-em commented 3 years ago

Ouch, sorry about that @tamird, and thank you for the report. I shall reinstate this as a release blocker for Go1.16.

dmitshur commented 3 years ago

I don't think this needs to block the Go 1.16 release. Given that Go 1.16 RC 1 is already out and the bar for fixes is very high, it doesn't seem that this can make it to Go 1.16. I'll reset the milestone to Backlog, and it can be updated when the timing of the fix is more clear.

dhobsd commented 3 years ago

The fence was added, and we're still seeing this come up in the same context.

I was never able to repro this on a 128-core, multi-node Cavium system with innumerable iterations over several days, even with cpusets that should've guaranteed at least half of the threads to run on a different node (and thus increase the likelihood of a memory ordering issue).

I'm not happy blaming this on qemu without actually understanding the problem, but I'm wondering if this is a similar issue to #9605. These ARM tests do run in a modern descendant of the qemu-based emulator that was being used in that issue, but that issue suggested the failure was flaky on real hardware, too. However, I have a hard time reconciling not being able to reproduce the issue even once (in who-knows-how-many thousands of runs) on real hardware with it happening quite regularly (every few hundred to few thousand runs) in a QEMU-ish arm64 environment.

tamird commented 2 years ago

The benchmark that was producing this failure had been disabled until yesterday; having re-enabled it, we're now seeing this even on fuchsia-x64; see e.g. this build.

stk:#   0x179ab65affb   runtime.(*gcBits).bytep+0x13b   /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1938
#       0x179ab66cbe0   runtime._ExternalCode+0x0       /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/proc.go:4668
l.pcs:# 0x179ab65affb   runtime.(*gcBits).bytep+0x13b   /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1938
#       0x179ab65afe3   runtime.(*gcBits).bitp+0x123    /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mheap.go:1944
#       0x179ab65affc   runtime.(*mspan).markBitsForIndex+0x13c /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mbitmap.go:257
#       0x179ab65afe2   runtime.wbBufFlush1+0x122       /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/mwbbuf.go:266
panic: runtime error: slice bounds out of range [4:2]
goroutine 34 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x5dac058000, {0x5dac102300, 0x0, 0x20}, {0x5dac49a450, 0x2, 0x2})
        /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:414 +0x893
runtime/pprof.(*profileBuilder).build(0x5dac058000)
        /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:365 +0x2c5
runtime/pprof.profileWriter({0x179aba3c500, 0x5dac108008})
        /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0x125
created by runtime/pprof.StartCPUProfile
        /b/s/w/ir/cache/incremental/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x171

prattmic commented 2 years ago

The problem here is two-fold (as previously partially discussed in https://github.com/golang/go/issues/40823#issuecomment-679311736):

  1. We have a stack with missing inline frames which expandFinalInlineFrame fails to expand because the truncated stack is followed by _ExternalCode, which it doesn't handle. expandFinalInlineFrame could be adjusted to handle this case.
  2. We really shouldn't be getting _ExternalCode here at all. This is clearly in Go code. _ExternalCode comes from sigprofNonGoPC, which Fuchsia calls when profiling at https://fuchsia.googlesource.com/third_party/go/+/refs/heads/main/src/runtime/os_fuchsia.go#642. AFAICT, the logic there is wrong: mp.curg is not always set. Fuchsia should almost certainly more closely match the Windows logic, which handles more cases: https://cs.opensource.google/go/go/+/master:src/runtime/os_windows.go;l=1221-1223. I'll send a CL for this; a rough sketch of the idea is below.
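
Roughly, the idea is to resolve the goroutine from the sampled SP instead of assuming mp.curg, along these lines (simplified stand-in types, not the runtime's; the real Windows code handles more cases than this):

    package main

    import "fmt"

    // Simplified stand-ins for the runtime's stack bounds, g, and m.
    type stack struct{ lo, hi uintptr }
    type g struct {
        name  string
        stack stack
    }
    type m struct{ g0, gsignal, curg *g }

    // gForSP returns the goroutine whose stack contains the sampled SP,
    // or nil if the sample is not on any known Go stack (in which case
    // the profiler should record it as external/system code rather than
    // blindly attributing it to curg).
    func gForSP(mp *m, sp uintptr) *g {
        for _, gp := range []*g{mp.gsignal, mp.g0, mp.curg} {
            if gp != nil && gp.stack.lo < sp && sp < gp.stack.hi {
                return gp
            }
        }
        return nil
    }

    func main() {
        mp := &m{
            g0:   &g{name: "g0", stack: stack{0x1000, 0x2000}},
            curg: &g{name: "curg", stack: stack{0x3000, 0x4000}},
        }
        fmt.Println(gForSP(mp, 0x3800).name) // curg: the sample landed on the user stack
    }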

prattmic commented 2 years ago

I've mailed (untested!) https://fuchsia-review.googlesource.com/c/third_party/go/+/629462, which I hope fixes this.