Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?
This stack looks weird. 0x12eb5 is not even an aligned PC (ARM64 instructions are always 4 bytes). Why would it appear on the stack? 0x3ff93 from another stack trace is similarly weird.
Maybe they are not PCs? Given that a nearby stack word has content 0x00012ee60001ce60, maybe 0x12eb5, 0x12ee6, 0x1ce60 are not PCs but some (32-bit?) integers on stack?
On both stack traces the SIGPROF lands on PC 0x79650. What is that PC? Is it a Go function? Thanks.
> Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?
~~It was. But we backported the fix and it was working fine. The new change for us is turning on not just CPU/mem profiling, but block profiling too.~~
> This stack looks weird. 0x12eb5 is not even an aligned PC (ARM64 instructions are always 4 bytes). Why would it appear on the stack? 0x3ff93 from another stack trace is similarly weird.
> Maybe they are not PCs? Given that a nearby stack word has content 0x00012ee60001ce60, maybe 0x12eb5, 0x12ee6, 0x1ce60 are not PCs but some (32-bit?) integers on stack?
>
> On both stack traces the SIGPROF lands on PC 0x79650. What is that PC? Is it a Go function? Thanks.
$ go tool addr2line retriever
0x79650
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
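Note: `go tool addr2line` reads hex addresses on standard input and prints the matching function and file:line, so the transcripts here interleave each typed address with its resolution; the lookup above is equivalent to:

$ echo 0x79650 | go tool addr2line retriever
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321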
> Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?
It was. But we backported the fix and it was working fine. The new change for us is turning on not just CPU/mem profiling, but block profiling too.
Addendum: I just trawled errors, and indeed, we do have significant numbers of "traceback stuck" dating back multiple months, just in applications that were able to restart after crashing without coming to my attention.
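For context, block and mutex profiling are opt-in at runtime; a minimal sketch of how they are typically enabled (the rates here are illustrative, not necessarily what this application uses):

package main

import "runtime"

func init() {
	// Record every blocking event; larger rates sample more sparsely.
	runtime.SetBlockProfileRate(1)
	// Record all mutex contention; a value n samples 1/n of events.
	runtime.SetMutexProfileFraction(1)
}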
Here are some additional crashes from other applications:
$ go tool addr2line barbet
0x79410
0x79414
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:322
$ go tool addr2line beagle
0x79410
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
$ go tool addr2line shepherd
0x790a0
0x791d0
0x791d4
runtime.systemstack
/usr/local/go/src/runtime/asm_arm64.s:204
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:322
https://github.com/golang/go/blob/go1.18.5/src/runtime/asm_arm64.s#L321 -- culprit lines here.
Given the similarity to https://github.com/golang/go/issues/52829, does this need to not have NOFRAME, i.e. https://go-review.googlesource.com/c/go/+/405482/?
Also repros with go1.19.0
runtime: traceback stuck. pc=0x58e9e0 sp=0x4035cfada0
stack: frame={sp:0x4035cfada0, fp:0x4035cfada0} stack=[0x4035cf8000,0x4035cfc000)
0x0000004035cfaca0: 0x0000004047098bd0 0x000000010001aae8
0x0000004035cfacb0: 0xc90000000001bfb0 0x0000ffff94c3f878
0x0000004035cfacc0: 0xc9c4a878b22454ee 0x0000000000000006
0x0000004035cfacd0: 0x000000000058db7c <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x000000000000004c> 0x000000403cd55520
0x0000004035cface0: 0x000000000058dbf0 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x00000000000000c0> 0x0000004035cfad38
0x0000004035cfacf0: 0x00000040470acb20 0x0000004035cfad58
0x0000004035cfad00: 0x000000000058e980 <github.com/honeycombio/libhoney-go.(*Event).AddField+0x0000000000000110> 0x0000000000fcae00
0x0000004035cfad10: 0x0000004047098bd0 0x00000040469ff6c8
0x0000004035cfad20: 0x0000000000000014 0x0100000000000014
0x0000004035cfad30: 0x00000040470b2248 0x0000000000ebb3a0
0x0000004035cfad40: 0x000000000058dc50 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1+0x0000000000000000> 0x00000040470b2248
0x0000004035cfad50: 0x0000004035cfad40 0x0000004035cfadb8
0x0000004035cfad60: 0x00000000005a5898 <github.com/honeycombio/beeline-go/trace.(*Span).AddField+0x0000000000000158> 0x00000040470b2240
0x0000004035cfad70: 0x00000040469ff6c8 0x0000000000000014
0x0000004035cfad80: 0x0000000000ebb3a0 0x0000000001ed2ac8
0x0000004035cfad90: 0x01000000005a5840 0x0000000000f9a880
0x0000004035cfada0: >0x000000000058e9e0 <github.com/honeycombio/libhoney-go.(*Event).AddField.func1+0x0000000000000000> 0x00000040470b226c
0x0000004035cfadb0: 0x0000004035cfada0 0x0000004035cfae28
0x0000004035cfadc0: 0x00000000005a7a4c <github.com/honeycombio/beeline-go.AddField+0x00000000000000cc> 0x0000ffff94c3f878
0x0000004035cfadd0: 0x00000040469ff6c8 0x0000000001ed2ac8
0x0000004035cfade0: 0x0000000000ebb3a0 0x0000000001ed2ac8
0x0000004035cfadf0: 0x0000004035cfae08 0x0000004035cfae28
0x0000004035cfae00: 0x01000000005a7a00 0x0000000000f9a880
0x0000004035cfae10: 0x00000000005a5900 <github.com/honeycombio/beeline-go/trace.(*Span).AddField.func1+0x0000000000000000> 0x0000004043dc5688
0x0000004035cfae20: 0x0000004035cfae10 0x0000004035cfae88
0x0000004035cfae30: 0x0000000000d79884 <github.com/honeycombio/hound/cmd/shepherd/app.(*App).getSchema+0x00000000000000f4> 0x0000004043dc5600
0x0000004035cfae40: 0x00000040469ff6c8 0x0000000000000014
0x0000004035cfae50: 0x0000000000ebb3a0 0x0000000001ed2ac8
0x0000004035cfae60: 0x000000400015aae0 0x0000000000000014
0x0000004035cfae70: 0x0000000000ebb3a0 0x0000004043dc5600
0x0000004035cfae80: 0x00000040469ff6c8 0x0000004035cfaf48
0x0000004035cfae90: 0x0000000000d755bc <github.com/honeycombio/hound/cmd/shepherd/app.(*App).batchProcess.func1+0x000000000000043c> 0x0000000001491d40
fatal error: traceback stuck
goroutine 0 [idle]:
runtime.throw({0x11b3553?, 0x1ebf180?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x40009ac610 sp=0x40009ac5e0 pc=0x490f0
runtime.gentraceback(0x1f3c7e0?, 0x0?, 0x40009ac9a8?, 0x403cd55520, 0x0, 0x40009ac9a8, 0x40, 0x0, 0x0?, 0x6)
/usr/local/go/src/runtime/traceback.go:488 +0xde4 fp=0x40009ac940 sp=0x40009ac610 pc=0x703f4
runtime.sigprof(0x7abe0, 0x40009acc48?, 0x0?, 0x403cd55520, 0x40009a4000)
/usr/local/go/src/runtime/proc.go:4567 +0xe8 fp=0x40009acbb0 sp=0x40009ac940 pc=0x55618
runtime.sighandler(0x1b?, 0x40009a4000?, 0x40009acc98?, 0x40005e4000?)
/usr/local/go/src/runtime/signal_unix.go:629 +0x574 fp=0x40009acc60 sp=0x40009acbb0 pc=0x60334
runtime.sigtrampgo(0x1b, 0x40009acda0, 0x40009ace20)
/usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x40009acce0 sp=0x40009acc60 pc=0x5f9c4
runtime.sigtramp()
/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x40009acda0 sp=0x40009acce0 pc=0x7e228
goroutine 1503875 [running]:
runtime: traceback stuck. pc=0x58e9e0 sp=0x400ffaeda0
stack: frame={sp:0x400ffaeda0, fp:0x400ffaeda0} stack=[0x400ffac000,0x400ffb0000)
0x000000400ffaeca0: 0x0000004046ce2240 0x000000010001f0b8
0x000000400ffaecb0: 0x7300000001243730 0x0000ffff9d953948
0x000000400ffaecc0: 0xc8f36a7a4ff431e7 0x0000000000000007
0x000000400ffaecd0: 0x000000000058db7c <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x000000000000004c> 0x000000400cfa5520
0x000000400ffaece0: 0x000000000058dc00 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x00000000000000d0> 0x000000400ffaed38
0x000000400ffaecf0: 0x0000000000065708 <runtime.rawstringtmp+0x0000000000000048> 0x000000400ffaed58
0x000000400ffaed00: 0x000000000058e980 <github.com/honeycombio/libhoney-go.(*Event).AddField+0x0000000000000110> 0x0000000000fcae00
0x000000400ffaed10: 0x0000004046ce2240 0x0000004046110d50
0x000000400ffaed20: 0x0000000000000016 0x0100000000000016
0x000000400ffaed30: 0x00000040453eaf68 0x0000000000eccea0
0x000000400ffaed40: 0x000000000058dc50 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1+0x0000000000000000> 0x00000040453eaf68
0x000000400ffaed50: 0x000000400ffaed40 0x000000400ffaedb8
0x000000400ffaed60: 0x00000000005a5898 <github.com/honeycombio/beeline-go/trace.(*Span).AddField+0x0000000000000158> 0x00000040453eaf60
0x000000400ffaed70: 0x0000004046110d50 0x0000000000000016
0x000000400ffaed80: 0x0000000000eccea0 0x000000000147d320
0x000000400ffaed90: 0x01000000005a5840 0x0000000000f9a880
0x000000400ffaeda0: >0x000000000058e9e0 <github.com/honeycombio/libhoney-go.(*Event).AddField.func1+0x0000000000000000> 0x00000040453eaf8c
0x000000400ffaedb0: 0x000000400ffaeda0 0x000000400ffaee28
0x000000400ffaedc0: 0x00000000005a7a4c <github.com/honeycombio/beeline-go.AddField+0x00000000000000cc> 0x0000ffff9d953948
0x000000400ffaedd0: 0x0000004046110d50 0x000000000147d320
0x000000400ffaede0: 0x0000000000eccea0 0x000000000147d320
0x000000400ffaedf0: 0x000000400ffaee08 0x000000400ffaee28
0x000000400ffaee00: 0x01000000005a7a00 0x0000000000f9a880
0x000000400ffaee10: 0x00000000005a5900 <github.com/honeycombio/beeline-go/trace.(*Span).AddField.func1+0x0000000000000000> 0x00000040453c2b88
0x000000400ffaee20: 0x000000400ffaee10 0x000000400ffaee88
0x000000400ffaee30: 0x0000000000d79a6c <github.com/honeycombio/hound/cmd/shepherd/app.(*App).getSchema+0x00000000000002dc> 0x00000040453c2b00
0x000000400ffaee40: 0x0000004046110d50 0x0000000000000016
0x000000400ffaee50: 0x0000000000eccea0 0x000000000147d320
0x000000400ffaee60: 0x00000040006af0b0 0x0000000000000016
0x000000400ffaee70: 0x0000000000eccea0 0x00000040453c2b00
0x000000400ffaee80: 0x0000004046110d50 0x000000400ffaef48
0x000000400ffaee90: 0x0000000000d755bc <github.com/honeycombio/hound/cmd/shepherd/app.(*App).batchProcess.func1+0x000000000000043c> 0x0000000001491d40
fatal error: traceback stuck
goroutine 0 [idle]:
runtime.throw({0x11b3553?, 0x1ebf180?})
/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x400078a610 sp=0x400078a5e0 pc=0x490f0
runtime.gentraceback(0x1f3c7e0?, 0x0?, 0x400078a9a8?, 0x400cfa5520, 0x0, 0x400078a9a8, 0x40, 0x0, 0x0?, 0x6)
/usr/local/go/src/runtime/traceback.go:488 +0xde4 fp=0x400078a940 sp=0x400078a610 pc=0x703f4
runtime.sigprof(0x7abe4, 0x400078ac48?, 0x0?, 0x400cfa5520, 0x4000780000)
/usr/local/go/src/runtime/proc.go:4567 +0xe8 fp=0x400078abb0 sp=0x400078a940 pc=0x55618
runtime.sighandler(0x1b?, 0x4000780000?, 0x400078ac98?, 0x40007821a0?)
/usr/local/go/src/runtime/signal_unix.go:629 +0x574 fp=0x400078ac60 sp=0x400078abb0 pc=0x60334
runtime.sigtrampgo(0x1b, 0x400078ada0, 0x400078ae20)
/usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x400078ace0 sp=0x400078ac60 pc=0x5f9c4
runtime.sigtramp()
/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x400078ada0 sp=0x400078ace0 pc=0x7e228
goroutine 16981043 [running]:
$ go tool addr2line shepherd
0x7abe4
0x7abe0
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:324
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:323
https://github.com/golang/go/blob/go1.19/src/runtime/asm_arm64.s#L323-L324
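For reference, the code at those lines in go1.19 is approximately:

// morestack but not saving ctxt.
TEXT runtime·morestack_noctxt(SB),NOSPLIT|NOFRAME,$0-0
	MOVD	$0, R26
	B	runtime·morestack(SB)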
Thanks for the details @lizthegrey !
> Given the similarity to https://github.com/golang/go/issues/52829, does this need to not have NOFRAME, i.e. https://go-review.googlesource.com/c/go/+/405482/?
I don't think this is necessary. The ARM32 function contains a call in some configurations, whereas the ARM64 function should not contain a call.
So the signal lands at the entry of `morestack_noctxt` or `systemstack`. At this point the stack switch hasn't occurred, so it should be able to traceback normally. I don't immediately see what is wrong. I'll keep looking.
If you need a workaround, you can try commenting out the "traceback stuck" throw. The profile samples' stacks may be inaccurate (which probably was the case before that CL), but it probably won't crash.
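For reference, the check in question looks roughly like this in src/runtime/traceback.go (paraphrased from the CL above, not a verbatim excerpt):

if frame.pc == frame.lr && frame.sp == frame.fp {
	// The next frame is identical to the current one: the unwinder
	// cannot make progress, so dump state and crash rather than loop
	// forever. Commenting out the throw trades the crash for a
	// possibly truncated or inaccurate profile stack.
	print("runtime: traceback stuck. pc=", hex(frame.pc), " sp=", hex(frame.sp), "\n")
	tracebackHexdump(stack, &frame, frame.sp)
	throw("traceback stuck")
}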
Need any additional debugging info? Any way that I can help?
We have special cases for morestack and systemstack here, but I don't believe they are relevant because (a) `funcID_morestack` only applies to `morestack`, not `morestack_noctxt`, and (b) we should only hit these if we already switched stacks, which is not the case here.
@lizthegrey is this readily reproducible (ideally OSS, but I'm also curious if you could manually reproduce)? Or does it only occur in prod?
It reproduces multiple times per day in all of our environments. Sadly I'm not sure I can reduce to a minimal OSS test case.
One thing I'd be happy to do is run a branch build that patches the Go runtime to hang rather than panic when this occurs; then I'd be happy to ssh in and dlv the binary.
Change https://go.dev/cl/424196 mentions this issue: DO NOT SUBMIT: runtime: add dlog to gentraceback
Change https://go.dev/cl/424195 mentions this issue: DO NOT SUBMIT: runtime: dump entire stack
If you are willing to patch the runtime, there are a few changes that would be helpful.
https://go.dev/cl/424195: This just expands the stack dump to include the entire stack, which will help see more context. Since this only runs when crashing, it won't have any impact on normal execution.
https://go.dev/cl/424196: This adds debuglog logs to trace the behavior of gentraceback, which will help us see how it gets into the bad state. debuglog is a low-overhead in-memory ring-buffer log in the runtime, which you must enable with `go build -tags debuglog` [1]. The log is dumped to stderr on crash. Without the build tag, the code is a complete no-op. With the build tag, it logs every execution of `gentraceback`, so there may be a small performance degradation, but I don't think it will be noticeable in real-world applications.
I think the latter is a bit simpler than attaching with a debugger, since it can track iteration state (previous frames) that will be gone once we decide to crash. Plus you don't need to watch for hung tasks and manually inspect them, but that is an option as well if the patch is concerning.
[1] If adding tags to your build is difficult for some reason, you can also swap the build tag lines at the top of debuglog_off.go and debuglog_on.go.
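A sketch of that swap, assuming the files carry the usual paired build constraints (check your Go version's sources):

// src/runtime/debuglog_off.go
//go:build !debuglog

// src/runtime/debuglog_on.go
//go:build debuglog

Exchanging the two constraint lines makes the debuglog implementation build by default, with no -tags needed.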
There's some skew since 1.18.5:
/usr/local/go/src/runtime/traceback.go:47:9: cannot use gp.goid (variable of type int64) as type uint64 in argument to d.u64
I'll fix the patch vs our base go version and try this overnight.
Fresh crash for you with the additional telemetry.
> There's some skew since 1.18.5:
Oops, apologies, I made this CL based on tip rather than 1.18. Did you get this fixed? If not I can provide another on 1.18.
> Fresh crash for you with the additional telemetry.
Did this get built with `-tags debuglog`? The log is missing, though perhaps just not included in your paste. The log will be output after all of the crash stack traces, just before exit.
> Did this get built with `-tags debuglog`? The log is missing, though perhaps just not included in your paste. The log will be output after all of the crash stack traces, just before exit.
I thought it did, but apparently it's missing them. Does `go install` accept tags, or only `go build`?
go install -tags debuglog -ldflags "${LD_FLAGS} \
-X github.com/honeycombio/hound/env.buildID=$BUILD_ID \
-X github.com/honeycombio/hound/env.commitHash=$CIRCLE_SHA1" \
./...
Yes, `go install` accepts tags.
I also see that the stack dump doesn't include the entire stack either (https://go.dev/cl/424195). I'd guess that your build isn't picking up the patches to the runtime at all (e.g., wrong GOROOT), except that it clearly did when you got a build error...
You could double-check the symbols in the binary. If debuglog is enabled, it should have various `dlogger` symbols like `runtime.(*dlogger).end`; if not, those are missing.
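One quick way to check, in the same style as the addr2line transcripts above (binary name illustrative):

$ go tool nm ./shepherd | grep dlogger

Non-empty output means the debuglog-enabled runtime made it into the binary.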
oh, whoops, good catch.
Roles pinned:
beagle: 603443
barbet: 603443
That binary is pinned to an older build. The build instructions were correct; prod just didn't have the correct one. Egg on my face, sorry. I'll re-try and let you know when we have a crash with the correct build.
Thanks. By the way, do you happen to be using musl libc (/Alpine Linux)? (Is this even a cgo binary?)
I'm investigating #54306 in parallel, presumably an issue with musl. It is probably unrelated, but it would help to know if you are using musl.
> Thanks. By the way, do you happen to be using musl libc (/Alpine Linux)? (Is this even a cgo binary?)
> I'm investigating #54306 in parallel, presumably an issue with musl. It is probably unrelated, but it would help to know if you are using musl.
No, we run on Ubuntu 22.04 containers with standard libc `2.35-0ubuntu3.1`. No use of cgo.
Thanks, I'll take a look! Does this one have debuglog output at the end? (Lots of lines mostly containing `gentrace: ...`)
Oh, no, there is nothing with "gentrace" at the end.
Hrm, that is quite frustrating.
What I can tell from the stack dump is that the SP/FP we've landed at is certainly not a valid frame pointer. It is hard to tell, but the stack doesn't look obviously corrupt to me. Perhaps the stack is OK, but we computed the wrong next frame pointer.
I'll see if I can put more information in another patch, even if debuglog fails.
New patches, now based on the 1.18 release branch, so they should apply cleanly. The debuglog patch still needs to be enabled with `-tags debuglog`.
Change https://go.dev/cl/424077 mentions this issue: DO NOT SUBMIT: runtime: add dlog to gentraceback
Change https://go.dev/cl/424076 mentions this issue: DO NOT SUBMIT: runtime: dump entire stack
Go ahead and base on 1.19 going forward, I just upgraded our prod environments...
This patchset worked fine though with offset:
patching file src/runtime/traceback.go
patching file src/runtime/debuglog_off.go
patching file src/runtime/debuglog_on.go
patching file src/runtime/proc.go
Hunk #1 succeeded at 4511 (offset 60 lines).
patching file src/runtime/traceback.go
Still not seeing debuglog in output. Are we terminating before debuglog can be printed?
https://gist.github.com/lizthegrey/58174372484922b47c69c147a4c9f1f8
It's possible, though I'm surprised there is no further indication of crash if true. Do you see "stack trace unavailable" at the end? Curiously, I've been using the same patch to investigate #54306 and it is working fine.
On another note, I've noticed that most of the stacks so far seem to be in code around `github.com/Shopify/sarama.(*RecordBatch).decode` (and various other sarama decode functions), which calls into `github.com/klauspost/compress/zstd` functions. Do you know (from profiles or intuition) if this code path is very hot in this application (so crashing here is likely just coincidence), or fairly rare (so perhaps not a coincidence)? There was one crash above in `github.com/honeycombio/libhoney-go`, so I lean towards coincidence.
@lizthegrey @prattmic thanks for the information! The most recent one doesn't have morestack or systemstack, so it may not be limited to that.
I'm still puzzled by the unaligned-PC-like values, e.g. 0x11b2b, which is `<runtime/internal/atomic.(*Uint8).And+0x000000000000003b>` (it is not only unaligned but also past the end of the function, so execution should never get there).
It may be helpful to print the PCs it has unwound through, i.e. the content of `pcbuf`, so we can see how it gets there. (SPs would also be good, but they are not stored anywhere so it's hard to print. PCs plus the content of the stack is probably enough.)
@lizthegrey do you have any run without profiling turned on, and if so, is there any crash? It looks to me that the content of the stack is weird so even a non-profiling run might crash. Thanks.
> I'm still puzzled by the unaligned-PC-like values, e.g. 0x11b2b

I was confused by those as well. There are some really odd ones like `runtime.debugCallWrap.func1`. My guess was that since the values are fairly small, they are probably just random arguments to these functions and not PCs at all.
> @lizthegrey do you have any run without profiling turned on, and if so, is there any crash? It looks to me that the content of the stack is weird so even a non-profiling run might crash. Thanks.
No profiling = no crash. We've had copies of some of these binaries running without profiling for months, that have not crashed, when we enabled profiling they immediately started crashing about once per day out of 140 processes.
> It's possible, though I'm surprised there is no further indication of crash if true. Do you see "stack trace unavailable" at the end? Curiously, I've been using the same patch to investigate #54306 and it is working fine.
Oddly, no. What's also odd is that the `kubectl log` output stops after a small number of goroutines, rather than printing all the way through. I'll see if our fluentbit exporter got the entire thing and report back.
> On another note, I've noticed that most of the stacks so far seem to be in code around `github.com/Shopify/sarama.(*RecordBatch).decode` (and various other sarama decode functions), which calls into `github.com/klauspost/compress/zstd` functions. Do you know (from profiles or intuition) if this code path is very hot in this application (so crashing here is likely just coincidence), or fairly rare (so perhaps not a coincidence)? There was one crash above in `github.com/honeycombio/libhoney-go`, so I lean towards coincidence.
It's a coincidence, sarama/zstd is about 30-40% of the workload of this app, and libhoney-go is 10%.
> What's also odd is that the kubectl log output stops after a small number of goroutines
Huh. I'd noticed that your logs stop at the second goroutine header and assumed you were just manually truncating. Is that actually where the output stops?
I've updated https://go.dev/cl/424196 to:
- Print debuglog before the panic tracebacks. Maybe this will get it out?
- Remember and dump the first 32 frames we walk (which is primarily what I want out of debuglog).
- Dump the contents of `pcbuf`.

Could you give it a try?
> Huh. I'd noticed that your logs stop at the second goroutine header and assumed you were just manually truncating. Is that actually where the output stops?
It is indeed. On a recent crash, even pulling the logs from fluentbit, they're truncated directly after the second goroutine:
/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x400081cda0 sp=0x400081cce0 pc=0x80258
goroutine 469267 [running]:
and then it stops.
> I've updated go.dev/cl/424196 to:
> - Print debuglog before the panic tracebacks. Maybe this will get it out?
> - Remember and dump the first 32 frames we walk (which is primarily what I want out of debuglog).
> - Dump the contents of `pcbuf`.
>
> Could you give it a try?
Perfect, will give it a try.
Finally got a good dump including debug logs for you. https://gist.github.com/lizthegrey/0ed6b93167f618fa0d09becf975888c4
Awesome, thanks!
Here's what I see:
Caveat: The debuglog is still a bit corrupted, I guess because I printed early while other threads were still writing to it. That's OK though, the frame history has most of what I want.
This trace goes through `libhoney.(*fieldHolder).AddField`. See the source here.
previous frame 0 {pc:0x7cdc0, lr:0x24cb8, sp:0x406e1b24c0, fp:0x406e1b24c0}
previous frame 1 {pc:0x24cb8, lr:0x241a4, sp:0x406e1b24c0, fp:0x406e1b24c0}
I'm not certain what these functions are, as they don't appear in the stack dump.
Based on my own build of 1.19, `mapassign_faststr+0x114` is the return site from `runtime.growWork_faststr`.

Working from there, `0x24cb8` yields the return site from `runtime.morestack_noctxt.abi0` in `runtime.evacuate_faststr`.

So `0x7cdc0` is probably in `morestack`?
previous frame 2 {pc:0x241a4, lr:0x241a4, sp:0x406e1b24c0, fp:0x406e1b2530}
previous frame 3 {pc:0x241a4, lr:0x59244c, sp:0x406e1b2530, fp:0x406e1b25a0}
`0x241a4` is `<runtime.mapassign_faststr+0x0000000000000114>`. At the moment, I am not sure why there is an extra frame with pc == lr. It might be because `0x241a4` is actually an instruction from `runtime.tophash` inlined into `runtime.mapassign_faststr`.
previous frame 4 {pc:0x59244c, lr:0x592520, sp:0x406e1b25a0, fp:0x406e1b2600}
`0x59244c` is `<github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x000000000000004c>`. Based on my build of libhoney, `AddField+0x4c` is the return site from `sync.(*RWMutex).Lock`, not `runtime.mapassign_faststr` (there is also a call to that later in the function, at `+0x90`). This seems wrong...
previous frame 5 {pc:0x592520, lr:0x5932b0, sp:0x406e1b2600, fp:0x406e1b2600}
`0x592520` is `<github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1+0x0000000000000000>`. `AddField.func1` is the wrapper for `defer f.lock.Unlock()`. This doesn't make sense, because the wrapper does not call the main function; plus, `AddField.func1+0x0` is the beginning of the function and definitely not a CALL.
Presumably we get stuck here because the SP delta is 0, as SP hasn't been adjusted at function entry.
(I will note that the instruction just before this is the `JMP github.com/honeycombio/libhoney-go.(*fieldHolder).AddField(SB)` at the end of `AddField`, for restarting after `morestack`. Still not a CALL, but a typical `pc-4` would put us in a different function.)
I see lots of oddities here, though I'm not sure yet exactly where the problem is.
@lizthegrey could you help double-check that my compilation of libhoney matches yours by posting the disassembly of `github.com/honeycombio/libhoney-go.(*fieldHolder).AddField` and `github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1` from the actual binary that crashed? (`go tool objdump <binary>`, look for `TEXT github.com/honeycombio/libhoney-go.(*fieldHolder).AddField`).
Also please send over additional crashes if you get them.
Thanks!
> previous frame 0 {pc:0x7cdc0, lr:0x24cb8, sp:0x406e1b24c0, fp:0x406e1b24c0}
> previous frame 1 {pc:0x24cb8, lr:0x241a4, sp:0x406e1b24c0, fp:0x406e1b24c0}
> previous frame 2 {pc:0x241a4, lr:0x241a4, sp:0x406e1b24c0, fp:0x406e1b2530}
This is weird. It shouldn't be possible for three frames to have the same SP. And frame 2's PC = LR, which also makes no sense. Do you know what functions are at PC 0x7cdc0 and 0x24cb8? Thanks.
It is also interesting that debuglog doesn't seem to include these frames...
Additional crashes all appended to the same secret gist (see traceback{2,3,4,...})
code for Sarama and Libhoney should be available in OSS -- see https://github.com/honeycombio/libhoney-go and https://github.com/shopify/sarama
I'll post the disassembly as seen in our compiled code.
in the binary (shepherd) that generated tracebacks 1-3:
TEXT github.com/honeycombio/libhoney-go.(*Client).AddField(SB) /mnt/ramdisk/go/pkg/mod/github.com/honeycombio/libhoney-go@v1.15.8/client.go
client.go:193 0x5913e0 f9400b90 MOVD 16(R28), R16
client.go:193 0x5913e4 eb3063ff CMP R16, RSP
client.go:193 0x5913e8 54000309 BLS 24(PC)
client.go:193 0x5913ec f81c0ffe MOVD.W R30, -64(RSP)
client.go:193 0x5913f0 f81f83fd MOVD R29, -8(RSP)
client.go:193 0x5913f4 d10023fd SUB $8, RSP, R29
client.go:193 0x5913f8 f9002be1 MOVD R1, 80(RSP)
client.go:193 0x5913fc f90027e0 MOVD R0, 72(RSP)
client.go:193 0x591400 f9002fe2 MOVD R2, 88(RSP)
client.go:193 0x591404 f90033e3 MOVD R3, 96(RSP)
client.go:193 0x591408 f90037e4 MOVD R4, 104(RSP)
client.go:194 0x59140c 97fffe7d CALL github.com/honeycombio/libhoney-go.(*Client).ensureTransmission(SB)
client.go:195 0x591410 f94027e0 MOVD 72(RSP), R0
client.go:195 0x591414 97fffefb CALL github.com/honeycombio/libhoney-go.(*Client).ensureBuilder(SB)
client.go:196 0x591418 f94027e1 MOVD 72(RSP), R1
client.go:196 0x59141c f9401021 MOVD 32(R1), R1
client.go:196 0x591420 3980003b MOVB (R1), R27
client.go:196 0x591424 9100e020 ADD $56, R1, R0
client.go:196 0x591428 f9402be1 MOVD 80(RSP), R1
client.go:196 0x59142c f9402fe2 MOVD 88(RSP), R2
client.go:196 0x591430 f94033e3 MOVD 96(RSP), R3
client.go:196 0x591434 f94037e4 MOVD 104(RSP), R4
client.go:196 0x591438 940003f2 CALL github.com/honeycombio/libhoney-go.(*fieldHolder).AddField(SB)
client.go:197 0x59143c a97ffbfd LDP -8(RSP), (R29, R30)
client.go:197 0x591440 910103ff ADD $64, RSP, RSP
client.go:197 0x591444 d65f03c0 RET
client.go:193 0x591448 f90007e0 MOVD R0, 8(RSP)
client.go:193 0x59144c f9000be1 MOVD R1, 16(RSP)
client.go:193 0x591450 f9000fe2 MOVD R2, 24(RSP)
client.go:193 0x591454 f90013e3 MOVD R3, 32(RSP)
client.go:193 0x591458 f90017e4 MOVD R4, 40(RSP)
client.go:193 0x59145c aa1e03e3 MOVD R30, R3
client.go:193 0x591460 97ebae58 CALL runtime.morestack_noctxt.abi0(SB)
client.go:193 0x591464 f94007e0 MOVD 8(RSP), R0
client.go:193 0x591468 f9400be1 MOVD 16(RSP), R1
client.go:193 0x59146c f9400fe2 MOVD 24(RSP), R2
client.go:193 0x591470 f94013e3 MOVD 32(RSP), R3
client.go:193 0x591474 f94017e4 MOVD 40(RSP), R4
client.go:193 0x591478 17ffffda JMP github.com/honeycombio/libhoney-go.(*Client).AddField(SB)
client.go:193 0x59147c 00000000 ?
TEXT github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1(SB) /mnt/ramdisk/go/pkg/mod/github.com/honeycombio/libhoney-go@v1.15.8/libhoney.go
libhoney.go:567 0x592520 f9400b90 MOVD 16(R28), R16
libhoney.go:567 0x592524 eb3063ff CMP R16, RSP
libhoney.go:567 0x592528 54000169 BLS 11(PC)
libhoney.go:567 0x59252c f81e0ffe MOVD.W R30, -32(RSP)
libhoney.go:567 0x592530 f81f83fd MOVD R29, -8(RSP)
libhoney.go:567 0x592534 d10023fd SUB $8, RSP, R29
libhoney.go:567 0x592538 f9401390 MOVD 32(R28), R16
libhoney.go:567 0x59253c b5000130 CBNZ R16, 9(PC)
libhoney.go:567 0x592540 f9400740 MOVD 8(R26), R0
libhoney.go:567 0x592544 97ebdf5b CALL sync.(*RWMutex).Unlock(SB)
libhoney.go:567 0x592548 a97ffbfd LDP -8(RSP), (R29, R30)
libhoney.go:567 0x59254c 910083ff ADD $32, RSP, RSP
libhoney.go:567 0x592550 d65f03c0 RET
libhoney.go:567 0x592554 aa1e03e3 MOVD R30, R3
libhoney.go:567 0x592558 97eba9fa CALL runtime.morestack.abi0(SB)
libhoney.go:567 0x59255c 17fffff1 JMP github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1(SB)
libhoney.go:567 0x592560 f9400211 MOVD (R16), R17
libhoney.go:567 0x592564 9100a3f4 ADD $40, RSP, R20
libhoney.go:567 0x592568 eb11029f CMP R17, R20
libhoney.go:567 0x59256c 54fffea1 BNE -11(PC)
libhoney.go:567 0x592570 910023f4 ADD $8, RSP, R20
libhoney.go:567 0x592574 f9000214 MOVD R20, (R16)
libhoney.go:567 0x592578 17fffff2 JMP -14(PC)
libhoney.go:567 0x59257c 00000000 ?
traceback 4 is from a different binary, let me know what disasm you need from that.
`0x7cdc0`:

TEXT github.com/gogo/protobuf/proto.makeTimePtrSliceMarshaler.func2(SB) /mnt/ramdisk/go/pkg/mod/github.com/gogo/protobuf@v1.3.2/proto/table_marshal_gogo.go
[...]
value.go:1616 0x7cdbf8 f1005cff CMP $23, R7
value.go:1616 0x7cdbfc 54000061 BNE 3(PC)
value.go:1617 0x7cdc00 f9400427 MOVD 8(R1), R7
table_marshal_gogo.go:257 0x7cdc04 1400000a JMP 10(PC)
value.go:1619 0x7cdc08 97e41bae CALL reflect.Value.lenNonSlice(SB)
table_marshal_gogo.go:258 0x7cdc0c f9404be1 MOVD 144(RSP), R1
`0x24cb8`:

TEXT hash/crc32.MakeTable(SB) /usr/local/go/src/hash/crc32/crc32.go
[...]
crc32_generic.go:29 0x24cb80 14000005 JMP 5(PC)
crc32.go:137 0x24cb84 a97ffbfd LDP -8(RSP), (R29, R30)
crc32.go:137 0x24cb88 9100c3ff ADD $48, RSP, RSP
crc32.go:137 0x24cb8c d65f03c0 RET
I've identified why your build output for `libhoney-go` may be different from mine: this repo is lagging the mainline a bit.
go 1.18
require (
[...]
github.com/honeycombio/libhoney-go v1.15.8
github.com/Shopify/sarama v1.31.0
> TEXT github.com/honeycombio/libhoney-go.(*Client).AddField(SB)

This is `Client.AddField`. Could you post `fieldHolder.AddField`?
Also, for `0x7cdc0`, it looks like you posted `0x7cdc00` instead (off by an order of magnitude). Same with `0x24cb8`.
> So 0x7cdc0 is probably in morestack

If this is in morestack, I think that makes sense. `morestack` itself doesn't have a frame, so that's "frame 0" above. Its caller (`evacuate_faststr`) hasn't created a frame yet, so it also doesn't have a frame, which is "frame 1" above.

Frame 1 should have its return address in `growWork_faststr`, but somehow that is missing, and we skip directly to `mapassign_faststr`?
My guess is that a profiling signal lands in `morestack` but before the stack switch. The jumpstack logic doesn't kick in, as it hasn't switched. At this point PC = morestack, LR = caller of morestack, i.e. `evacuate_faststr`. And it unwinds a frame normally.

At the second frame, nothing sets the LR specially, so at this point https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=228 it just loads the LR from the stack. But `evacuate_faststr` hasn't actually saved its LR on the stack, so it loads the LR of its caller, which is the LR of `growWork_faststr`, which is `mapassign_faststr`. I think that explains the missing frame. I think the logic of the code assumes only the last frame can be frameless (the n==0 check); all following frames have LR saved on the stack, so it assumes it is safe to load. But the morestack-but-haven't-switched case breaks that assumption...
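The load being referenced has roughly this shape (paraphrased from `gentraceback`; the exact conditions differ across Go versions):

if usesLR {
	// On link-register machines, the unwinder reads the saved LR from
	// the stack word at SP. If the current function never actually
	// saved LR there (a frameless caller such as evacuate_faststr
	// here), this picks up the *caller's* saved LR instead, silently
	// skipping a frame.
	frame.lr = *(*uintptr)(unsafe.Pointer(frame.sp))
}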
`morestack` writes SP, so it should have the SPWRITE bit set. If the jump stack logic doesn't kick in, it should stop here: https://cs.opensource.google/go/go/+/master:src/runtime/traceback.go;l=201 . Unfortunate for profiling, but it doesn't crash.

However, `morestack_noctxt` doesn't write SP, so if a signal lands there, it tries to unwind normally... Maybe a stop-gap fix is to mark `morestack_noctxt` SPWRITE...
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes, repros on go1.19.

What operating system and processor architecture are you using (`go env`)?

go env Output

What did you do?
Run continuous profiling via scraping /debug/pprof endpoints including mutex and blocking profiles, while performing a workload consuming from Kafka using Sarama and Klauspost Zstd libraries.
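For completeness, the /debug/pprof endpoints referenced here come from the standard net/http/pprof package; a minimal sketch of how an application typically exposes them for scraping (listen address illustrative):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// A continuous profiler can then periodically fetch
	// /debug/pprof/profile, /debug/pprof/block, /debug/pprof/mutex, etc.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}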
What did you expect to see?
No crashes.
What did you see instead?
The following crashes implicating `unexpected return pc for runtime.sigtramp` and `traceback stuck` (assert added in https://go-review.googlesource.com/c/go/+/400575/ by @cherrymui for #52116, which was backported to go1.18.5):