golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: rare stack corruption on Go 1.20 and later #64781

Open klauspost opened 6 months ago

klauspost commented 6 months ago

Go version

Go 1.20.x and later

What operating system and processor architecture are you using (go env)?

Reported on linux/amd64 systems. Reported on both Go 1.20.x and 1.21.x builds.

We have limited access to customer systems.

What did you do?

We have at MinIO been experiencing runtime crashes since the release of Go 1.20

The issues only appear to occur for a very small number of our customers, and supplying them with a Go 1.19.x compiled binary always solves the issue.

The issue appears slightly different each time, but all crashes indicate some sort of corruption. Since none of them contained any "smoking gun", I held off on submitting an issue.

Here are some samples of customer crashes (click to expand):

```
May 8 04:11:36 framin107 minio[1082653]: fatal error: unexpected signal during runtime execution
May 8 04:11:36 framin107 minio[1082653]: panic during panic
May 8 04:11:36 framin107 minio[1082653]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
May 8 04:11:36 framin107 minio[1082653]: runtime stack:
May 8 04:11:36 framin107 minio[1082653]: runtime.throw({0x2b40d78?, 0x0?})
May 8 04:11:36 framin107 minio[1082653]: #011runtime/panic.go:1047 +0x5d fp=0xc0102f1ed8 sp=0xc0102f1ea8 pc=0x43933d
May 8 04:11:36 framin107 minio[1082653]: runtime: g 0: unexpected return pc for runtime.sigpanic called from 0x0
```

Same hardware:

```
May 10 10:17:18 framin107 minio[1097631]: fatal error: bulkBarrierPreWrite: unaligned arguments
May 10 10:17:18 framin107 minio[1097631]: unexpected fault address 0x0
May 10 10:17:18 framin107 minio[1097631]: fatal error: fault
May 10 10:17:18 framin107 minio[1097631]: goroutine 50957332 [running]:
May 10 10:17:18 framin107 minio[1097631]: runtime: g 50957332: unexpected return pc for runtime.throw called from 0x0
May 10 10:17:18 framin107 minio[1097631]: stack: frame={sp:0xc01034b1b8, fp:0xc01034b1e8} stack=[0xc01034a000,0xc01034c000)
```

```
May 10 03:53:32 framin107 minio[1090219]: fatal error: index out of range
May 10 03:53:32 framin107 minio[1090219]: fatal error: index out of range
May 10 03:53:32 framin107 minio[1090219]: fatal error: index out of range
May 10 03:53:32 framin107 minio[1090219]: fatal: bad g in signal handler
May 10 03:53:33 framin107 systemd[1]: minio.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
```

Second customer:

```
Nov 27 02:36:00 minio minio[240609]: fatal error: unexpected signal during runtime execution
Nov 27 02:36:00 minio minio[240609]: unexpected fault address 0x0
Nov 27 02:36:00 minio minio[240609]: fatal error: fault
Nov 27 02:36:00 minio minio[240609]: fatal error: unexpected signal during runtime execution
Nov 27 02:36:00 minio minio[240609]: fatal error: bulkBarrierPreWrite: unaligned arguments
Nov 27 02:36:00 minio minio[240609]: runtime: pointer 0xc034ec65b0 to unallocated span span.base()=0xc034ec6000 span.limit=0xc034ec7e40 span.state=0
Nov 27 02:36:00 minio minio[240609]: runtime: found in object at *(0xc03939c580+0x28)
Nov 27 02:36:00 minio minio[240609]: object=0xc03939c580 s.base()=0xc03939c000 s.limit=0xc03939de40 s.spanclass=58 s.elemsize=704 s.state=mSpanInUse
Nov 27 02:36:00 minio minio[240609]: *(object+0) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+8) = 0xc01ec44a80
Nov 27 02:36:00 minio minio[240609]: *(object+16) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+24) = 0xc027728000
Nov 27 02:36:00 minio minio[240609]: *(object+32) = 0x732280
Nov 27 02:36:00 minio minio[240609]: *(object+40) = 0xc034ec65b0 <==
Nov 27 02:36:00 minio minio[240609]: *(object+48) = 0xc03e43b7a0
Nov 27 02:36:00 minio minio[240609]: *(object+56) = 0xc022b526c0
Nov 27 02:36:00 minio minio[240609]: *(object+64) = 0x4b9e4f8
Nov 27 02:36:00 minio minio[240609]: *(object+72) = 0xc02f21a2c0
Nov 27 02:36:00 minio minio[240609]: *(object+80) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+88) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+96) = 0x23eafe0
Nov 27 02:36:00 minio minio[240609]: *(object+104) = 0xc035cb8c14
Nov 27 02:36:00 minio minio[240609]: *(object+112) = 0xc00147c000
Nov 27 02:36:00 minio minio[240609]: *(object+120) = 0xc02f21a340
Nov 27 02:36:00 minio minio[240609]: *(object+128) = 0x4b9df08
Nov 27 02:36:00 minio minio[240609]: *(object+136) = 0x62aa480
Nov 27 02:36:00 minio minio[240609]: *(object+144) = 0x1
Nov 27 02:36:00 minio minio[240609]: *(object+152) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+160) = 0xc035c885b8
Nov 27 02:36:00 minio minio[240609]: *(object+168) = 0x4
Nov 27 02:36:00 minio minio[240609]: *(object+176) = 0xc035c885bd
Nov 27 02:36:00 minio minio[240609]: *(object+184) = 0xd
Nov 27 02:36:00 minio minio[240609]: *(object+192) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+200) = 0xc01ec44a80
Nov 27 02:36:00 minio minio[240609]: *(object+208) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+216) = 0xc0064520e0
Nov 27 02:36:00 minio minio[240609]: *(object+224) = 0xc030055712
Nov 27 02:36:00 minio minio[240609]: *(object+232) = 0x16
Nov 27 02:36:00 minio minio[240609]: *(object+240) = 0xc035c885b8
Nov 27 02:36:00 minio minio[240609]: *(object+248) = 0x12
Nov 27 02:36:00 minio minio[240609]: *(object+256) = 0x23eafe0
Nov 27 02:36:00 minio minio[240609]: *(object+264) = 0xc035cb8e40
Nov 27 02:36:00 minio minio[240609]: *(object+272) = 0xc00137bb80
Nov 27 02:36:00 minio minio[240609]: *(object+280) = 0xc02f21a480
Nov 27 02:36:00 minio minio[240609]: *(object+288) = 0xc035c88618
Nov 27 02:36:00 minio minio[240609]: *(object+296) = 0x4
Nov 27 02:36:00 minio minio[240609]: *(object+304) = 0xc035c8861d
Nov 27 02:36:00 minio minio[240609]: *(object+312) = 0xd
Nov 27 02:36:00 minio minio[240609]: *(object+320) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+328) = 0xc01ec44a80
Nov 27 02:36:00 minio minio[240609]: *(object+336) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+344) = 0xc0064520e0
Nov 27 02:36:00 minio minio[240609]: *(object+352) = 0xc0300557d3
Nov 27 02:36:00 minio minio[240609]: *(object+360) = 0x15
Nov 27 02:36:00 minio minio[240609]: *(object+368) = 0xc035c88618
Nov 27 02:36:00 minio minio[240609]: *(object+376) = 0x12
Nov 27 02:36:00 minio minio[240609]: *(object+384) = 0x23eafe0
Nov 27 02:36:00 minio minio[240609]: *(object+392) = 0xc035cb8f30
Nov 27 02:36:00 minio minio[240609]: *(object+400) = 0xc001e26780
Nov 27 02:36:00 minio minio[240609]: *(object+408) = 0xc02f21a580
Nov 27 02:36:00 minio minio[240609]: *(object+416) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+424) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+432) = 0xc035ce9320
Nov 27 02:36:00 minio minio[240609]: *(object+440) = 0xc035ce9330
Nov 27 02:36:00 minio minio[240609]: *(object+448) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+456) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+464) = 0xc035ce93f0
Nov 27 02:36:00 minio minio[240609]: *(object+472) = 0xc035ce9400
Nov 27 02:36:00 minio minio[240609]: *(object+480) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+488) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+496) = 0xc035ce94c0
Nov 27 02:36:00 minio minio[240609]: *(object+504) = 0xc035ce94d0
Nov 27 02:36:00 minio minio[240609]: *(object+512) = 0x4b9df08
Nov 27 02:36:00 minio minio[240609]: *(object+520) = 0x62aa480
Nov 27 02:36:00 minio minio[240609]: *(object+528) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+536) = 0x0
Nov 27 02:36:00 minio minio[240609]: *(object+544) = 0xc035c887f8
Nov 27 02:36:00 minio minio[240609]: *(object+552) = 0x4
Nov 27 02:36:00 minio minio[240609]: *(object+560) = 0xc035c887fd
Nov 27 02:36:00 minio minio[240609]: *(object+568) = 0xd
Nov 27 02:36:00 minio minio[240609]: *(object+576) = 0xc035cb90f0
Nov 27 02:36:00 minio minio[240609]: *(object+584) = 0xa
Nov 27 02:36:00 minio minio[240609]: *(object+592) = 0xc035cb90fb
Nov 27 02:36:00 minio minio[240609]: *(object+600) = 0x3
Nov 27 02:36:00 minio minio[240609]: *(object+608) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+616) = 0xc01ec44a80
Nov 27 02:36:00 minio minio[240609]: *(object+624) = 0x4baf5a0
Nov 27 02:36:00 minio minio[240609]: *(object+632) = 0xc027728000
Nov 27 02:36:00 minio minio[240609]: *(object+640) = 0x23ae600
Nov 27 02:36:00 minio minio[240609]: *(object+648) = 0x4ba47f0
Nov 27 02:36:00 minio minio[240609]: *(object+656) = 0xc034ecedc0
Nov 27 02:36:00 minio minio[240609]: *(object+664) = 0xc018072b90
Nov 27 02:36:00 minio minio[240609]: *(object+672) = 0x23eafe0
Nov 27 02:36:00 minio minio[240609]: *(object+680) = 0xc035cb9150
Nov 27 02:36:00 minio minio[240609]: *(object+688) = 0xc001599180
Nov 27 02:36:00 minio minio[240609]: *(object+696) = 0xc02f21a800
Nov 27 02:36:00 minio minio[240609]: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
```

Third customer:

```
Aug 10 12:02:06 minio minio[13162]: fatal error: unexpected signal during runtime execution
Aug 10 12:02:06 minio minio[13162]: unexpected fault address 0x0
Aug 10 12:02:06 minio minio[13162]: fatal error: fault
Aug 10 12:02:06 minio minio[13162]: runtime: pointer 0xc0228c4a20 to unused region of span span.base()=0xc0141cc000 span.limit=0xc014
Aug 10 12:02:06 minio minio[13162]: runtime: found in object at *(0xc020089980+0x10)
Aug 10 12:02:06 minio minio[13162]: object=0xc020089980 s.base()=0xc020088000 s.limit=0xc020089fe0 s.spanclass=16 s.elemsize=96 s.sta
Aug 10 12:02:06 minio minio[13162]: *(object+0) = 0x2b3b90b
Aug 10 12:02:06 minio minio[13162]: *(object+8) = 0x4
Aug 10 12:02:06 minio minio[13162]: *(object+16) = 0xc0228c4a20 <==
Aug 10 12:02:06 minio minio[13162]: *(object+24) = 0x8b
Aug 10 12:02:06 minio minio[13162]: *(object+32) = 0x4d3c400
Aug 10 12:02:06 minio minio[13162]: *(object+40) = 0x61953a0
Aug 10 12:02:06 minio minio[13162]: *(object+48) = 0x2309720
Aug 10 12:02:06 minio minio[13162]: *(object+56) = 0xc12d4b44a6b3921a
Aug 10 12:02:06 minio minio[13162]: *(object+64) = 0xafc4c36fd829a
Aug 10 12:02:06 minio minio[13162]: *(object+72) = 0x6301700
Aug 10 12:02:06 minio minio[13162]: *(object+80) = 0x6380620
Aug 10 12:02:06 minio minio[13162]: *(object+88) = 0x5
Aug 10 12:02:06 minio minio[13162]: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
```

It does however seem like the "Thanos" project has supplied what looks to be a smoking gun (the issue contains additional traces).

So the crash occurs in a goroutine that compresses a block.

The only really interesting thing going on is that it calls assembly to compress the block. Looking at the assembly caller, it is pretty straightforward. This is the disassembly for the caller:

github.com/klauspost/compress/s2.encodeBlockBetter STEXT size=465 args=0x30 locals=0x40 funcid=0x0 align=0x0
    0x0000 00000 (/s2/encode_amd64.go:52)   TEXT    github.com/klauspost/compress/s2.encodeBlockBetter(SB), ABIInternal, $64-48
    0x0000 00000 (/s2/encode_amd64.go:52)   CMPQ    SP, 16(R14)
    0x0004 00004 (/s2/encode_amd64.go:52)   PCDATA  $0, $-2
    0x0004 00004 (/s2/encode_amd64.go:52)   JLS 395
    0x000a 00010 (/s2/encode_amd64.go:52)   PCDATA  $0, $-1
    0x000a 00010 (/s2/encode_amd64.go:52)   PUSHQ   BP
    0x000b 00011 (/s2/encode_amd64.go:52)   MOVQ    SP, BP
    0x000e 00014 (/s2/encode_amd64.go:52)   SUBQ    $56, SP
    0x0012 00018 (/s2/encode_amd64.go:52)   MOVQ    AX, github.com/klauspost/compress/s2.dst+72(FP)
    0x0017 00023 (/s2/encode_amd64.go:52)   MOVQ    DI, github.com/klauspost/compress/s2.src+96(FP)
    0x001c 00028 (/s2/encode_amd64.go:52)   FUNCDATA    $0, gclocals·cNGUyZq94N9QFR70tEjj5A==(SB)
    0x001c 00028 (/s2/encode_amd64.go:52)   FUNCDATA    $1, gclocals·J5F+7Qw7O7ve2QcWC7DpeQ==(SB)
    0x001c 00028 (/s2/encode_amd64.go:52)   FUNCDATA    $5, github.com/klauspost/compress/s2.encodeBlockBetter.arginfo1(SB)
    0x001c 00028 (/s2/encode_amd64.go:52)   FUNCDATA    $6, github.com/klauspost/compress/s2.encodeBlockBetter.argliveinfo(SB)
    0x001c 00028 (/s2/encode_amd64.go:52)   PCDATA  $3, $1
    0x001c 00028 (/s2/encode_amd64.go:52)   NOP
    0x0020 00032 (/s2/encode_amd64.go:62)   CMPQ    SI, $4194304
    0x0027 00039 (/s2/encode_amd64.go:62)   JGT 337
[...]
    0x0151 00337 (/s2/encode_amd64.go:63)   MOVQ    AX, (SP)
    0x0155 00341 (/s2/encode_amd64.go:63)   MOVQ    BX, 8(SP)
    0x015a 00346 (/s2/encode_amd64.go:63)   MOVQ    CX, 16(SP)
    0x015f 00351 (/s2/encode_amd64.go:63)   MOVQ    DI, 24(SP)
    0x0164 00356 (/s2/encode_amd64.go:63)   MOVQ    SI, 32(SP)
    0x0169 00361 (/s2/encode_amd64.go:63)   MOVQ    R8, 40(SP)
    0x016e 00366 (/s2/encode_amd64.go:63)   CALL    github.com/klauspost/compress/s2.encodeBetterBlockAsm(SB)
    0x0173 00371 (/s2/encode_amd64.go:63)   XORPS   X15, X15
    0x0177 00375 (/s2/encode_amd64.go:63)   MOVQ    (TLS), R14
    0x0180 00384 (/s2/encode_amd64.go:63)   MOVQ    48(SP), AX
    0x0185 00389 (/s2/encode_amd64.go:63)   ADDQ    $56, SP
    0x0189 00393 (/s2/encode_amd64.go:63)   POPQ    BP
    0x018a 00394 (/s2/encode_amd64.go:63)   RET
[...]
    0x018b 00395 (/s2/encode_amd64.go:52)   MOVQ    AX, 8(SP)
    0x0190 00400 (/s2/encode_amd64.go:52)   MOVQ    BX, 16(SP)
    0x0195 00405 (/s2/encode_amd64.go:52)   MOVQ    CX, 24(SP)
    0x019a 00410 (/s2/encode_amd64.go:52)   MOVQ    DI, 32(SP)
    0x019f 00415 (/s2/encode_amd64.go:52)   MOVQ    SI, 40(SP)
    0x01a4 00420 (/s2/encode_amd64.go:52)   MOVQ    R8, 48(SP)
    0x01a9 00425 (/s2/encode_amd64.go:52)   CALL    runtime.morestack_noctxt(SB)
    0x01ae 00430 (/s2/encode_amd64.go:52)   MOVQ    8(SP), AX
    0x01b3 00435 (/s2/encode_amd64.go:52)   MOVQ    16(SP), BX
    0x01b8 00440 (/s2/encode_amd64.go:52)   MOVQ    24(SP), CX
    0x01bd 00445 (/s2/encode_amd64.go:52)   MOVQ    32(SP), DI
    0x01c2 00450 (/s2/encode_amd64.go:52)   MOVQ    40(SP), SI
    0x01c7 00455 (/s2/encode_amd64.go:52)   MOVQ    48(SP), R8
    0x01cc 00460 (/s2/encode_amd64.go:52)   PCDATA  $0, $-1
    0x01cc 00460 (/s2/encode_amd64.go:52)   JMP 0

This is the signature of the called assembly function:

// func encodeBetterBlockAsm(dst []byte, src []byte) int
// Requires: BMI, SSE2
TEXT ·encodeBetterBlockAsm(SB), $589848-56

... and the function definition:

//go:noescape
func encodeBetterBlockAsm(dst []byte, src []byte) int

The reason I included the stack size check is that the function uses quite a bit of stack, so there is a real chance that runtime.morestack_noctxt is called.

The stack is used for a dynamic lookup table. I am fairly sure there are no writes outside the stack, and I am also pretty confident there are no writes outside the provided slices (that would likely produce different errors).

I do not use the BP register, so it is not clobbered, and only SSE2 registers are used, so there is no VZEROUPPER weirdness. The stack is managed by avo, so it is less likely that there is a bug there.

So my questions are:

A) Am I doing something obviously wrong?
B) What would be a typical reason for this error to show up?
C) This seems related to GC, so is there a window where the goroutine could be preempted in an unsafe state?
D) Are there any Go 1.20 changes that seem likely to be triggering this?

Keep in mind that this doesn't appear to happen on many machines. Talos reported that it seems to happen more often when a lot of memory is allocated.

I will of course assist with any information that may be needed - but I feel at this point I need some pointers from people with a deeper understanding of the runtime to get much further.

Also note that we have tested CPU+RAM extensively on some of these customer systems, since hardware seemed like a possibility at first. The crashes could in principle be completely unrelated - but the coincidence seems too big.

What did you expect to see?

No crash

What did you see instead?

Rare, random runtime crashes

Edit: The assembly shown is now Linux-compiled.

mauri870 commented 6 months ago

cc @golang/runtime

thepudds commented 6 months ago

Hi @klauspost, others will likely have more useful comments, but two initial questions that spring to mind — are you able to gather exact kernel versions for different machines where this is observed, and is there an opportunity to try some triage debug settings (for example, along the lines of those outlined in https://github.com/golang/go/issues/58212#issuecomment-1412216775)?

jnyi commented 6 months ago

Hi Go team,

Thanks for looking into this. I've been able to reproduce it (rarely); here is the kernel in use right now:

/ $ uname -r
5.4.0-1113-aws-fips
/ $ cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.15.11
PRETTY_NAME="Alpine Linux v3.15"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
/ $ cat /proc/version
Linux version 5.4.0-1113-aws-fips (buildd@lcy02-amd64-015) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)) #123+fips1-Ubuntu SMP Thu Oct 19 16:21:22 UTC 2023

klauspost commented 6 months ago

Customer 1: kernelVersion: 4.14.111-1.el7.centos.x86_64 - centos/7.6.1810
Customer 3: kernelVersion: 4.18.0-425.13.1.el8_7.x86_64 - rhel/8.7

Don't have the information for the final customer.

CGO disabled for build FWIW.

Updated the disassembly; it was accidentally Windows-compiled. It now shows Linux.

dctrwatson commented 6 months ago

Kernel: 5.15.0-1051-aws #56~20.04.1-Ubuntu SMP Tue Nov 28 15:43:31 UTC 2023 x86_64 GNU/Linux Go: go1.21.3

egonelbre commented 6 months ago

Two places where mistakes could creep in:

One thing I did encounter was that races aren't instrumented in assembly, so if you have a data race, that may cause some pain. Annotate your assembly (or anything in stdlib that might call assembly) with runtime.RaceReadRange or runtime.RaceWriteRange. Alternatively, replace the code with equivalent Go code.

The other is that assembly code might write outside of its bounds -- in which case, write exhaustive tests across a range of data sizes and offsets, with memory-protected byte slices.

klauspost commented 6 months ago

@egonelbre As I wrote, I am pretty sure this isn't a buffer write issue. I have fuzz tests in place that already check for out-of-buffer writes: they allocate a bigger buffer, write random data there, and then truncate the capacity of the buffer given to compression as the destination. After the compression they check whether the data after the buffer has been touched.

Since it (to me at least) appears to be a stack issue, I don't think races are involved, since stacks are of course never shared. I do run all tests through the race detector as part of CI. However, instrumenting sounds like a good idea anyway - do you have any docs/examples? Google doesn't really turn up anything useful.

egonelbre commented 6 months ago

Here's a quick demo https://github.com/egonelbre/compress/commit/c7df670870dc509127572281d79075a7faac8f13. Instrumenting around buffers and pools is also a good idea.

dctrwatson commented 6 months ago

I added GODEBUG="gccheckmark=1,gcshrinkstackoff=1,asyncpreemptoff=1" and we have not had a panic in >24h. We used to see at least a couple per hour.

thepudds commented 6 months ago

@dctrwatson, that's great, thanks for trying that! What do you think about trying those one at a time, maybe in the reverse order they are listed in https://github.com/golang/go/issues/58212#issuecomment-1412216775?
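For reference, trying the flags one at a time might look like this (the binary name is illustrative):

```
# Each run isolates one GODEBUG knob:
GODEBUG=asyncpreemptoff=1 ./app    # disable asynchronous preemption
GODEBUG=gcshrinkstackoff=1 ./app   # disable stack shrinking
GODEBUG=gccheckmark=1 ./app        # enable GC checkmark verification
```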

dctrwatson commented 6 months ago

GODEBUG="asyncpreemptoff=1" resulted in multiple panics of:

fatal error: workbuf is empty
fatal: bad g in signal handler

Switching to next flag

klauspost commented 6 months ago

So looking at the generated assembly, one thing that sticks out is that SP and BP are kept as-is, even though compiled functions generate:

        PUSHQ   BP
        MOVQ    SP, BP
        SUBQ    $56, SP
...
        ADDQ    $56, SP
        POPQ    BP
        RET

Since I don't know the details of GC preemption, it is hard for me to tell whether this is a problem with my code or the runtime.

cherrymui commented 6 months ago

So looking at the generated assembly, one thing that sticks out is that SP and BP are kept as-is

What generated assembly, in which function? Thanks.

klauspost commented 6 months ago

@cherrymui The avo generated assembly - example - meaning the assembly being called and presumably pre-empted.

cherrymui commented 6 months ago

Oh, that is fine. The Go assembler itself will insert prologue and epilogue to adjust SP and BP, as long as the frame size (589848 in TEXT ·encodeBetterBlockAsm(SB), $589848-56) is declared correctly. These instructions don't need to (and should not) be written out in assembly source code. If you look at the generated binary (e.g. using go tool objdump), it will have those instructions.
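As a minimal illustration of that convention (an assumed example, not from the s2 sources): the declared frame size in the TEXT directive is all the assembler needs to synthesize the prologue and epilogue.

```
// func addOne(x int64) int64
// The $64 below declares a 64-byte frame; the Go assembler emits the
// stack-bound check, the SUBQ/ADDQ of SP, and the BP save/restore
// automatically from this declaration.
TEXT ·addOne(SB), $64-16
    MOVQ x+0(FP), AX
    INCQ AX
    MOVQ AX, ret+8(FP)
    RET
```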

klauspost commented 6 months ago

Then unfortunately that isn't where the problem lies.

dctrwatson commented 6 months ago

GODEBUG=gcshrinkstackoff=1 hasn't had any panics in 20h

I'll switch to gccheckmark=1 to confirm it has no effect (as-in, still get panics)

dctrwatson commented 6 months ago

Well that was fast... got one already with GODEBUG=gccheckmark=1

unexpected fault address 0x0
fatal error: fault

mauri870 commented 6 months ago

Interesting finding!

I see that copystack says https://go.googlesource.com/go/+/adec22b9f7c9d9cfb95ff6af1c63ec489d6e9bb8/src/runtime/stack.go#853

Caller must have changed gp status to Gcopystack.

But I see here that we are not changing the status to Gcopystack https://go.googlesource.com/go/+/adec22b9f7c9d9cfb95ff6af1c63ec489d6e9bb8/src/runtime/stack.go#1214

Could that be the case? If the goroutine is not in _Gcopystack, might the concurrent GC try to scan the stack while we are doing the copy?

cc @cherrymui

cherrymui commented 6 months ago

@mauri870 Thanks for catching that. I agree that it seems inconsistent with the comment, but I don't obviously see what is wrong. The G (and its stack) is either owned via the scan bit, or it is the running G itself, so other goroutines shouldn't be doing anything with its stack. Also, https://github.com/golang/go/issues/64781#issuecomment-1864956541 seems to suggest that stack shrinking isn't the cause. That said, if you'd like, you can try this patch and see if it makes any difference.

diff --git a/src/runtime/stack.go b/src/runtime/stack.go
index 61cd0a0fdd..6d4c6ea485 100644
--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
@@ -1161,6 +1161,7 @@ func shrinkstack(gp *g) {
    if gp.stack.lo == 0 {
        throw("missing stack in shrinkstack")
    }
+   var selfShrink bool
    if s := readgstatus(gp); s&_Gscan == 0 {
        // We don't own the stack via _Gscan. We could still
        // own it if this is our own user G and we're on the
@@ -1169,6 +1170,7 @@ func shrinkstack(gp *g) {
            // We don't own the stack.
            throw("bad status in shrinkstack")
        }
+       selfShrink = true
    }
    if !isShrinkStackSafe(gp) {
        throw("shrinkstack at bad time")
@@ -1211,7 +1213,13 @@ func shrinkstack(gp *g) {
        print("shrinking stack ", oldsize, "->", newsize, "\n")
    }

+   if selfShrink {
+       casgstatus(gp, _Grunning, _Gcopystack)
+   }
    copystack(gp, newsize)
+   if selfShrink {
+       casgstatus(gp, _Gcopystack, _Grunning)
+   }
 }

 // freeStackSpans frees unused stack spans at the end of GC.

mauri870 commented 6 months ago

From what I understood, setting GODEBUG=gcshrinkstackoff=1 did fix the issue, so taking the short path where we don't call copystack seems to prevent the stack corruption.

This paired with the comment that we should be changing the status to _Gcopystack raised some eyebrows.

Plus this comment in the other usage of copystack:

// The concurrent GC will not scan the stack while we are doing the copy since
// the gp is in a Gcopystack status.

cherrymui commented 6 months ago

Okay, I misread the comment https://github.com/golang/go/issues/64781#issuecomment-1864956541 -- it is gccheckmark, not gcshrinkstackoff... Then try if that patch makes any difference. Thanks.

mauri870 commented 6 months ago

@dctrwatson If you are able to, can you compile your binary with a custom Go toolchain with cherrymui's patch and then run the program without any GODEBUG env set to see if that fixes the issue? Thanks

klauspost commented 6 months ago

@dctrwatson As far as I can tell the workflow would be:

1) Locate the Go installation folder (for example via whereis go).
2) Go to ../src/runtime/stack.go and apply the patch.
3) Recompile the crashing application.
4) Deploy the application.

AFAICT it shouldn't be necessary to recompile the go compiler, but someone can correct me if I am wrong.

cherrymui commented 6 months ago

You're right that there is no need to recompile the Go compiler. You can also find your Go installation by running go env GOROOT.

cherrymui commented 6 months ago

I'm still not sure what could go wrong. For the concurrent GC to scan its stack, it needs to preempt the goroutine. But if we're copying the stack, we cannot be preempted until the copying is done.

klauspost commented 6 months ago

I don't know if it is relevant, but at least 2 of 3 (and probably also the third) of the affected customers have dual-CPU systems. It may however not be statistically significant, since roughly 40% of our customers use such systems.

It seems to be somewhat specific: certain machines appear to be "vulnerable" to this. If it were something that happened on all installs, I'd say it would happen a lot more often than what we are observing.

mknyszek commented 6 months ago

If you're OK with log spew in the name of debugging, it might also be worthwhile to set stackDebug = 1 in src/runtime/stack.go. We may also want to patch in a few lines that print the top frames of the stack in the stack-shrinking case, before the shrink happens.

If this is some stack-shrink-at-a-bad-time bug then we should at least be able to identify which stack shrink messed things up. And if we can get exactly when that stack shrink happened (i.e. what the stack looked like at the time of the shrink), then that might give us a smoking gun.

smiletrl commented 6 months ago

Regarding the patch in https://github.com/golang/go/issues/64781#issuecomment-1865118335: go1.19 (https://github.com/golang/go/blob/release-branch.go1.19/src/runtime/stack.go#L1160) does not set the status there either; shrinkstack is the same as in go1.20 (https://github.com/golang/go/blob/release-branch.go1.20/src/runtime/stack.go#L1162).

Regarding the error fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?): it means a pointer in a heap object refers to an illegal address, and it is raised from https://github.com/golang/go/blob/release-branch.go1.20/src/runtime/mbitmap.go#L288.

This error has also been reported on 1.19 (https://github.com/golang/go/issues/56552) and 1.14 (https://github.com/golang/go/issues/37968).


In this issue's case,

The issues only appear to occur for a very small number of our customers

It seems to be somewhat specific for certain machines to be "vulnerable" to this. If it was something that would happen on all installs, I'd say it would happen a lot more than what we are observing.

and https://go.dev/doc/go1.20

The most important effect of the default change is that when Go is installed on a system without a C compiler, it will now use pure Go builds for packages in the standard library that use cgo, instead of using pre-distributed package archives (which have been removed, as noted above) or attempting to use cgo and failing.

Is it possible that customers using different systems, with or without C compilers, are triggering this error?

dctrwatson commented 6 months ago

@mauri870, I followed the steps by @klauspost in https://github.com/golang/go/issues/64781#issuecomment-1866590541. I used go version go1.21.5 linux/amd64 to build this time.

After a couple of hours we got a panic though:

unexpected fault address 0x0
fatal error: fault
fatal: bad g in signal handler

Perhaps, as @mknyszek suggested, we can add some debug prints. That way I can also validate that I'm building properly with the patch applied.

smiletrl commented 6 months ago

@klauspost @dctrwatson hey, do you guys have some minimal code structure sample to setup in a local env to reproduce this error?

dctrwatson commented 6 months ago

@klauspost @dctrwatson hey, do you guys have some minimal code structure sample to setup in a local env to reproduce this error?

This is with Thanos v0.32.5 running the receiver component in router mode. So unfortunately, I cannot provide one, sorry.

klauspost commented 6 months ago

@smiletrl As I wrote it seems to only trigger on certain setups. Unfortunately these are all production setups running on customer hardware to which we don't have access.

vser1 commented 6 months ago

We are also seeing this corruption issue in production, building with Go 1.20.11, using klauspost/compress v1.17.3, on Debian 10 running kernel 5.10.0-25-amd64 on dual-CPU Intel Xeon Gold 6348/6230R.

mknyszek commented 5 months ago

@klauspost This ended up back in the compiler/runtime triage queue, do you have any updates from your end on trying it out with debug settings enabled?

klauspost commented 5 months ago

@mknyszek We also see the issue being worked around with GODEBUG=gcshrinkstackoff=1. We've seen the issue at one or two more customers, and I will see if I can find more commonalities. Since all of these are production systems, we have not been able to deploy a custom build with the proposed patch.

It briefly looked like "Intel SDE" could reproduce the issue, but it turned out that was probably just emulation quirks. So I still don't have a reliable reproducer.

It does match quite well with the code using quite a bit of stack. Probably more than most other code out there.

karsov commented 5 months ago

Hi all, I want to add one more case to the issue with some additional details if it helps for the debugging. In our case we are also running Thanos (v0.33) on K8s on GCP AMD instances. We were seeing restarts (fatal errors) every couple of hours only in the region where we have the highest traffic. The restarts disappeared completely when we set the GODEBUG=gcshrinkstackoff=1 environment variable.

Here are the last log lines in 3 different occasions of fatal errors (without GODEBUG):

Pod 1:

fatal error: workbuf is empty
fatal: bad g in signal handler

Pod 2:

runtime: pointer 0xc00a685fd0 to unallocated span span.base()=0xc00265a000 span.limit=0xc00265bf80 span.state=0
runtime: found in object at *(0xc00228eb40+0x88)
object=0xc00228eb40 s.base()=0xc00228e000 s.limit=0xc00228ff80 s.spanclass=38 s.elemsize=288 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
 *(object+56) = 0xffffffffffffffff
 *(object+64) = 0x0
 *(object+72) = 0x0
 *(object+80) = 0x0
 *(object+88) = 0xc00a685fa8
 *(object+96) = 0x44529c
 *(object+104) = 0xc000a36b60
 *(object+112) = 0x0
 *(object+120) = 0x0
 *(object+128) = 0x0
 *(object+136) = 0xc00a685fd0 <==
 *(object+144) = 0x0
 *(object+152) = 0x4891a5
 *(object+160) = 0xc00a685fe0
 *(object+168) = 0x0
 *(object+176) = 0x6
 *(object+184) = 0x1b02e04
 *(object+192) = 0xc0018b7520
 *(object+200) = 0x0
 *(object+208) = 0x1000000000000
 *(object+216) = 0x6500000000
 *(object+224) = 0x0
 *(object+232) = 0x0
 *(object+240) = 0x0
 *(object+248) = 0x0
 *(object+256) = 0x0
 *(object+264) = 0x0
 *(object+272) = 0x0
 *(object+280) = 0x0
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

Pod 3:

runtime: pointer 0xc000557180 to unallocated span span.base()=0xc000556000 span.limit=0xc000558000 span.state=0
runtime: found in object at *(0xc00122dd10+0x10)
object=0xc00122dd10 s.base()=0xc00122c000 s.limit=0xc00122dfe0 s.spanclass=10 s.elemsize=48 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0xc000557180 <==
 *(object+24) = 0x1b
 *(object+32) = 0x1b
 *(object+40) = 0xc0032657d0
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)
fatal: bad g in signal handler

I hope this helps.

aclements commented 5 months ago

@karsov , thanks! Are you able to share the stack tracebacks as well? You can strip out the symbols, paths, and arguments, though if you could keep the runtime functions that's probably helpful. I mostly want to see the goroutines and the sp/fp printed on each frame.

karsov commented 5 months ago

That's the strange part (at least for me). There are absolutely no stack traces, nor any other log lines with timestamps close to the errors I already shared.

aclements commented 5 months ago

Oh, hmm. I guess I would expect there to be no stack trace for the "fatal: bad g in signal handler" ones. I would have expected a traceback from pod 2, though.

The fact that two of them are printing "fatal: bad g in signal handler" is really strange. We don't manipulate workbufs or scan stacks from signal handlers. This suggests we somehow 1) crashed for one reason and then 2) somehow set g to nil, which would immediately cause a signal and print that message. I don't see how we get from 1 to 2.

guozhiheng0123 commented 3 months ago

@karsov We also encountered the same error. Can you tell me how to set the GODEBUG=gcshrinkstackoff=1 environment variable in a k8s deployment? thx
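For reference, setting it amounts to adding an `env` entry to the container spec. A minimal sketch of the relevant Deployment fragment (the names and image tag here are placeholders, not from this thread):

```yaml
# Fragment of a Deployment manifest; adjust names/image to your setup.
spec:
  template:
    spec:
      containers:
        - name: thanos                             # placeholder container name
          image: quay.io/thanos/thanos:v0.33.0     # placeholder image tag
          env:
            - name: GODEBUG
              value: gcshrinkstackoff=1            # disables GC stack shrinking
```

After applying, you can confirm the variable is set with `kubectl exec <pod> -- env | grep GODEBUG`.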

klauspost commented 3 months ago

If anyone has the possibility, please test the latest release without the GODEBUG switch.

v1.17.7 contains an OOB write fix. It isn't on the stack, however, so I am unsure whether it resolves this issue. We have been unable to confirm it either way internally.

eustrainLee commented 1 month ago

I have a service running on 3 hosts. The panic occurred about every day. After I set GODEBUG=gcshrinkstackoff=1, it happened again 14 days later, so I think there is some other cause.

RodrigoMenezes-Vantage commented 2 weeks ago

@klauspost I happen to be running a version of Thanos (which I believe suffers from this issue) that uses v1.17.7, and it just crashed:

runtime: g42867214: frame.sp=0xc0020c1e50 top=0xc0020c1fe0
    stack=[0xc002042000-0xc0020c2000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x28d4031?, 0x0?})
    /opt/bitnami/go/src/runtime/panic.go:1077 +0x5c fp=0xc000295d40 sp=0xc000295d10 pc=0x43b7dc
runtime.(*unwinder).finishInternal(0x0?)
    /opt/bitnami/go/src/runtime/traceback.go:568 +0x12a fp=0xc000295d80 sp=0xc000295d40 pc=0x46210a
runtime.(*unwinder).next(0xc000295e28?)
    /opt/bitnami/go/src/runtime/traceback.go:449 +0x235 fp=0xc000295df8 sp=0xc000295d80 pc=0x461f15
runtime.addOneOpenDeferFrame.func1()
    /opt/bitnami/go/src/runtime/panic.go:648 +0x85 fp=0xc000295fc8 sp=0xc000295df8 pc=0x43a985
runtime.systemstack()
    /opt/bitnami/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000295fd8 sp=0xc000295fc8 pc=0x46fb8a

goroutine 42867214 [running]:
runtime.systemstack_switch()
    /opt/bitnami/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0020c1d60 sp=0xc0020c1d50 pc=0x46fb28
runtime.addOneOpenDeferFrame(0x0?, 0x0?, 0x0?)
    /opt/bitnami/go/src/runtime/panic.go:645 +0x65 fp=0xc0020c1da0 sp=0xc0020c1d60 pc=0x43a8a5
panic({0x23fd040?, 0x4cb1710?})
    /opt/bitnami/go/src/runtime/panic.go:874 +0x14a fp=0xc0020c1e50 sp=0xc0020c1da0 pc=0x43b14a
runtime.panicmem(...)
    /opt/bitnami/go/src/runtime/panic.go:261
runtime.sigpanic()
    /opt/bitnami/go/src/runtime/signal_unix.go:861 +0x378 fp=0xc0020c1eb0 sp=0xc0020c1e50 pc=0x4527f8
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 11220
    /bitnami/blacksmith-sandox/thanos-0.35.1/pkg/mod/github.com/klauspost/compress@v1.17.7/s2/writer.go:509 +0xb5

goroutine 1 [chan receive, 1049 minutes]:
runtime.gopark(0xc000909b40?, 0x410845?, 0xc0?, 0x20?, 0x20?)
    /opt/bitnami/go/src/runtime/proc.go:398 +0xce fp=0xc0006d3ad8 sp=0xc0006d3ab8 pc=0x43e6ce
runtime.chanrecv(0xc000767e00, 0xc000909bd8, 0x1)
    /opt/bitnami/go/src/runtime/chan.go:583 +0x3cd fp=0xc0006d3b50 sp=0xc0006d3ad8 pc=0x409c2d
runtime.chanrecv1(0xc000909be8?, 0x9?)
    /opt/bitnami/go/src/runtime/chan.go:442 +0x12 fp=0xc0006d3b78 sp=0xc0006d3b50 pc=0x409832
github.com/oklog/run.(*Group).Run(0xc000628b70)
    /bitnami/blacksmith-sandox/thanos-0.35.1/pkg/mod/github.com/oklog/run@v1.1.0/group.go:43 +0x155 fp=0xc0006d3bf8 sp=0xc0006d3b78 pc=0x626e95
main.main()
    /bitnami/blacksmith-sandox/thanos-0.35.1/src/github.com/thanos-io/thanos/cmd/thanos/main.go:169 +0x19f8 fp=0xc0006d3f40 sp=0xc0006d3bf8 pc=0x20bb018
runtime.main()
    /opt/bitnami/go/src/runtime/proc.go:267 +0x2bb fp=0xc0006d3fe0 sp=0xc0006d3f40 pc=0x43e25b
runtime.goexit()
    /opt/bitnami/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0006d3fe8 sp=0xc0006d3fe0 pc=0x471981

Kernel Version: 6.1.91-99.172.amzn2023.x86_64 OS Image: Amazon Linux 2023.4.20240528 Operating System: linux Architecture: amd64

klauspost commented 1 week ago

@RodrigoMenezes-Vantage Thanks!

The release above is compiled with Go 1.21.10

goroutine 42867214 is the goroutine that uses 327704 bytes of stack in this call.

FWIW, we have not observed this issue at MinIO for quite a while. If it isn't the fix above (which would also make sense, since it affects the heap rather than the stack), it could be that Go 1.22 fixed it (we switched shortly after its release), but that is of course just speculation.

With thousands of deploys it seems to be something machine specific, especially since it seems to be the same machines reporting it - and being semi-reliably able to reproduce.

This is of course not a great situation. Is there anything I can try to move this towards a resolution?

DemiMarie commented 1 day ago

@klauspost: Can you check the CPU versions (including steppings!) of machines that crash and machines that don't, and see if the crashing machines are always running certain CPUs?

The reason I ask is that while CPU bugs are rare, they are not unheard of, and they can have completely unpredictable consequences, up to and including being able to escalate privileges. Furthermore, these bugs can require very specific conditions to trigger, so it is quite possible that generic CPU tests did not trigger the bug. This issue reminds me of https://gallium.inria.fr/blog/intel-skylake-bug/, which turned out to be an erratum in Intel Skylake and Kaby Lake parts.

DemiMarie commented 1 day ago

@klauspost: If your customers are using hyperthreading, can you ask them to disable hyperthreading and see if the crashes go away?

Edit: another thing to check would be if there is a microcode update available, and if so, if that solves the problem.
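A quick sketch of both checks, assuming a Linux host with sysfs exposed (the SMT control path requires kernel 4.18+ and may not be visible inside containers; the microcode field is often absent inside VMs):

```shell
# Report the loaded microcode revision (field is often missing in VMs):
grep -m1 'microcode' /proc/cpuinfo || echo "no microcode field (common in VMs)"

# Report whether SMT (hyperthreading) is active: 1 = on, 0 = off.
cat /sys/devices/system/cpu/smt/active 2>/dev/null || echo "SMT control not exposed"

# Disabling SMT at runtime requires root (persists until reboot):
#   echo off > /sys/devices/system/cpu/smt/control
```

For CPU model and stepping themselves, `lscpu` or the `model`/`stepping` fields in `/proc/cpuinfo` give the same data that was collected below.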

klauspost commented 1 day ago

I have added the CPU info from 3 crashing systems. Most (if not all) are virtualized, so CPUID can be unreliable.

Also, these are their current (or close to current) microcode levels. These crashes happened almost a year ago, so the microcode levels may not reflect the levels at the time of the crash. We do not have any access to these systems.

Case 1:

```json
{
  "cache_size": 25600,
  "cores": 4,
  "family": "6",
  "mhz": 2399.998,
  "microcode": "0xb000040",
  "model": "79",
  "model_name": "Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz",
  "physical_id": "0",
  "stepping": 1,
  "vendor_id": "GenuineIntel"
},
```

Case 2:

```json
{
  "vendor_id": "GenuineIntel",
  "family": "6",
  "model": "79",
  "stepping": 1,
  "physical_id": "0",
  "model_name": "Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz",
  "mhz": 3300,
  "cache_size": 35840,
  "microcode": "0xb000040",
  "cores": 28
},
```

Case 3 - these are the 2 CPU variants in the cluster. I cannot tell on which the crash happened, but both have similar microcode versions:

```json
{
  "vendor_id": "GenuineIntel",
  "family": "6",
  "model": "106",
  "stepping": 6,
  "physical_id": "1",
  "model_name": "Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz",
  "mhz": 3600,
  "cache_size": 36864,
  "microcode": "0xd000375",
  "cores": 48
}
// OR
{
  "vendor_id": "GenuineIntel",
  "family": "6",
  "model": "106",
  "stepping": 6,
  "physical_id": "0",
  "model_name": "Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz",
  "mhz": 3200,
  "cache_size": 49152,
  "microcode": "0xd000375",
  "cores": 64
},
```

All 3 are different customers. Another crash appears to have happened on an AMD EPYC system, but I was unable to confirm this.

Note that all of these are from externally running systems, so I have very limited access to information. In all cases only one of several identical servers would see the crashes happen - but at a regular frequency.

But all of them started at the Go 1.19 -> 1.20 compiler upgrade and were solved either by a similar build compiled with Go 1.19 or by GODEBUG=gcshrinkstackoff=1. As I also said, we have not seen this issue recently, but it appears others above have.

DemiMarie commented 1 day ago

@klauspost:

In all cases only one of several identical servers would see the crashes happen - but at a regular frequency.

Were these servers running on identical hosts?

klauspost commented 1 day ago

@DemiMarie If you mean VM hosts, I strongly expect so, but have no solid proof. The hardware itself should also be fairly uniform. We checked for faulty/flaky memory as one of the first debugging steps.