golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.91k stars 17.65k forks source link

runtime: panic not generating a correct backtrace stack while crashing in cgo #63277

Closed zzkcode closed 11 months ago

zzkcode commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.21.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4191549828=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I'm writing a Golang+CGO program, and testing a bad code in CGO to check if it could catch the right backtrace. Different versions of Go give me different outputs, and it's working in go1.18.3, but not working on go1.21.1 and the latest gotip version(go version devel go1.22-1176052 Thu Sep 28 03:38:07 2023 +0000 linux/amd64).

Like the backtrace from go1.18.3, I would expected it to give the right backtrace which caused the crash. I noticed that if it generates fatal error: unexpected signal during runtime execution and with runtime stack: in the stdout, then it's correct. Otherwise, it is not correct. Extracting the C code would give me the right backtrace btw.

What did you expect to see?

Generating the correct backtrace(details will be in the Details section).

What did you see instead?

It's not generating the correct backtrace(details will be in the Details section).

Details

go1.18.3 from the backtrace, we can see the crash is from core function which in frame 14:

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:168
#1  0x0000000000447e45 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:852
#2  0x0000000000448445 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/go/src/runtime/signal_unix.go:1066
#3  0x0000000000446e67 in runtime.sigtrampgo (sig=0, info=0xc00000b208, ctx=0x4602e1 <runtime.raise+33>) at /usr/local/go/src/runtime/signal_unix.go:430
#4  0x000000000046108e in runtime.sigtrampgo (sig=6, info=0xc00000b4b0, ctx=0xc00000b380) at <autogenerated>:1
#5  0x00000000004605bd in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:361
#6  <signal handler called>
#7  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:168
#8  0x0000000000447e45 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:852
#9  0x0000000000448058 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:944
#10 0x0000000000433c71 in runtime.fatalthrow.func1 () at /usr/local/go/src/runtime/panic.go:1051
#11 0x0000000000433bf0 in runtime.fatalthrow () at /usr/local/go/src/runtime/panic.go:1044
#12 0x00000000004339b1 in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:992
#13 0x0000000000447de5 in runtime.sigpanic () at /usr/local/go/src/runtime/signal_unix.go:802
#14 0x000000000046e7f0 in core () at /data-share/badstack/main.go:20
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

std:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x46e7f0]

runtime stack:
runtime.throw({0x486e64?, 0x40bb79?})
    /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0x7ffd80fa4310 sp=0x7ffd80fa42e0 pc=0x4339b1
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:802 +0x225 fp=0x7ffd80fa4340 sp=0x7ffd80fa4310 pc=0x447de5

goroutine 1 [syscall]:
runtime.cgocall(0x46e8e0, 0xc00010df50)
    /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc00010df28 sp=0xc00010def0 pc=0x403f3c
main._Cfunc_core_logic()
    _cgo_gotypes.go:41 +0x45 fp=0xc00010df50 sp=0xc00010df28 pc=0x46e725
main.coreLogic()
    /data-share/badstack/main.go:48 +0x17 fp=0xc00010df60 sp=0xc00010df50 pc=0x46e777
main.main()
    /data-share/badstack/main.go:55 +0x2a fp=0xc00010df80 sp=0xc00010df60 pc=0x46e7ca
runtime.main()
    /usr/local/go/src/runtime/proc.go:250 +0x1d8 fp=0xc00010dfe0 sp=0xc00010df80 pc=0x436038
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00010dfe8 sp=0xc00010dfe0 pc=0x45ea01

goroutine 2 [force gc (idle)]:
runtime.gopark(0x488500, 0x6ea590, 0x11, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:361 +0xf2 fp=0xc000046f88 sp=0xc000046f58 pc=0x436452
runtime.goparkunlock(0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:367 +0x2a fp=0xc000046fb8 sp=0xc000046f88 pc=0x4364ea
runtime.forcegchelper()
    /usr/local/go/src/runtime/proc.go:301 +0xa5 fp=0xc000046fe0 sp=0xc000046fb8 pc=0x436285
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000046fe8 sp=0xc000046fe0 pc=0x45ea01
created by runtime.init.6
    /usr/local/go/src/runtime/proc.go:289 +0x25

goroutine 3 [GC sweep wait]:
runtime.gopark(0x488500, 0x6ea720, 0xc, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:361 +0xf2 fp=0xc000047768 sp=0xc000047738 pc=0x436452
runtime.goparkunlock(0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:367 +0x2a fp=0xc000047798 sp=0xc000047768 pc=0x4364ea
runtime.bgsweep(0x0?)
    /usr/local/go/src/runtime/mgcsweep.go:278 +0x98 fp=0xc0000477c8 sp=0xc000047798 pc=0x421ab8
runtime.gcenable.func1()
    /usr/local/go/src/runtime/mgc.go:177 +0x26 fp=0xc0000477e0 sp=0xc0000477c8 pc=0x4179a6
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0000477e8 sp=0xc0000477e0 pc=0x45ea01
created by runtime.gcenable
    /usr/local/go/src/runtime/mgc.go:177 +0x6b

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x488500, 0x6ea6e0, 0xd, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:361 +0xf2 fp=0xc000047ef0 sp=0xc000047ec0 pc=0x436452
runtime.goparkunlock(0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:367 +0x2a fp=0xc000047f20 sp=0xc000047ef0 pc=0x4364ea
runtime.bgscavenge(0x0?)
    /usr/local/go/src/runtime/mgcscavenge.go:272 +0xed fp=0xc000047fc8 sp=0xc000047f20 pc=0x41f54d
runtime.gcenable.func2()
    /usr/local/go/src/runtime/mgc.go:178 +0x26 fp=0xc000047fe0 sp=0xc000047fc8 pc=0x417946
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x45ea01
created by runtime.gcenable
    /usr/local/go/src/runtime/mgc.go:178 +0xaa

goroutine 5 [finalizer wait]:
runtime.gopark(0x488500, 0x719890, 0x10, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:361 +0xf2 fp=0xc000046600 sp=0xc0000465d0 pc=0x436452
runtime.goparkunlock(0x0?, 0x60?, 0x66?, 0xc000046770?)
    /usr/local/go/src/runtime/proc.go:367 +0x2a fp=0xc000046630 sp=0xc000046600 pc=0x4364ea
runtime.runfinq()
    /usr/local/go/src/runtime/mfinal.go:177 +0xab fp=0xc0000467e0 sp=0xc000046630 pc=0x416a2b
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0000467e8 sp=0xc0000467e0 pc=0x45ea01
created by runtime.createfing
    /usr/local/go/src/runtime/mfinal.go:157 +0x45
Aborted (core dumped)

go1.21.1 from the backtrace, we can not tell which function cause this crash:

(gdb) bt
#0  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:154
#1  0x000000000044b0c5 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:903
#2  0x000000000044b755 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /usr/local/go/src/runtime/signal_unix.go:1108
#3  0x000000000044a025 in runtime.sigtrampgo (sig=0, info=0xc0001063b0, ctx=0x4642a1 <runtime.raise+33>) at /usr/local/go/src/runtime/signal_unix.go:432
#4  0x0000000000464586 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:154
#7  0x000000000044b0c5 in runtime.dieFromSignal (sig=6) at /usr/local/go/src/runtime/signal_unix.go:903
#8  0x000000000044b2b2 in runtime.crash () at /usr/local/go/src/runtime/signal_unix.go:985
#9  0x000000000044a9c5 in runtime.sighandler (sig=3, info=<optimized out>, ctxt=<optimized out>, gp=0xc000007860) at /usr/local/go/src/runtime/signal_unix.go:771
#10 0x000000000044a151 in runtime.sigtrampgo (sig=3, info=0xc0001074b0, ctx=0xc000107380) at /usr/local/go/src/runtime/signal_unix.go:490
#11 0x0000000000464586 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:557
#14 0x00000000004321f0 in runtime.futexsleep (addr=0xca, val=0, ns=4605987) at /usr/local/go/src/runtime/os_linux.go:69
#15 0x000000000040b77d in runtime.notesleep (n=0x713e58 <runtime.newmHandoff+24>) at /usr/local/go/src/runtime/lock_futex.go:160
#16 0x000000000043c6e5 in runtime.templateThread () at /usr/local/go/src/runtime/proc.go:2514
#17 0x000000000043b1d3 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1600
#18 0x000000000043b125 in runtime.mstart0 () at /usr/local/go/src/runtime/proc.go:1557
#19 0x0000000000460a45 in runtime.mstart () at /usr/local/go/src/runtime/asm_amd64.s:394
#20 0x000000000046c848 in crosscall_amd64 () at gcc_amd64.S:42
#21 0x00007f36cffff700 in ?? ()
#22 0x000000c000007860 in ?? ()
#23 0x0000000000a01000 in ?? ()
#24 0x0000000000000000 in ?? ()

std:

SIGSEGV: segmentation violation
PC=0x46bd10 m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x46be00, 0xc00005df38)
    /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc00005df10 sp=0xc00005ded8 pc=0x40506b
main._Cfunc_core_logic()
    _cgo_gotypes.go:39 +0x3f fp=0xc00005df38 sp=0xc00005df10 pc=0x46bc5f
main.coreLogic()
    /data-share/badstack/main.go:48 +0xf fp=0xc00005df48 sp=0xc00005df38 pc=0x46bc8f
main.main()
    /data-share/badstack/main.go:55 +0x25 fp=0xc00005df68 sp=0xc00005df48 pc=0x46bce5
runtime.main()
    /usr/local/go/src/runtime/proc.go:267 +0x267 fp=0xc00005dfe0 sp=0xc00005df68 pc=0x438627
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00005dfe8 sp=0xc00005dfe0 pc=0x462ac1

goroutine 2 [force gc (idle)]:
runtime.gopark(0x486da8, 0x6e5850, 0x11, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:398 +0xfc fp=0xc000048f80 sp=0xc000048f50 pc=0x438a9c
runtime.goparkunlock(0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:404 +0x25 fp=0xc000048fb0 sp=0xc000048f80 pc=0x438b25
runtime.forcegchelper()
    /usr/local/go/src/runtime/proc.go:322 +0xb5 fp=0xc000048fe0 sp=0xc000048fb0 pc=0x4388b5
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000048fe8 sp=0xc000048fe0 pc=0x462ac1
created by runtime.init.6 in goroutine 1
    /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x486da8, 0x6e5a00, 0xc, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:398 +0xfc fp=0xc000049750 sp=0xc000049720 pc=0x438a9c
runtime.goparkunlock(0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:404 +0x25 fp=0xc000049780 sp=0xc000049750 pc=0x438b25
runtime.bgsweep(0x0?)
    /usr/local/go/src/runtime/mgcsweep.go:280 +0xa5 fp=0xc0000497c8 sp=0xc000049780 pc=0x423105
runtime.gcenable.func1()
    /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000497e0 sp=0xc0000497c8 pc=0x4175c5
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000497e8 sp=0xc0000497e0 pc=0x462ac1
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x486da8, 0x6e5a80, 0xd, 0x14, 0x2)
    /usr/local/go/src/runtime/proc.go:398 +0xfc fp=0xc000049f40 sp=0xc000049f10 pc=0x438a9c
runtime.goparkunlock(0x489650?, 0x1?, 0x0?, 0xc0000071e0?)
    /usr/local/go/src/runtime/proc.go:404 +0x25 fp=0xc000049f70 sp=0xc000049f40 pc=0x438b25
runtime.(*scavengerState).park(0x6e5a80)
    /usr/local/go/src/runtime/mgcscavenge.go:425 +0x45 fp=0xc000049f98 sp=0xc000049f70 pc=0x4203e5
runtime.bgscavenge(0x0?)
    /usr/local/go/src/runtime/mgcscavenge.go:653 +0x45 fp=0xc000049fc8 sp=0xc000049f98 pc=0x420945
runtime.gcenable.func2()
    /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc000049fe0 sp=0xc000049fc8 pc=0x417565
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x462ac1
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x486cb0, 0x713bc0, 0x10, 0x14, 0x1)
    /usr/local/go/src/runtime/proc.go:398 +0xfc fp=0xc000048628 sp=0xc0000485f8 pc=0x438a9c
runtime.runfinq()
    /usr/local/go/src/runtime/mfinal.go:193 +0xfa fp=0xc0000487e0 sp=0xc000048628 pc=0x41667a
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000487e8 sp=0xc0000487e0 pc=0x462ac1
created by runtime.createfing in goroutine 1
    /usr/local/go/src/runtime/mfinal.go:163 +0x45

rax    0x0
rbx    0x7ffd40b52e80
rcx    0x100000
rdx    0x7ffd40a52e90
rdi    0x713318
rsi    0x6e5b20
rbp    0x7ffd40b52e70
rsp    0x7ffd40b52e70
r8     0x6e5f00
r9     0x0
r10    0x8
r11    0x206
r12    0xc00005dca0
r13    0xffffffffffffffff
r14    0xc0000061a0
r15    0x2031
rip    0x46bd10
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x464823 m=2 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:558 +0x23 fp=0x7f36d6ebcd20 sp=0x7f36d6ebcd18 pc=0x464823
runtime.futexsleep(0x40b513?, 0x40b55e?, 0x7f36d6ebcdd0?)
    /usr/local/go/src/runtime/os_linux.go:75 +0x98 fp=0x7f36d6ebcd80 sp=0x7f36d6ebcd20 pc=0x432258
runtime.notetsleep_internal(0x6e6400, 0xdf8475800)
    /usr/local/go/src/runtime/lock_futex.go:202 +0xbb fp=0x7f36d6ebcdc0 sp=0x7f36d6ebcd80 pc=0x40b89b
runtime.notetsleep(0xb6500000014?, 0x7f3600000008?)
    /usr/local/go/src/runtime/lock_futex.go:225 +0x29 fp=0x7f36d6ebcde0 sp=0x7f36d6ebcdc0 pc=0x40b9c9
runtime.sysmon()
    /usr/local/go/src/runtime/proc.go:5570 +0x1d6 fp=0x7f36d6ebce60 sp=0x7f36d6ebcde0 pc=0x442f36
runtime.mstart1()
    /usr/local/go/src/runtime/proc.go:1600 +0x93 fp=0x7f36d6ebce88 sp=0x7f36d6ebce60 pc=0x43b1d3
runtime.mstart0()
    /usr/local/go/src/runtime/proc.go:1557 +0x85 fp=0x7f36d6ebcec0 sp=0x7f36d6ebce88 pc=0x43b125
runtime.mstart()
    /usr/local/go/src/runtime/asm_amd64.s:394 +0x5 fp=0x7f36d6ebcec8 sp=0x7f36d6ebcec0 pc=0x460a45
rax    0xfffffffffffffffc
rbx    0x0
rcx    0x464823
rdx    0x0
rdi    0x6e6400
rsi    0x80
rbp    0x7f36d6ebcd70
rsp    0x7f36d6ebcd18
r8     0x0
r9     0x0
r10    0x7f36d6ebcd58
r11    0x246
r12    0x7f36d6ebcd48
r13    0xa01000
r14    0xc0000069c0
r15    0x7f36d6ebd700
rip    0x464823
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x464821 m=3 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7f36d64bbbe8 sp=0x7f36d64bbbe0 pc=0x464821
runtime.futexsleep(0x40b55e?, 0xd64bbc68?, 0x40bbb3?)
    /usr/local/go/src/runtime/os_linux.go:69 +0x30 fp=0x7f36d64bbc48 sp=0x7f36d64bbbe8 pc=0x4321f0
runtime.notesleep(0xc00004e948)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x9d fp=0x7f36d64bbc80 sp=0x7f36d64bbc48 pc=0x40b77d
runtime.mPark()
    /usr/local/go/src/runtime/proc.go:1632 +0x1e fp=0x7f36d64bbca0 sp=0x7f36d64bbc80 pc=0x43b2be
runtime.stopm()
    /usr/local/go/src/runtime/proc.go:2536 +0x6d fp=0x7f36d64bbcc8 sp=0x7f36d64bbca0 pc=0x43c7ed
runtime.gcstopm()
    /usr/local/go/src/runtime/proc.go:2835 +0xb6 fp=0x7f36d64bbcf8 sp=0x7f36d64bbcc8 pc=0x43d176
runtime.findRunnable()
    /usr/local/go/src/runtime/proc.go:2901 +0x65 fp=0x7f36d64bbe18 sp=0x7f36d64bbcf8 pc=0x43d3a5
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:3582 +0xbd fp=0x7f36d64bbe50 sp=0x7f36d64bbe18 pc=0x43ea7d
runtime.park_m(0xc000006b60?)
    /usr/local/go/src/runtime/proc.go:3745 +0x105 fp=0x7f36d64bbe80 sp=0x7f36d64bbe50 pc=0x43efe5
runtime.mcall()
    /usr/local/go/src/runtime/asm_amd64.s:458 +0x4e fp=0x7f36d64bbe98 sp=0x7f36d64bbe80 pc=0x460ace
rax    0xca
rbx    0x0
rcx    0x464823
rdx    0x0
rdi    0xc00004e948
rsi    0x80
rbp    0x7f36d64bbc38
rsp    0x7f36d64bbbe0
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7f36d64bbc88
r13    0xffffffffffffffff
r14    0xc000006ea0
r15    0x2031
rip    0x464821
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x464821 m=4 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7f36d5abad10 sp=0x7f36d5abad08 pc=0x464821
runtime.futexsleep(0x40b55e?, 0xd5abad90?, 0x40bbb3?)
    /usr/local/go/src/runtime/os_linux.go:69 +0x30 fp=0x7f36d5abad70 sp=0x7f36d5abad10 pc=0x4321f0
runtime.notesleep(0xc00004ed48)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x9d fp=0x7f36d5abada8 sp=0x7f36d5abad70 pc=0x40b77d
runtime.mPark()
    /usr/local/go/src/runtime/proc.go:1632 +0x1e fp=0x7f36d5abadc8 sp=0x7f36d5abada8 pc=0x43b2be
runtime.stopm()
    /usr/local/go/src/runtime/proc.go:2536 +0x6d fp=0x7f36d5abadf0 sp=0x7f36d5abadc8 pc=0x43c7ed
runtime.startlockedm(0x40b513?)
    /usr/local/go/src/runtime/proc.go:2808 +0x71 fp=0x7f36d5abae18 sp=0x7f36d5abadf0 pc=0x43d071
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:3628 +0x6d fp=0x7f36d5abae50 sp=0x7f36d5abae18 pc=0x43ea2d
runtime.park_m(0xc000007040?)
    /usr/local/go/src/runtime/proc.go:3745 +0x105 fp=0x7f36d5abae80 sp=0x7f36d5abae50 pc=0x43efe5
runtime.mcall()
    /usr/local/go/src/runtime/asm_amd64.s:458 +0x4e fp=0x7f36d5abae98 sp=0x7f36d5abae80 pc=0x460ace
rax    0xca
rbx    0x0
rcx    0x464823
rdx    0x0
rdi    0xc00004ed48
rsi    0x80
rbp    0x7f36d5abad60
rsp    0x7f36d5abad08
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7f36d5abad78
r13    0xa01000
r14    0xc000007520
r15    0x7f36d5abb700
rip    0x464821
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x464821 m=5 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7f36d50b9be8 sp=0x7f36d50b9be0 pc=0x464821
runtime.futexsleep(0x40b55e?, 0xd50b9c68?, 0x40bbb3?)
    /usr/local/go/src/runtime/os_linux.go:69 +0x30 fp=0x7f36d50b9c48 sp=0x7f36d50b9be8 pc=0x4321f0
runtime.notesleep(0xc000080148)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x9d fp=0x7f36d50b9c80 sp=0x7f36d50b9c48 pc=0x40b77d
runtime.mPark()
    /usr/local/go/src/runtime/proc.go:1632 +0x1e fp=0x7f36d50b9ca0 sp=0x7f36d50b9c80 pc=0x43b2be
runtime.stopm()
    /usr/local/go/src/runtime/proc.go:2536 +0x6d fp=0x7f36d50b9cc8 sp=0x7f36d50b9ca0 pc=0x43c7ed
runtime.gcstopm()
    /usr/local/go/src/runtime/proc.go:2835 +0xb6 fp=0x7f36d50b9cf8 sp=0x7f36d50b9cc8 pc=0x43d176
runtime.findRunnable()
    /usr/local/go/src/runtime/proc.go:2901 +0x65 fp=0x7f36d50b9e18 sp=0x7f36d50b9cf8 pc=0x43d3a5
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:3582 +0xbd fp=0x7f36d50b9e50 sp=0x7f36d50b9e18 pc=0x43ea7d
runtime.park_m(0xc000007a00?)
    /usr/local/go/src/runtime/proc.go:3745 +0x105 fp=0x7f36d50b9e80 sp=0x7f36d50b9e50 pc=0x43efe5
runtime.mcall()
    /usr/local/go/src/runtime/asm_amd64.s:458 +0x4e fp=0x7f36d50b9e98 sp=0x7f36d50b9e80 pc=0x460ace
rax    0xca
rbx    0x0
rcx    0x464823
rdx    0x0
rdi    0xc000080148
rsi    0x80
rbp    0x7f36d50b9c38
rsp    0x7f36d50b9be0
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7f36d50b9d78
r13    0xa01000
r14    0xc0000821a0
r15    0x7f36d50ba700
rip    0x464821
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

-----

SIGQUIT: quit
PC=0x464821 m=6 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7f36cfffeda0 sp=0x7f36cfffed98 pc=0x464821
runtime.futexsleep(0x40bc13?, 0x713e40?, 0x7f36cfffee28?)
    /usr/local/go/src/runtime/os_linux.go:69 +0x30 fp=0x7f36cfffee00 sp=0x7f36cfffeda0 pc=0x4321f0
runtime.notesleep(0x713e58)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x9d fp=0x7f36cfffee38 sp=0x7f36cfffee00 pc=0x40b77d
runtime.templateThread()
    /usr/local/go/src/runtime/proc.go:2514 +0x65 fp=0x7f36cfffee60 sp=0x7f36cfffee38 pc=0x43c6e5
runtime.mstart1()
    /usr/local/go/src/runtime/proc.go:1600 +0x93 fp=0x7f36cfffee88 sp=0x7f36cfffee60 pc=0x43b1d3
runtime.mstart0()
    /usr/local/go/src/runtime/proc.go:1557 +0x85 fp=0x7f36cfffeec0 sp=0x7f36cfffee88 pc=0x43b125
runtime.mstart()
    /usr/local/go/src/runtime/asm_amd64.s:394 +0x5 fp=0x7f36cfffeec8 sp=0x7f36cfffeec0 pc=0x460a45
rax    0xca
rbx    0x0
rcx    0x464823
rdx    0x0
rdi    0x713e58
rsi    0x80
rbp    0x7f36cfffedf0
rsp    0x7f36cfffed98
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x0
r13    0xa01000
r14    0xc000007860
r15    0x7f36cffff700
rip    0x464821
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
Aborted (core dumped)

Pure C from the backtrace, we can see the crash is caused by core function in frame 0.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000400660 in core () at main.c:15
15    *ptr = 1024;
(gdb) bt
#0  0x0000000000400660 in core () at main.c:15
#1  0x00007feccc2cb190 in ?? () from /lib64/libc.so.6
#2  0x0000000000601408 in uctx_core ()
#3  0x0000000000000000 in ?? ()

Reproducer

Please compile with: CC=clang CXX=clang++ CFLAGS="$(cflags)" go build -gcflags="all=-N -l" main.go

package main

/*
#cgo CFLAGS: -g -O0
#include <stdio.h>
#include <stddef.h>
#include <ucontext.h>
#include <stdlib.h>
#include <signal.h>

static ucontext_t uctx_main, uctx_core;

void core()
{
  // core logic

  // trigger crash
  int* ptr = NULL;
  *ptr = 1024;
}

void core_logic()
{
  size_t size = 1024 * 1024;
  char stack[size]; // SIGSEGV
  //void* stack = malloc(size); // SIGTRAP

  if (getcontext(&uctx_core) == -1)
    printf("failed to getcontext");

  uctx_core.uc_stack.ss_sp = stack;
  uctx_core.uc_stack.ss_size = size;
  uctx_core.uc_link = &uctx_main;

  makecontext(&uctx_core, core, 0);

  if (swapcontext(&uctx_main, &uctx_core) == -1)
    printf("failed to swapcontext");

  printf("back\n");
}
*/
import "C"
import "runtime/debug"

func coreLogic() {
  C.core_logic()
}

func main() {
  debug.SetTraceback("crash")

  // Call the C function from Go
  coreLogic()
}

Environment

# clang -v
clang version 10.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

# gdb -v
GNU gdb (GDB) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Updated on 2023-09-29: I think it's related to use ucontext in CGO, since if I do not bind function with ucontext, just call core C function directly instead of calling core_logic, the stack backtrace would be fine. And it may has something related to this issue https://github.com/golang/go/issues/62130.

cherrymui commented 1 year ago

For GDB, are you running the program in GDB, or you're using GDB to debug a core dump? For the former, GDB should stop when it first gets a SIGSEGV, before the Go signal handler runs. You should be able to get a backtrace from there.

PC=0x46bd10

For the stderr output, it prints the PC where the fault happens, so you at least can see which function it is, although not a full stack trace.

zzkcode commented 1 year ago

Hi @cherrymui, thanks for your comment. I'm using GDB to debug a core dump. And let's be more clearly, I want to catch the right stack backtrace in the real application's core dump, so I do this for a verification. Regarding to using PC to see what really happen in a real application may not be enough without a correct backtrace.

I think it's related to use ucontext in CGO, since if I do not bind function with ucontext, just call core C function directly instead of calling core_logic, the stack backtrace would be fine.

It may has something related to this issue https://github.com/golang/go/issues/62130? However, gotip version does not fix this. FYI @ianlancetaylor

package main

/*
#cgo CFLAGS: -g -O0
#include <stdio.h>
#include <stddef.h>
#include <ucontext.h>
#include <stdlib.h>
#include <signal.h>

static ucontext_t uctx_main, uctx_core;

void core()
{
  // core logic

  // trigger crash
  int* ptr = NULL;
  *ptr = 1024;
}

void core_logic()
{
  size_t size = 1024 * 1024;
  char stack[size]; // SIGSEGV
  //void* stack = malloc(size); // SIGTRAP

  if (getcontext(&uctx_core) == -1)
    printf("failed to getcontext");

  uctx_core.uc_stack.ss_sp = stack;
  uctx_core.uc_stack.ss_size = size;
  uctx_core.uc_link = &uctx_main;

  makecontext(&uctx_core, core, 0);

  if (swapcontext(&uctx_main, &uctx_core) == -1)
    printf("failed to swapcontext");

  printf("back\n");
}
*/
import "C"
import "runtime/debug"

func coreLogic() {
  C.core() // call core directly
}

func main() {
  debug.SetTraceback("crash")

  // Call the C function from Go
  coreLogic()
}
thanm commented 1 year ago

@golang/runtime

prattmic commented 1 year ago

Just to be sure I understand correctly, the issue you are reporting that that in 1.18, from a core file gdb shows the faulting function in a backtrace, while in 1.21 gdb does not show the faulting function?

It is not about the panic output from the runtime (which in your example IMO is better in 1.21 than 1.18).

zzkcode commented 1 year ago

Hi @prattmic, thanks for your comment.

Just to be sure I understand correctly, the issue you are reporting that that in 1.18, from a core file gdb shows the faulting function in a backtrace, while in 1.21 gdb does not show the faulting function?

Yes, exactly! I suspect that it's related to using ucontext in cgo, since I call C.core function directly instead of calling C.core_logic in go1.21.1, then the stack backtrace will be correct. The C.core_logic just use ucontext to bind core function.

It is not about the panic output from the runtime (which in your example IMO is better in 1.21 than 1.18).

The panic output in the first few lines has more or less show something output during it's execution in runtime. Everytime I see a correct backtrace, I would see a runtime stack output, which I believe it's outputted in panic.go's dopanic_m.

output with a runtime stack:

runtime stack:
runtime.throw({0x486e64?, 0x40bb79?})
    /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0x7ffd80fa4310 sp=0x7ffd80fa42e0 pc=0x4339b1
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:802 +0x225 fp=0x7ffd80fa4340 sp=0x7ffd80fa4310 pc=0x447de5

master brach: https://github.com/golang/go/blob/master/src/runtime/panic.go#L1334

// src/runtime/panic.go
        } else if level >= 2 || gp.m.throwing >= throwTypeRuntime {
            print("\nruntime stack:\n")
            traceback(pc, sp, 0, gp)
        }
zzkcode commented 1 year ago

Updated: I did a test and it's still working on go1.20.8. So it may be broken in go1.21.1 go1.21.0? FYI @prattmic

zzkcode commented 1 year ago

More update: It seems like the thread that caused the crash was never the number one thread in GDB, it may be harder to find out in a real application, since it may have hundreds of threads. FYI @prattmic @cherrymui

Steps to reproduce under go1.21.1:

show all threads, and switch to every thread to check the backtrace, finally, it turns out the thread id 6 is the one that caused the crash.

(gdb) i threads
  Id   Target Id                        Frame
* 1    Thread 0x7ff426ffd700 (LWP 5269) runtime.raise () at /usr/local/go/src/runtime/sys_linux_amd64.s:154
  2    Thread 0x7ff42ca17700 (LWP 5266) runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
  3    Thread 0x7ff4277fe700 (LWP 5268) runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
  4    Thread 0x7ff474438740 (LWP 5264) runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
  5    Thread 0x7ff42d218700 (LWP 5265) runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
  6    Thread 0x7ff427fff700 (LWP 5267) runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
(gdb) t 6
[Switching to thread 6 (Thread 0x7ff427fff700 (LWP 5267))]
#0  runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
135     RET
(gdb) bt
#0  runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:135
#1  0x000000000044f473 in runtime.sighandler (sig=11, info=<optimized out>, ctxt=<optimized out>, gp=0xc000007ba0) at /usr/local/go/src/runtime/signal_unix.go:769
#2  0x000000000044ec11 in runtime.sigtrampgo (sig=11, info=0xc000087bf0, ctx=0xc000087ac0) at /usr/local/go/src/runtime/signal_unix.go:490
#3  0x000000000046aa66 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:352
#4  <signal handler called>
#5  0x000000000049c340 in core () at /container_share/works/badstack/main.go:19
#6  0x00007ff473c52150 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib64/libc.so.6
#7  0x0000000000765f80 in uctx_core ()
#8  0x0000000000440551 in runtime.execute (gp=0x49c459 <_cgo_96a711317223_Cfunc_core_logic+25>, inheritTime=184) at /usr/local/go/src/runtime/proc.go:2884
#9  0x00007ff427ffee10 in ?? ()
#10 0x000000000049c459 in _cgo_96a711317223_Cfunc_core_logic (v=0x100000) at cgo-gcc-prolog:49
Backtrace stopped: frame did not save the PC

Btw, we can try dlv and then gdb since we already know the bad goroutine id from the stdout/err:

  1. dlv: using goroutine id to find out the actual bad thread LWP id(check which thread it is running on);
  2. gdb: now we know the thread LWP id, just jump directly to that thread;

However, this is not so intuitive to me since the crash is caused by cgo code, we may usually use GDB to debug it at the very first.

ps: as I mentioned before, this issue happens on using ucontext in cgo since go1.21.1 go1.21.0.

cherrymui commented 1 year ago

Thanks. The signal on the "correct" thread seems to land at https://cs.opensource.google/go/go/+/refs/tags/go1.21.1:src/runtime/signal_unix.go;l=756-769, which is re-raising the signal when it thinks it is already crashing and another thread is dumping a stack trace.

Maybe that M count at https://cs.opensource.google/go/go/+/refs/tags/go1.21.1:src/runtime/signal_unix.go;l=756 isn't quite right and this is actually the first thread crashing? Could you print in GDB the value of mcount() and *extraMLength? mcount() is sched.mnext - sched.nmfreed. crashing is probably 1 if this is the first thread crashing.

zzkcode commented 1 year ago

@cherrymui Thanks. It seems to be exactly what you are saying.

(GDB could not print out some of them, so I used dlv instead.)

In go1.20.8, it looks like the program panic starts from go1.20.8-signal_unix.go#L675, return, and never runs the rest of the code in sighandler. In go1.21.1, however, it will pass by and run docrash block starts from go1.21.1-signal_unix.go#L754. crashing is increased to 1, mcount() is 8, extraMLength is 1(details pasted as below), so it will indeed raise the signal SIGQUIT in the first thread which caused SIGSEGV?

(dlv) p sig
11
(dlv) n
> runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:756 (PC: 0x44a9ae)
Warning: debugging optimized function
   751:         dumpregs(c)
   752:     }
   753: 
   754:     if docrash {
   755:         crashing++
=> 756:         if crashing < mcount()-int32(extraMLength.Load()) {
   757:             // There are other m's that need to dump their stacks.
   758:             // Relay SIGQUIT to the next m by sending it to the current process.
   759:             // All m's that have already received SIGQUIT have signal masks blocking
   760:             // receipt of any signals, so the SIGQUIT will go to an m that hasn't seen it yet.
   761:             // When the last m receives the SIGQUIT, it will fall through to the call to
(dlv) p crashing
1
(dlv) p sched.mnext
8
(dlv) p sched.nmfreed
0
(dlv) p extraMLength
runtime/internal/atomic.Uint32 {
    noCopy: runtime/internal/atomic.noCopy {},
    value: 1,}

Since in go1.21.1, it just passes by the preparePanic starts from here, I check the condition statements(shown below). (Unfortunately, I can directly print c.info.si_code out even though I did use -gcflags="all=-N -l" to build it.) But I find it also prints out in the stdout in fatalsignal, which shows that the sigcode is 0, and will return true from sigFromUser and make !c.sigFromUser() always false. I don't know whether I debug this correctly, but please let me know whatever useful I can provide.

(dlv) n
> runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:690 (PC: 0x44a794)
Warning: debugging optimized function
   685:     if isAbortPC(c.sigpc()) {
   686:         // On many architectures, the abort function just
   687:         // causes a memory fault. Don't turn that into a panic.
   688:         flags = _SigThrow
   689:     }
=> 690:     if !c.sigFromUser() && flags&_SigPanic != 0 {
   691:         // The signal is going to cause a panic.
   692:         // Arrange the stack so that it looks like the point
   693:         // where the signal occurred made a call to the
   694:         // function sigpanic. Then set the PC to sigpanic.
   695: 
(dlv) p flags
4
(dlv) p _SigPanic
8
(dlv) p c.info.si_code
(unreadable empty OP stack)
(dlv) c
SIGQUIT: quit
PC=0x464881 m=3 sigcode=0

goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:557 +0x21 fp=0x7fbfba4bdbf8 sp=0x7fbfba4bdbf0 pc=0x464881
runtime.futexsleep(0x40b5be?, 0xba4bdc78?, 0x40bc13?)
    /usr/local/go/src/runtime/os_linux.go:69 +0x30 fp=0x7fbfba4bdc58 sp=0x7fbfba4bdbf8 pc=0x432250
runtime.notesleep(0xc000058948)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x9d fp=0x7fbfba4bdc90 sp=0x7fbfba4bdc58 pc=0x40b7dd
runtime.mPark()
    /usr/local/go/src/runtime/proc.go:1632 +0x1e fp=0x7fbfba4bdcb0 sp=0x7fbfba4bdc90 pc=0x43b31e
runtime.stopm()
    /usr/local/go/src/runtime/proc.go:2536 +0x6d fp=0x7fbfba4bdcd8 sp=0x7fbfba4bdcb0 pc=0x43c84d
runtime.findRunnable()
    /usr/local/go/src/runtime/proc.go:3229 +0x30 fp=0x7fbfba4bddf8 sp=0x7fbfba4bdcd8 pc=0x43d3d0
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:3582 +0xbd fp=0x7fbfba4bde30 sp=0x7fbfba4bddf8 pc=0x43eadd
runtime.park_m(0xc000007a00?)
    /usr/local/go/src/runtime/proc.go:3745 +0x105 fp=0x7fbfba4bde60 sp=0x7fbfba4bde30 pc=0x43f045
runtime.mcall()
    /usr/local/go/src/runtime/asm_amd64.s:458 +0x4e fp=0x7fbfba4bde78 sp=0x7fbfba4bde60 pc=0x460b2e
rax    0xca
rbx    0x0
rcx    0x464883
rdx    0x0
rdi    0xc000058948
rsi    0x80
rbp    0x7fbfba4bdc48
rsp    0x7fbfba4bdbf0
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x7fbfba4bdc68
r13    0xffffffffffffffff
r14    0xc000006d00
r15    0x2031
rip    0x464881
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
mknyszek commented 1 year ago

In triage, we're thinking that if this is fairly easy to reproduce, how hard would it be to bisect this across the Go toolchain? That might get down to the exact reason.

In the same vein, does Go 1.19 or Go 1.20 work?

cherrymui commented 1 year ago

@zzkcode Thanks. As the fault is from C code, I think it is desirable to not inject a sigpanic and return from the signal handler (like Go 1.20 did). This is intended to help debugging, because debuggers are not always able to unwind the synthetic sigpanic frame (on all platforms, although it sometimes does like the Go 1.18 example above). By not injecting the sigpanic frame debugger can directly unwind from the signal handler to C code.

I assume you set GOTRACEBACK=crash in order to trigger a core file generation, therefore docrash is true. In this case we relay SIGQUIT to all threads so they dump stack traces and registers. But that also means that all threads will receive a signal, not just the first faulting thread. I guess the core file is generated when a thread receives a signal the second time, as this time the signal is blocked and it is a fatal signal, so the kernel just kills it and generates a core. The GDB stack traces include two signal frames supports this. But it also means that it may be not the original faulting thread that dies first. In fact it may be the last one that receives the signal, as the condition crashing < mcount()-int32(extraMLength.Load()) is false for the last one, so it directly goes to crash. Unfortunately you'll need to loop over all threads to find the faulting one in GDB...

Maybe we could let the first faulting thread sleeps shorter. Maybe we could have the last thread signals the first one before crashing. So the first faulting thread will be the one crashes.

I wonder, though, why this has anything to do with ucontext...

zzkcode commented 1 year ago

Hi all, this issue should have nothing to do with using ucontext in cgo but a general problem for not dumping correct stacks while crashing in cgo, maybe I do the wrong test on the wrong go version before. For the release version, I missed testing it on go1.21.0, so I did some tests again, and it turns out that this issue starts from go1.21.0. Sorry for misleading you. FYI @cherrymui @mknyszek @bcmills @prattmic @ianlancetaylor

@zzkcode Thanks. As the fault is from C code, I think it is desirable to not inject a sigpanic and return from the signal handler (like Go 1.20 did). This is intended to help debugging, because debuggers are not always able to unwind the synthetic sigpanic frame (on all platforms, although it sometimes does like the Go 1.18 example above). By not injecting the sigpanic frame debugger can directly unwind from the signal handler to C code.

@cherrymui Thanks. If I don't get this wrong, in Go1.20, it will inject a sigpanic (my debug also proves this), while in Go1.21, it will not. In Go1.21, the flags is assigned to _SigThrow in go1.21.0-signal_unix.go#L683, so it will just pass by go1.21.0-signal_unix.go#L690, and finally call run into docrash block.

I assume you set GOTRACEBACK=crash in order to trigger a core file generation, therefore docrash is true. In this case we relay SIGQUIT to all threads so they dump stack traces and registers. But that also means that all threads will receive a signal, not just the first faulting thread. I guess the core file is generated when a thread receives a signal the second time, as this time the signal is blocked and it is a fatal signal, so the kernel just kills it and generates a core. The GDB stack traces include two signal frames supports this. But it also means that it may be not the original faulting thread that dies first. In fact it may be the last one that receives the signal, as the condition crashing < mcount()-int32(extraMLength.Load()) is false for the last one, so it directly goes to crash. Unfortunately you'll need to loop over all threads to find the faulting one in GDB...

Thanks for your explanation, now I get why the crash thread always not the number one thread in GDB.

Maybe we could let the first faulting thread sleeps shorter. Maybe we could have the last thread signals the first one before crashing. So the first faulting thread will be the one crashes.

It seems to me that without sigpanic now all crashes in cgo may have to find out the real crash thread by themselves since go1.21.0. Do I miss something here? Looks like there is no one has reported this yet?

I wonder, though, why this has anything to do with ucontext...

Sorry for the misleading, this seems not related to ucontext .

flags is set to _SigPanic in go1.21.0:

(dlv) b src/runtime/signal_unix.go:676
Breakpoint 1 set at 0x44a6b0 for runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:676
(dlv) b src/runtime/signal_unix.go:685
Breakpoint 2 set at 0x44a721 for runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:685
//...
(dlv) c
> runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:676 (hits goroutine(1):1 total:2) (PC: 0x44a6b0)
Warning: debugging optimized function
   671: 
   672:     flags := int32(_SigThrow)
   673:     if sig < uint32(len(sigtable)) {
   674:         flags = sigtable[sig].flags
   675:     }
=> 676:     if !c.sigFromUser() && flags&_SigPanic != 0 && (gp.throwsplit || gp != mp.curg) {
   677:         // We can't safely sigpanic because it may grow the
   678:         // stack. Abort in the signal handler instead.
   679:         //
   680:         // Also don't inject a sigpanic if we are not on a
   681:         // user G stack. Either we're in the runtime, or we're
(dlv) p flags
136
(dlv) c
> runtime.sighandler() /usr/local/go/src/runtime/signal_unix.go:685 (hits goroutine(1):1 total:2) (PC: 0x44a721)
Warning: debugging optimized function
   680:         // Also don't inject a sigpanic if we are not on a
   681:         // user G stack. Either we're in the runtime, or we're
   682:         // running C code. Either way we cannot recover.
   683:         flags = _SigThrow
   684:     }
=> 685:     if isAbortPC(c.sigpc()) {
   686:         // On many architectures, the abort function just
   687:         // causes a memory fault. Don't turn that into a panic.
   688:         flags = _SigThrow
   689:     }
   690:     if !c.sigFromUser() && flags&_SigPanic != 0 {
(dlv) p flags
4

Please let me know if I can provide more on this.

Btw, is there a workaround(seems like not?) or will it be considered a fix? The thing is that we have a program using ucontext which has an issue and should be fixed on https://github.com/golang/go/issues/62130, which required go1.21.0. Thanks.

zzkcode commented 1 year ago

Hi @cherrymui, by making the faulting thread sleep shorter, it would be able to generate a core dump with the faulting thread as the number one thread. I have verified this both in the simple reproducer and my application. Code changes are pasted below, and if you don't mind, I would love to fire a PR to fix this and backport it to the next minor release if possible. FYI @mknyszek @prattmic

This code changes may need more to do:

Code change example, should be refined to make it crash as soon as possible:

diff --git a/src/runtime/signal_unix.go b/src/runtime/signal_unix.go
index cd9fd5d796..09aa95c2dc 100644
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -752,6 +752,12 @@ func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
        }

        if docrash {
+               var sleepTime uint32
+               sleepTime = 5 * 1000 * 1000
+               if crashing == 0 {
+                       sleepTime = 3 * 1000 * 1000
+               }
+
                crashing++
                if crashing < mcount()-int32(extraMLength.Load()) {
                        // There are other m's that need to dump their stacks.
@@ -766,8 +772,8 @@ func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
                        // 5-second sleeps have finished.
                        print("\n-----\n\n")
                        raiseproc(_SIGQUIT)
-                       usleep(5 * 1000 * 1000)
                }
+               usleep(sleepTime)
                printDebugLog()
                crash()
        }
gopherbot commented 1 year ago

Change https://go.dev/cl/536895 mentions this issue: runtime: let the fault thread to crash the process

zzkcode commented 1 year ago

Change https://go.dev/cl/536895 mentions this issue: runtime: let the fault thread to crash the process

Hi @cherrymui and all, I believe this is a general issue while using cgo since go1.21.0, and I submitted a pr to fix this(just a minor change), and please consider taking a look if possible. Thanks!

cherrymui commented 1 year ago

@zzkcode Thanks for the CL, I'll take a look. Question: with the change, will a crashing process always take at least 3 seconds to exit?

zzkcode commented 1 year ago

@zzkcode Thanks for the CL, I'll take a look. Question: with the change, will a crashing process always take at least 3 seconds to exit?

Hi @cherrymui, it will not. With the change, the first thread which cause crash will always sleep for every 500ms, and check if all m receive the SIGQUIT, then crash(if yes). So it depends on how long it takes to let all m receive the SIGQUIT, I believe this has nothing big change than before since it used to let the last m crash the process.

So in normal case, it basically takes no more than 500ms to crash, maybe we can sleep like every 200ms if you have any concerns? Thanks.

zzkcode commented 11 months ago

Hi @cherrymui and all, thanks! Since this fix pr is ready for merging, may we consider about backporting it to the next minor release of go1.21, considering:

Or it will be release in the next minor/major release? I don't know Go team's practice on this, will you just backport the real critical issue fix? Thanks :)

cherrymui commented 11 months ago

@zzkcode thanks for the CL. I think it is reasonable to include this CL in Go 1.22 as a bug fix. The risk is very low, as it only affects programs that are already crashing.

I don't think this needs to be backported. We usually just backport fixes for critical issues with no workaround. This is not critical (the program is crashing either way) and has workarounds (e.g. check all threads in GDB).

Thanks.

shuxiao9058 commented 9 months ago

The same issue.

miles-byted commented 2 months ago

@zzkcode @cherrymui Unfortunately, arm64 remains the problem:

$ gdb -nx -batch -ex bt cgo-crash /var/core/cgo-crash.1724899484.2105538.core
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `./cgo-crash'.
Program terminated with signal SIGABRT, Aborted.
#0  runtime.raise () at /opt/xxx/go/src/runtime/sys_linux_arm64.s:158
158     /opt/xxx/go/src/runtime/sys_linux_arm64.s: No such file or directory.
[Current thread is 1 (Thread 0x7f5b7fe1d0 (LWP 2105542))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/admin/cgo-crash.
Use `info auto-load python-scripts [REGEXP]' to list them.
#0  runtime.raise () at /opt/xxx/go/src/runtime/sys_linux_arm64.s:158
#1  0x000000000044e884 in runtime.dieFromSignal (sig=6) at /opt/xxx/go/src/runtime/signal_unix.go:923
#2  0x000000000044ef30 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>, ~r0=<optimized out>) at /opt/xxx/go/src/runtime/signal_unix.go:1128
#3  0x000000000044d53c in runtime.sigtrampgo (sig=0, info=0x2020c6, ctx=0x6) at /opt/xxx/go/src/runtime/signal_unix.go:432
#4  0x0000000000469a54 in runtime.sigtramp () at /opt/xxx/go/src/runtime/sys_linux_arm64.s:462

I use the same test case in https://github.com/golang/go/commit/de5b418bea70aaf27de1f47e9b5813940d1e15a4

miles-byted commented 2 months ago

@cherrymui It seems that https://github.com/golang/go/commit/a0c9d153e0c177677701b8a4e6e5eba5a6c44a4f this commit caused it.

I can get at least the SIGSEGV point function if I revert the commit:

admin@lambda:~$ gdb -nx -batch -ex bt cgo-crash /var/core/cgo-crash.1724910818.2258033.core
Program terminated with signal SIGABRT, Aborted.
#0  runtime.raise () at /opt/xxx/go/src/runtime/sys_linux_arm64.s:158
158     /opt/xxx/go/src/runtime/sys_linux_arm64.s: No such file or directory.
[Current thread is 1 (Thread 0x7fb4998010 (LWP 2258033))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/admin/cgo-crash.
Use `info auto-load python-scripts [REGEXP]' to list them.
#0  runtime.raise () at /opt/xxx/go/src/runtime/sys_linux_arm64.s:158
#1  0x000000000044cee8 in runtime.dieFromSignal (sig=6) at /opt/xxx/go/src/runtime/signal_unix.go:937
#2  0x00000000004352bc in runtime.crash () at /opt/xxx/go/src/runtime/signal_unix.go:1006
#3  runtime.fatalthrow.func1 () at /opt/xxx/go/src/runtime/panic.go:1203
#4  0x0000000000435238 in runtime.fatalthrow (t=<optimized out>) at /opt/xxx/go/src/runtime/panic.go:1192
#5  0x0000000000434e30 in runtime.throw (s=...) at /opt/xxx/go/src/runtime/panic.go:1023
#6  0x000000000044ce6c in runtime.sigpanic () at /opt/xxx/go/src/runtime/signal_unix.go:866
#7  0x000000000048d9f4 in test () at test.c:6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

It exactly shows test () at test.c:6.

test.c:

#include <stdio.h>
#include <stddef.h>
void test()
{
    int* ptr = NULL;
    *ptr = 1024;
}
void trigger_crash()
{
    printf("hello world\n");
    test();
}

test.h:

#ifndef FDDE6B57_4166_4D0B_9BED_C9BF03D209B8
#define FDDE6B57_4166_4D0B_9BED_C9BF03D209B8

void trigger_crash();

#endif /* FDDE6B57_4166_4D0B_9BED_C9BF03D209B8 */

main.go:

package main

/*
#include <test.h>
*/
import "C"
import (
    "fmt"
    "os"
    "os/signal"
    "runtime/debug"
    "syscall"
)

func enableCore() {
    debug.SetTraceback("crash")

    var lim syscall.Rlimit
    err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim)
    if err != nil {
        panic(fmt.Sprintf("error getting rlimit: %v", err))
    }
    lim.Cur = lim.Max
    fmt.Fprintf(os.Stderr, "Setting RLIMIT_CORE = %+#v\n", lim)
    err = syscall.Setrlimit(syscall.RLIMIT_CORE, &lim)
    if err != nil {
        panic(fmt.Sprintf("error setting rlimit: %v", err))
    }
    signal.Ignore(syscall.SIGABRT)
}

func main() {
    enableCore()
    C.trigger_crash()
}
zzkcode commented 2 months ago

Hi @miles-byted ,

It seems more like a stack-unwind issue to me. I try the test case code on amd64 and arm64, dlv show correct backtraces in both archs, but gdb does not on arm64. All tests using go1.23.0.

I'm not familiar with arm64 and stack-unwind, just a guess. And maybe you could create a new issue for this?

Details

arm64 threads and backtraces:

(gdb) i threads
  Id   Target Id                         Frame
* 1    Thread 0xffffa6cbf010 (LWP 59900) runtime.raise ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:158
  2    Thread 0xffff6004f1c0 (LWP 59901) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:138
  3    Thread 0xffff5f80e1c0 (LWP 59902) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:138
  4    Thread 0xffff5f00d1c0 (LWP 59903) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:138
  5    Thread 0xffff5e7cc1c0 (LWP 59904) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:138
  6    Thread 0xffff5dfcb1c0 (LWP 59905) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:138

(gdb) bt
#0  runtime.raise () at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:158
#1  0x000000000044cde4 in runtime.dieFromSignal (sig=6)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
#2  0x000000000044d3d0 in runtime.sigfwdgo (sig=6, info=<optimized out>,
    ctx=<optimized out>, ~r0=<optimized out>)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1154
#3  0x000000000044c00c in runtime.sigtrampgo (sig=0, info=0xe9fc, ctx=0x6)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:432
#4  0x0000000000470884 in runtime.sigtramp ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:462

amd64 threads and backtrace:

(gdb) i threads
  Id   Target Id                        Frame
* 1    Thread 0x7f1fde32a740 (LWP 4132) 0x000000000046f886 in runtime.sigtramp ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:352
  2    Thread 0x7f1f95800700 (LWP 4136) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:135
  3    Thread 0x7f1f96c00700 (LWP 4134) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:135
  4    Thread 0x7f1f94e00700 (LWP 4137) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:135
  5    Thread 0x7f1f97600700 (LWP 4133) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:135
  6    Thread 0x7f1f96200700 (LWP 4135) runtime.usleep ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:135

(gdb) bt
#0  runtime.raise () at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:154
#1  0x000000000044ad85 in runtime.dieFromSignal (sig=6)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
#2  0x000000000044b3e6 in runtime.sigfwdgo (sig=6, info=<optimized out>, ctx=<optimized out>,
    ~r0=<optimized out>) at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1154
#3  0x0000000000449d85 in runtime.sigtrampgo (sig=0, info=0x0, ctx=0x46f5a1 <runtime.raise+33>)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:432
#4  0x000000000046f886 in runtime.sigtramp ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:352
#5  <signal handler called>
#6  runtime.raise () at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:154
#7  0x000000000044ad85 in runtime.dieFromSignal (sig=6)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
#8  0x000000000044a906 in runtime.crash () at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1031
#9  runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>,
    gp=<optimized out>) at /root/sdk/go1.23.0/src/runtime/signal_unix.go:806
#10 0x0000000000449e86 in runtime.sigtrampgo (sig=11, info=0xc00000fbf0, ctx=0xc00000fac0)
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:490
#11 0x000000000046f886 in runtime.sigtramp ()
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:352
#12 <signal handler called>
#13 0x000000000049a9d1 in trigger_crash ()
    at /container_share/cs-practices/gogo/go_cgo_crash_new/main.go:9
#14 0x000000000049a9fc in _cgo_517cb600b495_Cfunc_trigger_crash (v=0xc000060f20)
    at cgo-gcc-prolog:51
#15 0x000000000046da44 in runtime.asmcgocall () at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:923
#16 0x000000c0000061c0 in ?? ()
#17 0x000000000046be0a in runtime.systemstack () at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:514
#18 0x00007fffccafb0f8 in ?? ()
#19 0x000000000047041f in runtime.newproc (fn=0x46bc8f <runtime.rt0_go+303>) at <autogenerated>:1
#20 0x000000000046bd05 in runtime.mstart () at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:395
#21 0x000000000046bc8f in runtime.rt0_go () at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:358
#22 0x0000000000000001 in ?? ()
#23 0x00007fffccafb228 in ?? ()
#24 0x00007fffccafb220 in ?? ()
#25 0x0000000000000001 in ?? ()
#26 0x00007fffccafb228 in ?? ()
#27 0x00007f1fde351083 in __libc_start_main (main=0x46bb40 <main>, argc=1, argv=0x7fffccafb228,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffccafb218)
    at ../csu/libc-start.c:308
#28 0x000000000040231e in _start ()

dlv on amd64:

(dlv) bt
 0  0x000000000046f5a1 in runtime.raise
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:154
 1  0x000000000044ad85 in runtime.dieFromSignal
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
 2  0x000000000044b3e6 in runtime.sigfwdgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1154
 3  0x0000000000449d85 in runtime.sigtrampgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:432
 4  0x000000000046f5a1 in runtime.raise
    at /root/sdk/go1.23.0/src/runtime/sys_linux_amd64.s:153
 5  0x000000000044ad85 in runtime.dieFromSignal
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
 6  0x000000000044a906 in runtime.crash
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1031
 7  0x000000000044a906 in runtime.sighandler
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:806
 8  0x0000000000449e86 in runtime.sigtrampgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:490
 9  0x000000000049a9d1 in C.trigger_crash
    at ./main.go:9
10  0x000000000049a9fc in C._cgo_517cb600b495_Cfunc_trigger_crash
    at /tmp/go-build/cgo-gcc-prolog:51
11  0x000000000046da44 in runtime.asmcgocall
    at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:923
12  0x0000000000000000 in ???
    at ?:-1
13  0x0000000000462575 in runtime.cgocall
    at /root/sdk/go1.23.0/src/runtime/cgocall.go:185
14  0x000000000049a75f in main._Cfunc_trigger_crash
    at _cgo_gotypes.go:43
15  0x000000000049a957 in main.main
    at ./main.go:38
16  0x000000000043710b in runtime.main
    at /root/sdk/go1.23.0/src/runtime/proc.go:272
17  0x000000000046ddc1 in runtime.goexit
    at /root/sdk/go1.23.0/src/runtime/asm_amd64.s:1700

dlv on arm64:

(dlv) bt
 0  0x00000000004704d8 in runtime.raise                                               at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:158
 1  0x000000000044cde4 in runtime.dieFromSignal
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
 2  0x000000000044d3d0 in runtime.sigfwdgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1154
 3  0x000000000044c00c in runtime.sigtrampgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:432
 4  0x00000000004704d8 in runtime.raise
    at /root/sdk/go1.23.0/src/runtime/sys_linux_arm64.s:157
 5  0x000000000044cde4 in runtime.dieFromSignal
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:942
 6  0x000000000044cf84 in runtime.crash
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:1031
 7  0x000000000044c9dc in runtime.sighandler
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:806
 8  0x000000000044c0d0 in runtime.sigtrampgo
    at /root/sdk/go1.23.0/src/runtime/signal_unix.go:490
 9  0x00000000004b272c in C.trigger_crash
    at ./main.go:9
10  0x00000000004b2754 in C._cgo_517cb600b495_Cfunc_trigger_crash
    at /tmp/go-build/cgo-gcc-prolog:51
11  0x000000000046f4cc in runtime.asmcgocall
    at /root/sdk/go1.23.0/src/runtime/asm_arm64.s:1000
12  0x00000000004641c8 in runtime.cgocall
    at /root/sdk/go1.23.0/src/runtime/cgocall.go:185
13  0x00000000004b23b0 in main._Cfunc_trigger_crash
    at _cgo_gotypes.go:43
14  0x00000000004b26d4 in main.main
    at ./main.go:38
15  0x0000000000439064 in runtime.main
    at /root/sdk/go1.23.0/src/runtime/proc.go:272
16  0x000000000046f6d4 in runtime.goexit
    at /root/sdk/go1.23.0/src/runtime/asm_arm64.s:1223