golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: process crash instead of panic on SIGBUS with SetPanicOnFault(true) #41155

Open florisch opened 4 years ago

florisch commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.15 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=arm
set GOBIN=
set GOCACHE=C:\Users\Florian\AppData\Local\go-build
set GOENV=C:\Users\Florian\AppData\Roaming\go\env
set GOEXE=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\Florian\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=linux
set GOPATH=C:\Users\Florian\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set GOARM=7
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-fPIC -marm -fmessage-length=0 -fdebug-prefix-map=C:\Users\Florian\AppData\Local\Temp\go-build326602894=/tmp/go-build -gno-record-gcc-switches
GOROOT/bin/go version: go version go1.15 windows/amd64
GOROOT/bin/go tool compile -V: compile version go1.15
gdb --version: GNU gdb (GDB) 8.1

What did you do?

We are using Go for some embedded development (cross-compiled to linux arm32). We access various FPGA registers from the Go process by mmapping /dev/mem at the address range of those registers.

When we access registers that are not defined or accessible in the FPGA, the process crashes with the error reported below.

We use defer debug.SetPanicOnFault(debug.SetPanicOnFault(true)) in the call stack that performs the register read, as we expect this to make the runtime panic instead of crashing on this kind of memory fault.

What did you expect to see?

A panic where the bad access happened. This way, with a recover call, it would be possible to handle the case where some registers are not available.
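The expected behaviour can be sketched as follows. This is a minimal illustration, not the code from the affected project: the helpers readByte and mapNoAccess are hypothetical names, and an anonymous PROT_NONE mapping stands in for an inaccessible register window (it raises SIGSEGV rather than SIGBUS, and assumes a Linux-like system).

```go
package main

import (
	"fmt"
	"runtime/debug"
	"syscall"
)

// readByte (hypothetical helper) reads mem[off] and turns a memory fault
// into an ordinary error instead of letting it kill the process.
func readByte(mem []byte, off int) (v byte, err error) {
	// Enable panic-on-fault for this goroutine; the deferred call restores
	// the previous setting when the function returns.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("fault while reading offset %#x: %v", off, r)
		}
	}()
	return mem[off], nil
}

// mapNoAccess (hypothetical helper) maps one anonymous page with no access
// rights, as a stand-in for a region that faults when touched.
func mapNoAccess() ([]byte, error) {
	return syscall.Mmap(-1, 0, 4096, syscall.PROT_NONE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
}

func main() {
	mem, err := mapNoAccess()
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	defer syscall.Munmap(mem)

	_, err = readByte(mem, 0x10)
	fmt.Println("read result:", err)
}
```

With this pattern the caller can treat a missing register as a normal error, provided the runtime converts the fault into a panic in the first place.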

What did you see instead?

The process crashes, in an unrecoverable way, with the following output:

Unhandled fault: external abort on non-linefetch (0x018) at 0x26b48010
pgd = 5e090000
[26b48010] *pgd=1e234831, *pte=40040703, *ppte=40040e33
runner.sh: SIGBUS: bus error
runner.sh: PC=0x2a8ff0 m=0 sigcode=0
runner.sh: goroutine 43 [running]:
runner.sh: gobv1/pkg/hw/pmem.Access.ReadUint32(...)
runner.sh:      C:/projects/ellisys/bv1go/pkg/hw/pmem/memAccess_linux.go:122
runner.sh: gobv1/pkg/hw/pmem.(*Access).ReadUint32(0x925200, 0x10, 0x28e594)
runner.sh:      <autogenerated>:1 +0x44 fp=0x8a6acc sp=0x8a6aa4 pc=0x2a8ff0
...
runner.sh: main.(*command).initializeDevice(0x9222c0, 0x922b80)
runner.sh:      C:/projects/ellisys/bv1go/cmd/gobv1/main.go:154 +0x94 fp=0x8a6fe4 sp=0x8a6fa0 pc=0x371320
runner.sh: runtime.goexit()
...
runner.sh: goroutine 20 [select]:
runner.sh: io.(*pipe).Read(0x922280, 0x84c000, 0x1000, 0x1000, 0x3b50e8, 0x1136b0, 0x84c000)
runner.sh:      C:/Go/src/io/pipe.go:57 +0xac
...
runner.sh: goroutine 42 [runnable]:
...
runner.sh: trap    0x0
runner.sh: error   0x18
runner.sh: oldmask 0x0
runner.sh: r0      0x26b48000
runner.sh: r1      0x3c
runner.sh: r2      0x8a6acc
runner.sh: r3      0x10
runner.sh: r4      0x1
runner.sh: r5      0x1
runner.sh: r6      0xf1
runner.sh: r7      0x26ccc521
runner.sh: r8      0x925200
runner.sh: r9      0x20
runner.sh: r10     0x883500
runner.sh: fp      0x7
runner.sh: ip      0x925203
runner.sh: sp      0x8a6aa4
runner.sh: lr      0x2a8fdc
runner.sh: pc      0x2a8ff0
runner.sh: cpsr    0x80000010
runner.sh: fault   0x0
runner.sh: Program instance execution terminated

Workaround

I built a custom runtime with this commit, which makes the call panic as expected.

tpaschalis commented 4 years ago

I'm not sure how to replicate this failure, but I'd like to give this a shot.

Do we think that the posted workaround is something that could also be a long-term solution?

ianlancetaylor commented 4 years ago

Please avoid looking at the workaround (and, everyone, please avoid posting patches through the issue tracker). We want patches to only come in as Gerrit code reviews or GitHub pull requests, because then we have automation that confirms that the copyright assignments are in order. Thanks.

To put it another way, I can't answer your question about the posted workaround because I'm not going to look at it. Sorry.

I think you might be able to write a test that gets a SIGBUS by using mmap to map memory as read-only and then trying to write to it. I'm not really sure, though.
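A rough sketch of that idea, assuming a Linux-like system and using a hypothetical helper named writeReadOnly. On Linux a write to a read-only anonymous mapping is normally reported as SIGSEGV, so this may not produce a SIGBUS at all.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"syscall"
)

// writeReadOnly (hypothetical helper) maps one anonymous read-only page,
// writes to it, and returns whatever value recover() produced
// (nil if the write did not fault).
func writeReadOnly() (recovered interface{}, err error) {
	mem, err := syscall.Mmap(-1, 0, 4096, syscall.PROT_READ,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(mem)

	// Ask the runtime to panic on faults at non-nil addresses.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))
	defer func() { recovered = recover() }()

	mem[0] = 1 // faults: the page is not writable
	return nil, nil
}

func main() {
	r, err := writeReadOnly()
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	fmt.Println("recovered:", r)
}
```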

tpaschalis commented 4 years ago

Thanks for the pointers, I'll try to get a repro done, and then see how the issue can be fixed!

florisch commented 4 years ago

Thank you for looking into this. I thought I should open a ticket for discussion before creating a PR. Sorry if I didn't respect the rules by adding a link to my workaround commit in the ticket.

If desired, I would be happy to contribute a fix for this issue and open a PR. For now, I try to find a way to write a test which could be integrated with the regular test suite to reproduce this issue without our embedded FPGA platform.

I created a test doing what @ianlancetaylor suggested. Doing this doesn't reproduce the issue. It results in the expected panic: runtime error: invalid memory address or nil pointer dereference (both on a linux desktop and on our embedded platform).

networkimprov commented 4 years ago

More ideas: https://stackoverflow.com/questions/2089167/debugging-sigbus-on-x86-linux

tpaschalis commented 4 years ago

For now, I try to find a way to write a test which could be integrated with the regular test suite to reproduce this issue without our embedded FPGA platform.

This would be a good first step; I hope I can assist in that as well. (And also, thanks for having a positive attitude to getting to the bottom of this!)

The following code uses CGO and triggers a SIGBUS. I tried it on darwin and linux, but could not get the same error. This happens both with and without the debug.SetPanicOnFault(debug.SetPanicOnFault(true)) line. On the other hand CGO is a different beast, and maybe that's why the same error does not appear.

Code : https://play.golang.org/p/vWdhf2mtuEq
Output :

fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x2 addr=0x7ff893c38000 pc=0x46ca3d]

runtime stack:
runtime.throw(0x48bc4c, 0x2a)
    /usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:704 +0x4ac

EDIT: Here's the same using syscall.Mmap instead of CGO. Code : https://play.golang.org/p/aYAsrUND0i_D
Output with debug.SetPanicOnFault

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGBUS: bus error code=0x2 addr=0x7f3ac1f2c000 pc=0x46ceac]

Output without debug.SetPanicOnFault

unexpected fault address 0x7fe33aac3000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fe33aac3000 pc=0x494db6]
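The playground programs are not reproduced in this thread. One common way to get a SIGBUS with code=0x2 (BUS_ADRERR) on Linux is to map a file and touch a page beyond its end. The sketch below assumes that approach, which may differ from the linked code; readPastEOF and sink are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"syscall"
)

// sink keeps the compiler from discarding the faulting load.
var sink byte

// readPastEOF (hypothetical helper) maps one page of an empty temporary
// file and reads from it. The mapping succeeds, but no file data backs
// the page, so the access is expected to raise SIGBUS on Linux.
func readPastEOF() (recovered interface{}, err error) {
	f, err := os.CreateTemp("", "sigbus-*")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	mem, err := syscall.Mmap(int(f.Fd()), 0, 4096,
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(mem)

	// Without this line the runtime reports "fatal error: fault" instead.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))
	defer func() { recovered = recover() }()

	sink = mem[0] // access beyond the end of the file
	return nil, nil
}

func main() {
	r, err := readPastEOF()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("recovered:", r)
}
```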

florisch commented 4 years ago

I tried the code using mmap on our embedded platform, and see the same behavior. Then I modified the runtime to print the flags and the sigcode when a SIGBUS is received.

Output of SIGBUS generated by sample from previous comment

SIGBUS flags=0x0x88 sigcode=0x2

Output with SIGBUS generated by a bad register access

SIGBUS flags=0x0x88 sigcode=0x0

Since sigcode 0 matches _SI_USER, the runtime treats the SIGBUS from our bad register access as a user-sent signal rather than as a fault, so it is not converted into a panic. The SIGBUS generated by the code from the previous comment carries sigcode 2 and is handled properly.
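The distinction can be illustrated with a simplified model. This is not the runtime's actual code: fromUser and describe are hypothetical names, and the three outcomes only summarise what was observed in this thread. The constants are the Linux si_code values.

```go
package main

import "fmt"

// Linux si_code values that matter for SIGBUS.
const (
	siUser    = 0 // SI_USER: signal sent explicitly, e.g. by kill(2)
	busAdrAln = 1 // BUS_ADRALN: invalid address alignment
	busAdrErr = 2 // BUS_ADRERR: nonexistent physical address
	busObjErr = 3 // BUS_OBJERR: object-specific hardware error
)

// fromUser (hypothetical) reports whether the si_code says the signal was
// sent by a process rather than raised by a faulting instruction.
func fromUser(sigcode int32) bool {
	return sigcode == siUser
}

// describe (hypothetical) summarises the behaviour seen in this thread
// for a SIGBUS with the given si_code.
func describe(sigcode int32, panicOnFault bool) string {
	switch {
	case fromUser(sigcode):
		return "treated as user-sent: process exits with a SIGBUS report"
	case panicOnFault:
		return "treated as a fault: converted to a recoverable panic"
	default:
		return "treated as a fault: fatal error"
	}
}

func main() {
	fmt.Println("sigcode=0x2:", describe(busAdrErr, true))
	fmt.Println("sigcode=0x0:", describe(siUser, true))
}
```

In this model SetPanicOnFault has no effect once the signal is classified as user-sent, which matches the crash seen with sigcode=0x0.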

florisch commented 4 years ago

Here is a minimal program that reproduces the issue on armv7. The same code does not reproduce the issue on amd64, as mmap simply refuses to map bad addresses.

Code : https://play.golang.org/p/Zbi9pBZ3rKu
Output :

SIGBUS flags=0x0x88 sigcode=0x 0x0
SIGBUS: bus error
PC=0xa2728 m=0 sigcode=0

goroutine 1 [running]:
main.main()
        gobv1/tools/crash/main.go:37 +0x240 fp=0x4227b8 sp=0x422740 pc=0xa2728
runtime.main()
        runtime/proc.go:205 +0x208 fp=0x4227e4 sp=0x4227b8 pc=0x427f8
runtime.goexit()
        runtime/asm_arm.s:857 +0x4 fp=0x4227e4 sp=0x4227e4 pc=0x6d8f0

trap    0x0
error   0x1818
oldmask 0x0
r0      0x0
r1      0x1000
r2      0x26c2c000
r3      0x0
r4      0x4
r5      0x0
r6      0x26c2cfff
r7      0x0
r8      0x7
r9      0x1
r10     0x4000e0
fp      0x14d078
ip      0xd
sp      0x422740
lr      0x119f0
pc      0xa2728
cpsr    0x20000010
fault   0x0

odeke-em commented 3 years ago

Punting to Go1.17, thank you all for the patience, and for the discussion, please keep it going.

ianlancetaylor commented 3 years ago

I don't understand why the kernel would send a signal with si_code set to SI_USER. That seems like a kernel bug. The SI_USER code is supposed to indicate an explicit use of the kill system call. I don't mind working around a kernel bug but we don't want to treat all SIGBUS signals with si_code == SI_USER as indicating an actual bus error.

shakefu commented 7 months ago

Since this has been around forever, I'd just like to add that you can reliably trigger a crashing SIGBUS, even when trying to recover the panic, by writing to PROT_READ mmap'd memory. I can trigger it 100% of the time using gommap on Darwin arm64. Not sure if that helps with debugging and finding a handler.

dominikh commented 6 months ago

For linux/arm64, https://github.com/torvalds/linux/commit/526c3ddb6aa270fe6f71d227eac1e189746cc257 and https://github.com/torvalds/linux/commit/af40ff687bc9d351030685fde2f57ba45ab4fc14 come to mind (which suggests to me that this used to be a problem that got fixed). I can't speak to linux/arm32 or darwin/arm64. I cannot reproduce the problem on linux/amd64.