Open florisch opened 4 years ago
I'm not sure how to replicate this failure, but I'd like to give this a shot.
Do we think that the posted workaround is something that could also be a long-term solution?
Please avoid looking at the workaround (and, everyone, please avoid posting patches through the issue tracker). We want patches to only come in as Gerrit code reviews or GitHub pull requests, because then we have automation that confirms that the copyright assignments are in order. Thanks.
To put it another way, I can't answer your question about the posted workaround because I'm not going to look at it. Sorry.
I think you might be able to write a test that gets a SIGBUS by using mmap to map memory as read-only and then trying to write to it. I'm not really sure, though.
Thanks for the pointers, I'll try to get a repro done, and then see how the issue can be fixed!
Thank you for looking into this. I thought I should open a ticket for discussion before creating a PR. Sorry if I didn't respect the rules by adding a link to my workaround commit in the ticket.
If desired, I would be happy to contribute a fix for this issue and make a PR. For now, I am trying to find a way to write a test that could be integrated with the regular test suite to reproduce this issue without our embedded FPGA platform.
I created a test doing what @ianlancetaylor suggested. Doing this doesn't reproduce the issue. It results in the expected panic: runtime error: invalid memory address or nil pointer dereference (both on a linux desktop and on our embedded platform).
For now, I am trying to find a way to write a test that could be integrated with the regular test suite to reproduce this issue without our embedded FPGA platform.
This would be a good first step; I hope I can assist in that as well. (And also, thanks for having a positive attitude to getting to the bottom of this!)
The following code uses CGO and triggers a SIGBUS. I tried it on darwin and linux, but could not get the same error. This happens both with and without the debug.SetPanicOnFault(debug.SetPanicOnFault(true)) line. On the other hand, CGO is a different beast, and maybe that's why the same error does not appear.
Code : https://play.golang.org/p/vWdhf2mtuEq
Output :
fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x2 addr=0x7ff893c38000 pc=0x46ca3d]
runtime stack:
runtime.throw(0x48bc4c, 0x2a)
/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:704 +0x4ac
EDIT: Here's the same using syscall.Mmap instead of CGO.
Code : https://play.golang.org/p/aYAsrUND0i_D
Output with debug.SetPanicOnFault
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGBUS: bus error code=0x2 addr=0x7f3ac1f2c000 pc=0x46ceac]
Output without debug.SetPanicOnFault
unexpected fault address 0x7fe33aac3000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7fe33aac3000 pc=0x494db6]
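For readers who cannot open the playground links, here is a minimal sketch of one way to get a SIGBUS with code=0x2 using syscall.Mmap. It may well differ from the linked code. It assumes Linux, where touching a mapped page that lies beyond the end of the backing file is expected to raise SIGBUS; the function name readPastEOF and the temp file prefix are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"syscall"
)

// readPastEOF (hypothetical name) maps one page of an empty temporary file
// and reads from it. Assumption: on Linux the read faults with SIGBUS
// because the page has no backing data.
func readPastEOF() (b byte, err error) {
	f, err := os.CreateTemp("", "sigbus-repro")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// The file has length 0, so the whole mapped page lies past EOF.
	data, err := syscall.Mmap(int(f.Fd()), 0, os.Getpagesize(),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return 0, err
	}
	defer syscall.Munmap(data)

	// Ask the runtime to panic on the fault instead of crashing, and
	// restore the previous setting when this function returns.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from fault: %v", r)
		}
	}()

	return data[0], nil // expected to fault here
}

func main() {
	_, err := readPastEOF()
	fmt.Println("err:", err)
}
```

Removing the SetPanicOnFault line should turn the recoverable panic into the fatal "fault" crash shown in the second output above.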
I tried the code using mmap on our embedded platform, and see the same behavior. Then I modified the runtime to print the flags and the sigcode when a SIGBUS is received.
Output of SIGBUS generated by sample from previous comment
SIGBUS flags=0x0x88 sigcode=0x2
Output with SIGBUS generated by a bad register access
SIGBUS flags=0x0x88 sigcode=0x0
Since sigcode 0 matches _SI_USER, the signal is not handled properly in the case of our bad register access, while it is handled properly when generated using the code from the previous comment.
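The difference between the two cases can be sketched with a small model. This is not the runtime's actual code, only a simplified illustration of the check as I understand it: a signal is turned into a Go panic only when its table entry carries the panic flag and its si_code is not SI_USER. The constant values and the function name becomesPanic are assumptions chosen to match the printed flags=0x88 and sigcode values.

```go
package main

import "fmt"

// Values copied here for illustration only (assumptions, not imported
// from the runtime).
const (
	siUser    = 0   // SI_USER: signal sent explicitly, e.g. with kill
	busAdrErr = 2   // BUS_ADRERR: the sigcode seen with the mmap sample
	sigPanic  = 0x8 // assumed bit for the runtime's "panic on this signal" flag
)

// becomesPanic is a hypothetical, simplified model of the decision:
// only signals that are marked as panicking and that do not look
// user-sent are converted into a Go panic.
func becomesPanic(flags, sigcode int) bool {
	return sigcode != siUser && flags&sigPanic != 0
}

func main() {
	fmt.Println(becomesPanic(0x88, busAdrErr)) // mmap sample, sigcode=0x2
	fmt.Println(becomesPanic(0x88, siUser))    // bad register access, sigcode=0x0
}
```

Under this model the bad register access is indistinguishable from a SIGBUS sent with kill, which is why it never reaches the SetPanicOnFault path.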
Here is a minimal program which reproduces the issue on armv7. The same code on amd64 doesn't reproduce the issue, as mmap simply refuses to map bad addresses.
https://play.golang.org/p/Zbi9pBZ3rKu Output:
SIGBUS flags=0x0x88 sigcode=0x 0x0
SIGBUS: bus error
PC=0xa2728 m=0 sigcode=0
goroutine 1 [running]:
main.main()
gobv1/tools/crash/main.go:37 +0x240 fp=0x4227b8 sp=0x422740 pc=0xa2728
runtime.main()
runtime/proc.go:205 +0x208 fp=0x4227e4 sp=0x4227b8 pc=0x427f8
runtime.goexit()
runtime/asm_arm.s:857 +0x4 fp=0x4227e4 sp=0x4227e4 pc=0x6d8f0
trap 0x0
error 0x1818
oldmask 0x0
r0 0x0
r1 0x1000
r2 0x26c2c000
r3 0x0
r4 0x4
r5 0x0
r6 0x26c2cfff
r7 0x0
r8 0x7
r9 0x1
r10 0x4000e0
fp 0x14d078
ip 0xd
sp 0x422740
lr 0x119f0
pc 0xa2728
cpsr 0x20000010
fault 0x0
Punting to Go1.17, thank you all for the patience, and for the discussion, please keep it going.
I don't understand why the kernel would send a signal with si_code set to SI_USER. That seems like a kernel bug. The SI_USER code is supposed to indicate an explicit use of the kill system call. I don't mind working around a kernel bug but we don't want to treat all SIGBUS signals with si_code == SI_USER as indicating an actual bus error.
Since this has been around forever, I'd just like to add that you can reliably trigger a crashing SIGBUS, even when trying to recover the panic, by writing to PROT_READ mmap'd memory. I can trigger it 100% of the time using gommap on Darwin arm64. Not sure if that helps with debugging and finding a handler.
For linux/arm64, https://github.com/torvalds/linux/commit/526c3ddb6aa270fe6f71d227eac1e189746cc257 and https://github.com/torvalds/linux/commit/af40ff687bc9d351030685fde2f57ba45ab4fc14 come to mind (which suggest to me that this used to be a problem that got fixed.) I can't speak to linux/arm32 or darwin/arm64. I cannot reproduce the problem on linux/amd64.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
We are using Go for some embedded development (cross compiled to linux arm32). We access various FPGA registers from the Go process. In order to access those registers, we mmap /dev/mem at the address range of those registers.
When we access registers which are not defined/accessible in the FPGA, the process crashes with the error reported below.
We use defer debug.SetPanicOnFault(debug.SetPanicOnFault(true)) in the call stack that performs the register read, as we expect this to make the runtime panic instead of crashing on this kind of memory fault.
What did you expect to see?
A panic where the bad access happened. This way, with a recover call, it would be possible to handle the case where some registers are not available.
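A minimal sketch of the control flow we were hoping for is shown here. The helper name readRegister is hypothetical, and a nil pointer stands in for an inaccessible register. A nil dereference panics regardless of SetPanicOnFault, so this only illustrates the intended recover pattern and does not reproduce the bug.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// readRegister is a hypothetical helper. In the real program, reg would
// point into the mmap'd /dev/mem window for the FPGA registers.
func readRegister(reg *uint32) (v uint32, err error) {
	// Enable panic-on-fault for this goroutine and restore the previous
	// setting on return.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("register read faulted: %v", r)
		}
	}()
	return *reg, nil
}

func main() {
	good := uint32(42)
	fmt.Println(readRegister(&good)) // readable location: value and nil error
	fmt.Println(readRegister(nil))   // stand-in for a bad register: error
}
```

With this pattern, a missing register would be reported to the caller as an ordinary error rather than terminating the process.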
What did you see instead?
The process crashes, in an unrecoverable way, with the following output:
Workaround
I built a custom runtime with this commit which makes the call panic as expected.