Thanks for the report. Unfortunately I don't see enough here to know what to fix. Are you saying that the Go program simply crashes with "Segmentation fault (core dumped)" and doesn't print anything else? That should (of course) never happen.
There is something odd here. Your go version output says darwin/amd64, but your backtrace says sys_linux_amd64.s. Which is it?
The backtrace line suggests that the crash is because g != nil but g.m == nil or g.m.gsignal == nil. Not sure how that could happen.
If you can figure out the arguments in the core dump, you can determine where the first SIGSEGV occurred by looking at the ctx argument to sigtrampgo. It will be a pointer to a sigcontext struct, defined in runtime/defs_linux_amd64.go. The rip field should point to the instruction that got the error.
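For reference, the relevant part of that struct looks roughly like the abridged sketch below; the authoritative definition is the one in runtime/defs_linux_amd64.go.
// Abridged sketch of sigcontext from runtime/defs_linux_amd64.go; the field
// order mirrors the kernel's struct sigcontext for x86-64. rip holds the
// instruction pointer at the time of the signal.
type sigcontext struct {
	r8, r9, r10, r11, r12, r13, r14, r15 uint64
	rdi, rsi, rbp, rbx, rdx, rax, rcx    uint64
	rsp, rip, eflags                     uint64
	cs, gs, fs, __pad0                   uint16
	err, trapno, oldmask, cr2            uint64
	// fpstate pointer and reserved words omitted
}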
I'm cross compiling from darwin to linux, so compiler is darwin/amd64 and runtime is sys_linux_amd64. Should have mentioned that.
env GOOS=linux go build -o bin/canal example.com/path/canal
Yes, the Go program simply crashes with Segmentation fault (core dumped). No additional info.
I'll poke around and see if I can figure out where the first SIGSEGV is occurring.
Does your program crash if you compile it natively on the target Linux machine?
I've reproduced the issue with a Linux-built binary. It also happened when I tested with go1.6beta2.
Can you please paste the full stack trace from the crash?
The Go stacktrace doesn't really exist, but here is what it prints out:
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
INFO[0000] PacketDecoder started
Segmentation fault (core dumped)
gdb:
tarrant@rep-testing-826198:~$ gdb canal-1453060415 core-1453060415
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/users/tarrant/canal-1453060415...done.
warning: core file may not match specified executable file.
[New LWP 34971]
[New LWP 34975]
[New LWP 35010]
[New LWP 35005]
[New LWP 34974]
[New LWP 34977]
[New LWP 34979]
[New LWP 34998]
[New LWP 34968]
[New LWP 34978]
[New LWP 34976]
[New LWP 34981]
[New LWP 34985]
[New LWP 34980]
[New LWP 34990]
[New LWP 34987]
[New LWP 34982]
[New LWP 34984]
[New LWP 34995]
[New LWP 34986]
[New LWP 34989]
[New LWP 34992]
[New LWP 34999]
[New LWP 35000]
[New LWP 34997]
[New LWP 35007]
[New LWP 34996]
[New LWP 35006]
[New LWP 35008]
[New LWP 36191]
[New LWP 35018]
[New LWP 34993]
[New LWP 34969]
[New LWP 35009]
[New LWP 34970]
[New LWP 35516]
[New LWP 34972]
[New LWP 34991]
[New LWP 34983]
[New LWP 34973]
[New LWP 34994]
[New LWP 35001]
[New LWP 34988]
[New LWP 35003]
[New LWP 35221]
[New LWP 35024]
[New LWP 35002]
[New LWP 35004]
[New LWP 35011]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./canal -log debug
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80)
at /usr/local/go/src/runtime/signal_sigtramp.go:27
27 /usr/local/go/src/runtime/signal_sigtramp.go: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /home/users/tarrant/canal-1453060415
Use `info auto-load-scripts [REGEXP]' to list them.
(gdb) bt
#0 0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80)
at /usr/local/go/src/runtime/signal_sigtramp.go:27
#1 0x000000000046533b in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:234
#2 0x0000000000465340 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:235
#3 0x0000000000000001 in ?? ()
#4 0x0000000000000000 in ?? ()
(gdb)
I don't understand. There should be a panic from Go, especially for SEGV.
Are you sure there is no cgo in your project? If you're doing packet decoding, are you using a library which links against libpcap? Is the source of your project available?
We're unfortunately not able to share the source code for the program. There's no cgo in this program outside of the core net and runtime/race packages. The packet processing is done on UDP packets in pure Go, not on SOCK_RAW with libpcap. The race detector has not found any data races in the program.
The program crashes with SIGSEGV in runtime.sigtrampgo, apparently because either g.m or g.m.gsignal is nil. The sig argument to sigtrampgo is 11, indicating that it was called in response to a SIGSEGV from elsewhere in the program. The ctx argument should point to a runtime.sigcontext struct, so we look at the memory it points to. The rip field in the core dumps is indeed a pointer, but not to a valid instruction. We have an example of it pointing into the .rodata section (to the string "GC worker (idle)"), and an example of it pointing to a runtime.g value (with 0xfffffffffffffade showing up after 16 bytes, where we'd expect the runtime.g's stackguard0 field to be).
I suspect that the instruction pointer for the first SIGSEGV is being accurately reported; that is, the original SIGSEGV is caused by an attempt to execute non-executable memory. I think this is likely due to memory corruption in the scheduler's datastructures.
How can we hunt down corruption in the scheduler's memory? Does corruption there sound like a plausible explanation for the crashes we see?
Can you reproduce the problem on previous versions of Go, 1.5.3 or 1.4.2?
How can we hunt down corruption in the scheduler's memory? Does corruption there sound like a plausible explanation for the crashes we see?
Possible, but unlikely; corruption of this sort usually comes from data races in the program.
If you suspect that your tests do not exercise the possibly racy paths, could you build a race-enabled version of your program and deploy it? go build -race $PKG will do it.
Can you run the process under gdb? That should stop at the point of the first segfault rather than when Go's handler breaks, and may make it easier to find out what's going on. Alternatively, I guess, you could disable Go's handling of SIGSEGV entirely.
As to figuring out why you're jumping into the weeds, I guess if you're lucky you're jumping via CALL and the address you jumped from is on the top of the stack. If not, you might still be able to figure out where you were in the code by looking through the stack, though of course if the jump is from runtime·gogo or something like that this won't help.
The program has been running with the race detector for the past hour with no races reported. It hasn't encountered the SIGSEGV yet, but all other periodic code paths should have been exercised a few hundred times.
We can run the process under gdb, though it may take a day before we see the crash again. Disabling Go's SIGSEGV handling to force an immediate core dump sounds attractive. Can this be done via signal.Ignore(syscall.SIGSEGV), or does it require calling sigaction(2) directly?
We can run the process under gdb, though it may take a day before we see the crash again. Disabling Go's SIGSEGV handling to force an immediate core dump sounds attractive. Can this be done via signal.Ignore(syscall.SIGSEGV), or does it require calling sigaction(2) directly?
I don't actually know the answer to this. I was thinking of hacking out the runtime code that installs the SIGSEGV handler...
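A minimal sketch of the signal.Ignore experiment being discussed is below; whether the runtime actually gives up SIGSEGV for a synchronous, fault-generated signal is exactly the open question, so treat it as an experiment rather than a known fix.
package main

import (
	"os/signal"
	"syscall"
)

func main() {
	// Experiment: ask the Go runtime to ignore SIGSEGV in the hope that the
	// first fault terminates the process with a core dump instead of running
	// runtime.sigtrampgo. It is unclear whether this has any effect for a
	// fault-generated SIGSEGV; the fallback is hacking out the runtime code
	// that installs the handler.
	signal.Ignore(syscall.SIGSEGV)

	// ... rest of the program ...
}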
The goroutine that's segfaulting is one of the GC's background mark workers. Both usable core dumps we have show the same thing (below). runtime.sigtrampgo is called with the sig argument set to 11 for SIGSEGV, and the contents of the info argument indicate signal 11 with si_code SEGV_MAPERR (the fault is due to an unmapped address) and a fault address of 0xd8. go tool objdump indicates that the result of the call to getg is stored in the AX register.
$ gdb {program,core}-1452891329
[snip]
Program terminated with signal 11, Segmentation fault.
#0 0x000000000044930e in runtime.sigtrampgo (sig=11, info=0xc820d3dbb0, ctx=0xc820d3da80) at /usr/local/Cellar/go/HEAD/libexec/src/runtime/signal_sigtramp.go:27
[snip]
(gdb) p info.si_signo
$1 = 11
(gdb) p info.si_code
$2 = 1
(gdb) p/x info.si_addr
$3 = 0xd8
(gdb) info symbol (('runtime.g'*)$rax).gopc
runtime.gcBgMarkStartWorkers + 159 in section .text of /path/to/program-1452891329
(gdb) info symbol (('runtime.g'*)$rax).startpc
runtime.gcBgMarkWorker in section .text of /path/to/program-1452891329
$ gdb {program,core}-1453060415
[snip]
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000446ddf in runtime.sigtrampgo (sig=11, info=0xc820095bb0, ctx=0xc820095a80) at /usr/local/go/src/runtime/signal_sigtramp.go:27
[snip]
(gdb) p info.si_signo
$1 = 11
(gdb) p info.si_code
$2 = 1
(gdb) p/x info.si_addr
$3 = 0xd8
(gdb) info symbol (('runtime.g'*)$rax).gopc
runtime.gcBgMarkStartWorkers + 147 in section .text of /path/to/program-1453060415
(gdb) info symbol (('runtime.g'*)$rax).startpc
runtime.gcBgMarkWorker in section .text of /path/to/program-1453060415
I think I've found the cause of our issues. The G that runs the GC background mark worker is sometimes running without an M. When that goroutine calls runtime.acquirem, it's unable to dereference _g_.m, which leads to a SIGSEGV at address 0xd8 when running the following instructions:
FS MOVQ FS:0xfffffff8, AX
MOVQ 0x30(AX), CX
MOVL 0xd8(CX), R8
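For context, runtime.acquirem is tiny - roughly the following in the runtime source - so with _g_.m == nil the increment of m.locks becomes a load through a nil pointer at the field's offset, which matches the 0xd8 fault address and the MOVL 0xd8(CX), R8 above:
// runtime.acquirem, roughly as it appears in src/runtime/proc.go at the time.
// getg() returns the current g; if _g_.m is nil, touching _g_.m.locks faults
// at the small offset of the locks field within the m struct.
func acquirem() *m {
	_g_ := getg()
	_g_.m.locks++
	return _g_.m
}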
The ctx argument to sigtrampgo doesn't point right at a *runtime.sigcontext - it points to 40 bytes before one, so all of the field names were wrong. Adding 0x28 to the pointer results in much more believable register contents, such as 1) an instruction pointer that points to an instruction in the text segment, and 2) a stack pointer that points to an address between stack.lo and stack.hi:
devel +505fa7b Tue Jan 12 17:31:01 2016 +0000
$ gdb {program,core}-1452891329
[snip]
(gdb) p/x (('runtime.g'*)$rax).stack.lo
$1 = 0xc82024c000
(gdb) p/x (('runtime.g'*)$rax).stack.hi
$2 = 0xc82024c7e0
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rsp
$3 = 0xc82024c740
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
$4 = 0x41f9f8
(gdb) info symbol (('runtime.sigcontext'*)(ctx+0x28)).rip
runtime.gcBgMarkWorker + 376 in section .text of /path/to/program-1452891329
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rcx
$5 = 0x0
$ go tool objdump /path/to/program-1452891329 | grep -C 3 '0x41f9f8'
mgc.go:1469 0x41f9ea c3 RET
mgc.go:1390 0x41f9eb 64488b0425f8ffffff FS MOVQ FS:0xfffffff8, AX
mgc.go:1390 0x41f9f4 488b4830 MOVQ 0x30(AX), CX
mgc.go:1390 0x41f9f8 448b81d8000000 MOVL 0xd8(CX), R8
mgc.go:1390 0x41f9ff 41ffc0 INCL R8
mgc.go:1390 0x41fa02 448981d8000000 MOVL R8, 0xd8(CX)
mgc.go:1390 0x41fa09 4c8b4030 MOVQ 0x30(AX), R8
go1.6beta2
$ gdb {canal,core}-1453060415
[snip]
(gdb) p/x (('runtime.g'*)$rax).stack.lo
$1 = 0xc8202a7800
(gdb) p/x (('runtime.g'*)$rax).stack.hi
$2 = 0xc8202a7fe0
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rsp
$3 = 0xc8202a7f20
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
$4 = 0x434514
(gdb) info symbol (('runtime.sigcontext'*)(ctx+0x28)).rip
runtime.gopark + 36 in section .text of /path/to/program-1453060415
(gdb) p/x (('runtime.sigcontext'*)(ctx+0x28)).rax
$5 = 0x0
$ go tool objdump /path/to/program-1453060415 | grep -C 3 '0x434514'
proc.go:248 0x434503 4883ec20 SUBQ $0x20, SP
proc.go:249 0x434507 64488b0c25f8ffffff FS MOVQ FS:0xfffffff8, CX
proc.go:249 0x434510 488b4130 MOVQ 0x30(CX), AX
proc.go:249 0x434514 8ba8d8000000 MOVL 0xd8(AX), BP
proc.go:249 0x43451a ffc5 INCL BP
proc.go:249 0x43451c 89a8d8000000 MOVL BP, 0xd8(AX)
proc.go:249 0x434522 488b6930 MOVQ 0x30(CX), BP
So to summarize, our program is crashing because a GC background mark worker is running with g.m set to nil. This causes a segfault, which causes the signal handler to run, which in turn encounters a segfault when it tries to dereference g.m, resulting in an immediate and unceremonious crash of the program.
The 40 byte offset is because I'd been misinterpreting the meaning of the ctx argument to runtime.sigtrampgo. It's an unsafe.Pointer to a runtime.ucontext value which on amd64 includes a runtime.sigcontext value at a 40-byte offset (rather than a pointer directly to a runtime.sigcontext value). runtime/signal_linux_amd64.go shows how to correctly interpret the contents.
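A rough sketch of where the 0x28 comes from on linux/amd64, abridged from the runtime's definitions (the authoritative layout is in runtime/defs_linux_amd64.go and the accessors in runtime/signal_linux_amd64.go):
// Abridged sketch of the linux/amd64 ucontext as the runtime defines it.
// The register state (sigcontext) is embedded in uc_mcontext, which begins
// after uc_flags (8 bytes), uc_link (8 bytes) and uc_stack (pointer + flags
// and padding + size = 24 bytes): 8 + 8 + 24 = 40 = 0x28.
type stackt struct {
	ss_sp     *byte
	ss_flags  int32
	pad_cgo_0 [4]byte
	ss_size   uintptr
}

type ucontext struct {
	uc_flags    uint64
	uc_link     *ucontext
	uc_stack    stackt
	uc_mcontext sigcontext // rip, rsp, and the other registers live here
	// uc_sigmask and FP state follow
}

// The accessor in signal_linux_amd64.go is essentially
//   func (c *sigctxt) regs() *sigcontext { return &(*ucontext)(c.ctxt).uc_mcontext }
// which is why adding 0x28 to ctx in gdb lands on the sigcontext.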
I don't have any further answers for why the GC worker is running without an M.
It looks like this is the same bug as #13363, although the CPU profiler was not involved with these crashes. /cc @aclements
@rhysh thanks for the good investigative work.
@RLH @aclements it appears that in some (uncommon) case gcBgMarkWorker can be started up without an m.
Crash 1452891329 on go version 505fa7b comes in the middle of runtime.gcBgMarkWorker during a call to runtime.acquirem. This isn't the first call to runtime.acquirem in that function, which means that the worker must have had an m when it started and then lost it (otherwise we'd see a segfault at the first call).
Crash 1453060415 on go version go1.6beta2 comes at the call to runtime.acquirem at the start of runtime.gopark. I looked through the stack of the crashed goroutine for instruction addresses and found a 64-bit aligned value of 0x000000000041f0fc at address 0xc8202a7f40, between the stack.hi value of 0xc8202a7fe0 and the rsp value of 0xc8202a7f20. The instruction address corresponds to the return from runtime.gopark at the top of runtime.gcBgMarkWorker's for loop. If I've unwound the stack correctly, this means that gcBgMarkWorker calling gopark calling acquirem led to the second crash.
$ go tool objdump /path/to/program-1453060415 | grep -C 3 '0x41f0fc'
mgc.go:1378 0x41f0e9 c644242014 MOVL $0x14, 0x20(SP)
mgc.go:1378 0x41f0ee 48c744242800000000 MOVQ $0x0, 0x28(SP)
mgc.go:1378 0x41f0f7 e8f4530100 CALL runtime.gopark(SB)
mgc.go:1383 0x41f0fc 488b6c2470 MOVQ 0x70(SP), BP
mgc.go:1383 0x41f101 4883fd00 CMPQ $0x0, BP
mgc.go:1383 0x41f105 0f845d050000 JE 0x41f668
mgc.go:1383 0x41f10b 488b9d10120000 MOVQ 0x1210(BP), BX
The gcBgMarkWorker goroutines had an m when they were created, but had their execution resumed at some point without an m.
Nice debugging! I agree that this is almost certainly the same failure as #13363. You're positive that CPU profiling wasn't enabled? Is it possible the application is receiving other asynchronous signals? (I would love to rule out async signals as a possible cause.)
pprof is installed, but I was not actually calling any of the pprof endpoints for most of the crashes. For many of them I'd start the process, close the screen session, and then go home for the day, only to have it crash the next morning.
I don't think any async signals are being sent, though there may be some under the hood. Most of the syscalls are reads and writes on network fds.
I'll poke around and see if I can find any others, but I seriously doubt it.
I can think of a few hypotheses:
We may be able to disprove 1 by looking at where the SIGSEGVs happen relative to possible preemption points. The background mark worker has fairly limited preemption points and since the goroutine can only resume at a preemption point, if there are any successful accesses to g.m between the crash and the previous preemption point, we can rule out 1.
My case from #13363 does not disprove 1. In fact, since we resume from an involuntary preemption at the call to morestack, the crash happened just a few instructions after a possible resume point. This lends support to theory 1, but does not prove it.
@rhysh, in 1452891329, do you know which acquirem crashed?
Crash 1453060415 would seem to disprove 1. As of the acquirem in gopark, the goroutine hasn't been preemptible for quite a while. On all of the code paths that lead from disabling preemption to gopark (of which I think there are three: function entry, regular looping around, and looping around from the gcMarkDone condition), we've done our own acquirem, which would indicate g.m != nil since the last preemption point.
Further evidence against 1 is that execute itself would crash if g.m were nil. However, there are other calls to gogo that wouldn't crash if g.m were nil.
For 2 and 3, we need to consider the places that modify g.m. Ignoring places where we assign g0.m or gsignal.m, there are four: newextram, execute, dropg, and goexit0. newextram always sets it to non-nil (and immediately uses it), so we can discount it. Likewise, execute would crash if g.m were nil. dropg and goexit0 do set g.m to nil.
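For reference, dropg at the time looked roughly like this; it is one of the two places that legitimately writes nil into a user g's m field:
// runtime.dropg, roughly as in src/runtime/proc.go of that era. It severs
// the association between the current m and its curg, setting curg.m to nil
// (goexit0 being the other site that clears g.m).
func dropg() {
	_g_ := getg()
	if _g_.m.lockedg == nil {
		_g_.m.curg.m = nil
		_g_.m.curg = nil
	}
}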
@rhysh or @tarrant, could you have GDB dump the whole current G?
For 2, it's almost certainly not goexit0. For 3, it could conceivably be goexit0. However, in #13363, it's definitely not goexit0 because the status is _Grunning and the waitstatus is set. Likewise, all of the calls to dropg are immediately preceded by setting the status to something other than _Grunning. Most paths back to _Grunning would crash if g.m were nil, but I'm not sure all of them would. This would be a useful place to dig deeper.
For 4, I'm pretty sure there's nothing specific in the signal code that modifies g.m and I don't think it can call any of the functions I mentioned above. So 4 seems unlikely.
I hope it's not 5.
gdb {canal,core}-1453060415
...
(gdb) print/x *g
$1 = {stack = {lo = 0xc8202a7800, hi = 0xc8202a7fe0}, stackguard0 = 0xc8202a7ad0, stackguard1 = 0xffffffffffffffff, _panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x800,
sched = {sp = 0x0, pc = 0x434653, g = 0xc8203dc300, ctxt = 0x0, ret = 0x0, lr = 0x0, bp = 0x0}, syscallsp = 0x0, syscallpc = 0x0, stkbar = {array = 0xc8202a7fe0, len = 0x0,
cap = 0x2}, stkbarPos = 0x0, stktopsp = 0xc8202a7fb8, param = 0x0, atomicstatus = 0x2, stackLock = 0x0, goid = 0x232, waitsince = 0x0, waitreason = {str = 0x9bf1e0,
len = 0x10}, schedlink = 0x0, preempt = 0x0, paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0, throwsplit = 0x0, raceignore = 0x0,
sysblocktraced = 0x0, sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0, writebuf = {array = 0x0, len = 0x0, cap = 0x0}, sigcode0 = 0x0, sigcode1 = 0x0,
sigpc = 0x0, gopc = 0x41ef63, startpc = 0x41efd0, racectx = 0x0, waiting = 0x0, gcAssistBytes = 0x0}
Thanks for looking into this, Austin.
The acquirem call in 1452891329 is the one towards the top of gcBgMarkWorker's for loop: https://github.com/golang/go/blob/505fa7b/src/runtime/mgc.go#L1390
I've collected quite a few core dumps from reproducing #13363, which show five different call stacks at the time of the crash. I'm building binaries of the runtime/pprof test for linux/amd64 with go version devel +01b8640 Sat Jan 16 03:35:23 2016 +0000 darwin/amd64, with a toolchain built with GOEXPERIMENT=framepointer and with the pprof frequency set to 500Hz. I have a gdb script that uses the frame pointer to unwind the stacks.
$ for core in ./core-* ; do gdb --batch --command unwind.gdb ./pprof.test $core 2>/dev/null | grep -A 2 '^\$' | sed -n -e 's/ in section .*//p' | sed -e 's/ //g' | tac | tr '\n' ';' | sed -e 's/;$/\n/' ; done | sort | uniq -c | sort -n -r
38 runtime.goexit+1;runtime.gcBgMarkWorker+1273;runtime.gcMarkDone+62;runtime.semacquire+46
30 runtime.goexit+1;runtime.gcBgMarkWorker+1273;runtime.morestack+13
24 runtime.goexit+1;runtime.gcBgMarkWorker+321;runtime.gopark+46
1 runtime.goexit+1;runtime.gcBgMarkWorker+718;runtime.gcDrain+277;runtime.(*gcWork).balance+148;runtime.handoff+162
1 runtime.goexit+1;runtime.gcBgMarkWorker+321;runtime.gopark+432;runtime.writebarrierptr+129;runtime.writebarrierptr_nostore1+27
stress command:
./stress sh -c "GODEBUG=gcstackbarrierall=1,gcstackbarrierall=0 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"
gdb script:
$ cat unwind.gdb
printf "\n"
p/x (('runtime.sigcontext'*)(ctx+0x28)).rip
i sym (('runtime.sigcontext'*)(ctx+0x28)).rip
set $ptr = (long *)(('runtime.sigcontext'*)(ctx+0x28)).rbp
while $ptr != 0
printf "\n"
set $instr = *($ptr+1)
p/x $instr
i sym $instr
set $ptr = *(long **)($ptr)
end
changes to core packages to make reproducing easier, per #13363:
diff --git a/src/runtime/pprof/pprof.go b/src/runtime/pprof/pprof.go
index d32b315..f1d8cbb 100644
--- a/src/runtime/pprof/pprof.go
+++ b/src/runtime/pprof/pprof.go
@@ -577,7 +577,7 @@ func StartCPUProfile(w io.Writer) error {
// system, and a nice round number to make it easy to
// convert sample counts to seconds. Instead of requiring
// each client to specify the frequency, we hard code it.
- const hz = 100
+ const hz = 500
cpu.Lock()
defer cpu.Unlock()
diff --git a/src/runtime/pprof/pprof_test.go b/src/runtime/pprof/pprof_test.go
index c241b54..6d83236 100644
--- a/src/runtime/pprof/pprof_test.go
+++ b/src/runtime/pprof/pprof_test.go
@@ -103,7 +103,7 @@ func parseProfile(t *testing.T, bytes []byte, f func(uintptr, []uintptr)) {
}
hd, val, tl := val[:5], val[5:l-3], val[l-3:]
- if hd[0] != 0 || hd[1] != 3 || hd[2] != 0 || hd[3] != 1e6/100 || hd[4] != 0 {
+ if hd[0] != 0 || hd[1] != 3 || hd[2] != 0 /*|| hd[3] != 1e6/100*/ || hd[4] != 0 {
t.Fatalf("unexpected header %#x", hd)
}
That is, 38 core dumps where the currently running goroutine had a stack of runtime.goexit+1 -> runtime.gcBgMarkWorker+1273 -> runtime.gcMarkDone+62 -> runtime.semacquire+46, which corresponds to a stack of:
runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1459
runtime.gcMarkDone https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1042
runtime.semacquire https://github.com/golang/go/blob/01b8640/src/runtime/sema.go#L68
30 core dumps with a stack of runtime.goexit+1 -> runtime.gcBgMarkWorker+1273 -> runtime.morestack+13, meaning:
runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1459
runtime.morestack https://github.com/golang/go/blob/01b8640/src/runtime/asm_amd64.s#L324
24 of runtime.goexit+1 -> runtime.gcBgMarkWorker+321 -> runtime.gopark+46:
runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1382
runtime.gopark https://github.com/golang/go/blob/01b8640/src/runtime/proc.go#L249
One of runtime.goexit+1 -> runtime.gcBgMarkWorker+718 -> runtime.gcDrain+277 -> runtime.(*gcWork).balance+148 -> runtime.handoff+162:
runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1412
runtime.gcDrain https://github.com/golang/go/blob/01b8640/src/runtime/mgcmark.go#L826
runtime.(*gcWork).balance https://github.com/golang/go/blob/01b8640/src/runtime/mgcwork.go#L245
runtime.handoff https://github.com/golang/go/blob/01b8640/src/runtime/mgcwork.go#L461
One of runtime.goexit+1 -> runtime.gcBgMarkWorker+321 -> runtime.gopark+432 -> runtime.writebarrierptr+129 -> runtime.writebarrierptr_nostore1+27:
runtime.gcBgMarkWorker https://github.com/golang/go/blob/01b8640/src/runtime/mgc.go#L1382
runtime.gopark https://github.com/golang/go/blob/01b8640/src/runtime/proc.go#L256
runtime.writebarrierptr https://github.com/golang/go/blob/01b8640/src/runtime/mbarrier.go#L143
runtime.writebarrierptr_nostore1 https://github.com/golang/go/blob/01b8640/src/runtime/mbarrier.go#L110
I think that last crash may be the most interesting, as it happens within runtime.gopark after a successful call to acquirem.
The write to mp.waitunlockf invokes a write barrier, which calls acquirem and hits a SIGSEGV.
func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason string, traceEv byte, traceskip int) {
mp := acquirem() // getg().m is non-nil when this line executes
gp := mp.curg
status := readgstatus(gp)
if status != _Grunning && status != _Gscanrunning {
throw("gopark: bad g status")
}
mp.waitlock = lock
mp.waitunlockf = *(*unsafe.Pointer)(unsafe.Pointer(&unlockf)) // SIGSEGV here during write barrier
gp.waitreason = reason
mp.waittraceev = traceEv
mp.waittraceskip = traceskip
releasem(mp)
// can't do anything that might move the G between Ms here.
mcall(park_m)
}
Thanks @tarrant, it looks like that G is very similar to the one from #13363. It's in state _Grunning and the wait reason is set, so the same conclusions apply.
@rhysh, you're right that that last crash is really interesting. What makes it even more interesting is that there's also a write barrier on the previous line that would have crashed if g.m was nil on that line.
That strongly suggests that either the write barrier itself can somehow clear g.m, or something is concurrently clearing it. It's hard to say for sure, but I don't think any of the other stack traces are obviously incompatible with the theory that the write barrier itself is doing it. Though my hunch is that something is concurrently clearing it since there are so many different stacks and the write barrier itself is fairly deterministic.
I may know how this is happening. In the gcMarkDone case of the mark worker, the worker disassociates with the P and makes itself preemptible, and then later attempts to reassociate with the P. If the scheduler moved the worker to a different P while it was preemptible (and a new worker hasn't been started for the original P), then as soon as the worker reassociates with the original P, the scheduler on that P can run the worker. It may do so before the worker has fully parked itself, at which point the worker G is effectively running on two Ms simultaneously.
The sequence of steps is something like:
1. In the gcMarkDone case, the worker disassociates from its P and makes itself preemptible.
2. While it is preemptible, the scheduler moves the worker to a different P, and no new worker is started for the original P.
3. The worker reassociates with the original P.
4. Before the worker has fully parked itself, the scheduler on the original P picks it up and runs it.
5. The worker G is now effectively running on two Ms simultaneously.
@rhysh or @tarrant, can you give https://go-review.googlesource.com/18741 a try? It's not a good fix, but it should test if I'm right about the cause.
Hmm. The trybots are indicating that maybe you don't want to try CL 18741. It seems my "XXX This is definitely not ideal." is actually "XXX This causes random deadlocks all over the place." However, this does confirm that the workers really do migrate between Ps sometimes in this code path.
This time for real. @rhysh or @tarrant, can you give https://go-review.googlesource.com/18761 a try? I think that's actually the right fix (and it passes the trybots).
CL https://golang.org/cl/18761 mentions this issue.
Sweet, I've got a patched version of Go set up, and the binary is built and running. I'll let it run for a few days, and if nothing happens we can almost safely say this is fixed.
I think there's a bug in my CL that causes it to deadlock occasionally, but any results you can come up with are useful.
Yeah, it looks like a deadlock.
$ sudo perf top -p `pgrep canal`
Events: 2M cycles
29.33% canal [.] runtime.findrunnable
16.41% [kernel] [k] __ticket_spin_lock
7.40% canal [.] runtime.runqgrab
6.34% canal [.] runtime/internal/atomic.Load
3.34% canal [.] runtime/internal/atomic.Xadd64
3.29% canal [.] runtime/internal/atomic.Xadd
2.80% canal [.] runtime.runqsteal
2.73% canal [.] runtime.lock
2.50% canal [.] runtime/internal/atomic.Xchg
2.34% canal [.] runtime.fastrand1
2.26% canal [.] runtime.gcFlushBgCredit
1.54% canal [.] runtime.procyield
1.42% canal [.] runtime/internal/atomic.Cas
0.96% canal [.] runtime.(*gcControllerState).findRunnableGCWorker.func1
0.87% canal [.] runtime.gcmarkwb_m
0.66% canal [.] runtime.schedule
0.65% canal [.] runtime.gcBgMarkWorker
0.59% canal [.] runtime.writebarrierptr_nostore1
0.57% canal [.] runtime.systemstack
0.56% canal [.] runtime.mallocgc
0.55% canal [.] runtime.execute
0.50% [kernel] [k] __wake_up_common
0.49% [vdso] [.] 0x7ffdbd1d870c
0.48% canal [.] runtime.heapBitsForObject
0.39% [kernel] [k] gup_pte_range
It doesn't look like any time is being spent in my code; instead it's just spinning trying to find a thread to run.
Actually, it was spinning waiting for a dead thread to enter the "waiting" state, which, needless to say, was never going to happen. :)
But I think I've fixed the problem and it's passing my stress tests. Could you try the new version of CL 18761?
This one seems to run better, but my program is still spending most of its time spinning.
Events: 273K cycles
18.41% [kernel] [k] __ticket_spin_lock
18.12% canal [.] runtime.findrunnable
8.89% canal [.] runtime.runqgrab
7.41% canal [.] runtime/internal/atomic.Load
3.74% canal [.] runtime/internal/atomic.Xadd
3.56% canal [.] runtime/internal/atomic.Xadd64
3.36% canal [.] runtime.runqsteal
3.01% canal [.] runtime.lock
2.85% canal [.] runtime.fastrand1
2.71% canal [.] runtime/internal/atomic.Xchg
2.52% canal [.] runtime.gcFlushBgCredit
1.69% canal [.] runtime.procyield
1.53% canal [.] runtime/internal/atomic.Cas
1.30% canal [.] runtime.gcBgMarkWorker
1.13% canal [.] runtime.(*gcControllerState).findRunnableGCWorker.func1
0.80% canal [.] runtime.gcmarkwb_m
0.70% canal [.] runtime.mallocgc
0.66% canal [.] runtime.writebarrierptr_nostore1
0.63% canal [.] runtime.systemstack
0.63% canal [.] runtime.gcDrain
0.57% [vdso] [.] 0x7ffd809b370c
@tarrant, interesting. What does the profile look like without the CL?
It would look a lot better if I had frame pointers enabled. But:
Events: 295K cycles
73.83% canal (deleted) [.] 0x3e9d4
14.63% [kernel] [k] __ticket_spin_lock
0.74% [kernel] [k] gup_pte_range
0.69% [vdso] [.] 0x7ffccd34f70c
0.49% [kernel] [k] get_futex_key
0.46% [kernel] [k] __wake_up_common
0.41% [kernel] [k] __wait_on_bit_lock
0.40% [kernel] [k] put_page
0.38% [kernel] [k] prepare_to_wait_exclusive
0.30% [kernel] [k] __schedule
0.29% [kernel] [k] get_user_pages_fast
0.28% [kernel] [k] __ticket_spin_unlock
0.25% [kernel] [k] get_futex_key_refs.isra.11
0.25% [kernel] [k] __wake_up_bit
0.25% [kernel] [k] native_write_msr_safe
0.24% [kernel] [k] wake_bit_function
0.24% [kernel] [k] try_to_wake_up
0.21% [kernel] [k] update_curr
0.21% [kernel] [k] copy_user_generic_string
0.20% [kernel] [k] unlock_page
0.18% [kernel] [k] wake_affine
0.18% [kernel] [k] effective_load.isra.57
0.18% [kernel] [k] page_waitqueue
0.16% [kernel] [k] drop_futex_key_refs.isra.14
0.15% [kernel] [k] io_schedule
Thanks @tarrant. Unfortunately, I can't compare these profiles because there aren't any Go symbols in the second one. It looks like it's spending roughly the same amount of time in __ticket_spin_lock, but, for instance, I don't know if runtime.findrunnable has become more expensive with my CL.
Hopefully this will help: this is a perf top from a binary built from straight go1.6beta2. If there is a better way to let you see what is going on, let me know and I'll do my best.
Events: 267K cycles
4.18% [kernel] [k] __ticket_spin_lock
3.48% canal [.] runtime.mallocgc
3.16% canal [.] runtime.scanobject
2.03% canal [.] runtime/internal/atomic.Xchg
1.71% canal [.] runtime.pcvalue
1.71% canal [.] runtime.greyobject
1.59% canal [.] runtime.heapBitsForObject
1.54% canal [.] runtime.memmove
1.38% canal [.] runtime/internal/atomic.Cas64
1.38% canal [.] runtime.readvarint
1.32% canal [.] runtime.gcDrain
1.28% canal [.] runtime.findrunnable
1.23% canal [.] runtime.heapBitsSetType
1.20% canal [.] runtime/internal/atomic.Load
1.17% canal [.] runtime.lock
1.15% canal [.] runtime.procyield
1.10% canal [.] runtime.selectgoImpl
1.06% canal [.] sync/atomic.CompareAndSwapUint32
1.05% canal [.] runtime/internal/atomic.Load64
1.05% canal [.] runtime.gentraceback
1.03% canal [.] runtime/internal/atomic.Xadd
1.02% canal [.] runtime.step
0.90% canal [.] sync/atomic.AddUint32
0.88% canal [.] runtime.systemstack
Hi @tarrant. I haven't been able to reproduce any performance problems with this CL with either the go1 benchmarks or the x/benchmarks garbage benchmark. Are you sure those profiles are comparable? Do you have end-to-end benchmark numbers that show a problem?
$ benchstat log.b203f88 log.fa839a1
name old time/op new time/op delta
BinaryTree17-4 2.98s ± 2% 3.00s ± 3% ~ (p=0.690 n=5+5)
Fannkuch11-4 2.81s ± 3% 2.67s ± 3% -5.03% (p=0.016 n=5+5)
FmtFprintfEmpty-4 56.5ns ±14% 53.8ns ± 4% ~ (p=0.310 n=5+5)
FmtFprintfString-4 176ns ±21% 165ns ± 3% ~ (p=0.365 n=5+5)
FmtFprintfInt-4 162ns ± 1% 160ns ± 2% ~ (p=0.556 n=4+5)
FmtFprintfIntInt-4 264ns ± 6% 253ns ± 2% -4.02% (p=0.032 n=5+5)
FmtFprintfPrefixedInt-4 241ns ± 2% 239ns ± 1% ~ (p=0.325 n=5+5)
FmtFprintfFloat-4 339ns ± 5% 336ns ± 1% ~ (p=0.913 n=5+5)
FmtManyArgs-4 1.07µs ± 5% 1.03µs ± 3% ~ (p=0.056 n=5+5)
GobDecode-4 8.23ms ± 5% 8.08ms ± 2% ~ (p=0.222 n=5+5)
GobEncode-4 6.97ms ± 2% 6.90ms ± 1% ~ (p=0.151 n=5+5)
Gzip-4 346ms ± 1% 356ms ±11% ~ (p=0.690 n=5+5)
Gunzip-4 41.3ms ± 3% 40.9ms ± 4% ~ (p=0.421 n=5+5)
HTTPClientServer-4 66.3µs ± 6% 64.2µs ± 0% ~ (p=0.151 n=5+5)
JSONEncode-4 18.0ms ± 2% 17.8ms ± 2% ~ (p=0.095 n=5+5)
JSONDecode-4 52.2ms ± 1% 52.8ms ± 1% ~ (p=0.063 n=5+4)
Mandelbrot200-4 4.19ms ± 0% 4.21ms ± 2% ~ (p=0.286 n=4+5)
GoParse-4 3.65ms ± 1% 3.64ms ± 3% ~ (p=0.310 n=5+5)
RegexpMatchEasy0_32-4 113ns ± 3% 113ns ± 4% ~ (p=0.460 n=5+5)
RegexpMatchEasy0_1K-4 204ns ± 3% 205ns ± 4% ~ (p=1.000 n=5+5)
RegexpMatchEasy1_32-4 91.3ns ± 2% 91.1ns ± 2% ~ (p=0.524 n=5+5)
RegexpMatchEasy1_1K-4 386ns ± 5% 381ns ± 1% ~ (p=0.484 n=5+5)
RegexpMatchMedium_32-4 139ns ± 1% 139ns ± 0% ~ (p=1.000 n=5+5)
RegexpMatchMedium_1K-4 42.5µs ± 3% 41.9µs ± 0% ~ (p=0.111 n=5+4)
RegexpMatchHard_32-4 2.26µs ± 8% 2.19µs ± 2% ~ (p=0.548 n=5+5)
RegexpMatchHard_1K-4 66.3µs ± 1% 66.3µs ± 5% ~ (p=0.365 n=4+5)
Revcomp-4 628ms ±28% 580ms ±20% ~ (p=0.421 n=5+5)
Template-4 59.2ms ± 1% 59.9ms ± 1% ~ (p=0.413 n=4+5)
TimeParse-4 343ns ± 1% 337ns ± 1% ~ (p=0.063 n=4+5)
TimeFormat-4 348ns ± 0% 349ns ± 0% ~ (p=0.333 n=4+5)
I'm going to go ahead and push this CL so we can get it out there, but please open a new issue if it causes performance problems for your workload.
Just tried go1.6rc1 and am having zero issues. I'm going to chalk up my issues to some oddity in how I was patching or compiling.
Thanks a ton for your help in getting this all sorted out and fixed. -T
Thanks to you and @rhysh for all the help debugging!
It appears that my program is segfaulting (receiving a SIGSEGV) while handling a SIGSEGV.
I was having some GC issues, and a coworker recommended trying go1.6beta1 to see if it would fix them. It didn't (I've since resolved them in other ways), but it did start causing my program to segfault.
Sadly I can't share the actual binary or core dump but I can share other information which should hopefully assist in tracking it down.
The segfault occurs after 12-24 hours of running. The program is network IO intensive. Taking UDP packets on one side and streaming TCP on the other.
CC: @rsc and @ianlancetaylor. I was told this would be of interest to you two.
Version
I also had this occur with go1.6beta1. I'm attempting beta2 now, but it won't be until tomorrow at the earliest that I'll know.
Do you use cgo or unsafe?
I don't believe so, and no unsafe.
tarrant@box-name:~$ ldd canal
not a dynamic executable
GDB StackTrace
Go stacktrace
Memory info from gdb:
Registers
Object Dump