golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.11k stars 17.68k forks source link

cmd/cgo: crash on arm (unexpected return pc) around tip with large cgo library #7978

Closed gopherbot closed 9 years ago

gopherbot commented 10 years ago

by leterip:

What does 'go version' print?
go version devel +b6591f566db4 Mon May 12 17:10:36 2014 -0400 linux/arm

What steps reproduce the problem?
If possible, include a link to a program on play.golang.org.

I have created a gist at https://gist.github.com/zeebo/364a43f3d2ddfb80f9f8 that has all
the details for reproducing the problem. It should be triggered by:

0. Be on an ARM machine with Go on tip and OpenSSL from debian wheezy (it probably works
with other OpenSSL versions, I just haven't tested) :(
1. go get -d gist.github.com/364a43f3d2ddfb80f9f8.git.git
2. go test gist.github.com/364a43f3d2ddfb80f9f8.git.git

What happened?
runtime: unexpected return pc for runtime.starttheworld called from 0xffffffff

What should have happened instead?
PASS

Please provide any additional information below.
I've checked for any things related to objects passed into CGO without references in Go
and I couldn't find anything. There are additional details in the readme.md in the gist
about conditions I've found for triggering it. It seems like one of those things where a
bunch of non-deterministic things have to all happen.

I have been unable to simplify the code in this report any more than it currently is and
I understand it is quite a bit of code with a large CGO dependency around wrapping
OpenSSL of all things and I'm very willing to help out with any information I can in
getting this resolved.
gopherbot commented 10 years ago

Comment 1 by leterip:

I forgot to mention clearly that this passes on linux amd64, linux 386, and darwin amd64
on ~tip and it passes on everything on go 1.2.1. The only failure I have found is linux
arm on ~tip.
~tip means go1.3 beta (f8b50ad4cac4) and b6591f566db4.
ianlancetaylor commented 10 years ago

Comment 2:

Labels changed: added repo-main, release-go1.3maybe, arch-arm.

bradfitz commented 10 years ago

Comment 3:

Sure this is only ARM?
On linux-386-387 we just got:
http://build.golang.org/log/8bc3fb574de73d621f655b4a6ad27f920bfe67d1
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
fatal error: unknown caller pc
runtime stack:
runtime.throw(0x8275671)
    /usr/local/go/src/pkg/runtime/panic.c:520 +0x71
runtime.gentraceback(0x80503ad, 0xb74250d0, 0x0, 0x18402e60, 0x0, 0x0, 0x7fffffff,
0x8064220, 0xbfc55498, 0x8063e00)
    /usr/local/go/src/pkg/runtime/traceback_x86.c:167 +0x667
copyabletopsegment(0x18402e60)
    /usr/local/go/src/pkg/runtime/stack.c:332 +0x95
runtime.newstack()
    /usr/local/go/src/pkg/runtime/stack.c:746 +0x3a4
runtime.morestack()
    /usr/local/go/src/pkg/runtime/asm_386.s:246 +0x68
goroutine 67 [stack growth]:
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
runtime.cgocallbackg()
    /usr/local/go/src/pkg/runtime/cgocall.c:244 +0x4d fp=0xb74250d8
created by testing.RunTests
    /usr/local/go/src/pkg/testing/testing.go:504 +0x6d0
goroutine 16 [chan receive]:
testing.RunTests(0x81d2dc8, 0x82762a0, 0x2f, 0x2f, 0x806f001)
    /usr/local/go/src/pkg/testing/testing.go:505 +0x70c
testing.Main(0x81d2dc8, 0x82762a0, 0x2f, 0x2f, 0x8272710, 0x1, 0x1, 0x828a060, 0x0, 0x0)
    /usr/local/go/src/pkg/testing/testing.go:435 +0x6e
main.main()
    _/tmp/gobuilder/linux-386-387-238ff8c01b5b/go/misc/cgo/test/_test/_testmain.go:141 +0x86
goroutine 19 [finalizer wait]:
runtime.park(0x805ec30, 0x82843e4, 0x82785e9)
    /usr/local/go/src/pkg/runtime/proc.c:1369 +0x94
runtime.parkunlock(0x82843e4, 0x82785e9)
    /usr/local/go/src/pkg/runtime/proc.c:1385 +0x3f
runfinq()
    /usr/local/go/src/pkg/runtime/mgc0.c:2636 +0xc5
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1445
goroutine 20 [syscall]:
os/signal.loop()
    /usr/local/go/src/pkg/os/signal/signal_unix.go:21 +0x24
created by os/signal.init·1
    /usr/local/go/src/pkg/os/signal/signal_unix.go:27 +0x37
goroutine 17 [syscall]:
runtime.cgocallbackg()
    /usr/local/go/src/pkg/runtime/cgocall.c:244 +0x4d
runtime.cgocallback_gofunc(0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/runtime/asm_386.s:690 +0x5a
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1445
goroutine 50 [syscall]:
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1445
exit status 2
FAIL    _/tmp/gobuilder/linux-386-387-238ff8c01b5b/go/misc/cgo/test 2.167s
Build complete, duration 13m16.183272956s. Result: error: exit status 1
gopherbot commented 10 years ago

Comment 4 by leterip:

Awesome. I was (and still am) unable to reproduce on my 32 bit linux VM.
gopherbot commented 10 years ago

Comment 5 by leterip:

If I set up an environment with the repro ready to go for you guys to ssh to, would that
help? I'm willing to help in any way I can.
rsc commented 10 years ago

Comment 6:

Too late.

Labels changed: added release-go1.4, removed release-go1.3maybe.

gopherbot commented 10 years ago

Comment 7 by leterip:

So the crash is happening for me in gentraceback. I added some prints and stuff to the
code to try to debug further. Here is one such output:
starting gentraceback(0xffffffff, 0xffffffff, 0x0, 0x10602640, 0, 0x0, 2147483647,
0x19324, 0x41db8bd0, 0) used from sycallstack=0x404b3fd0 pc0=0x14d24 sp0=0x404b3ee8
pc=0x14d24 lr=0x41f9c fp=0x404b3ef0 sp=0x404b3ee8 argp=0x404b3ef4 arglen=0
runtime.cgocallbackg
pc=0x41f9c lr=0x41ee4 fp=0x404b3efc sp=0x404b3ef0 argp=0x404b3f00 arglen=12
runtime.cgocallback_gofunc
pc=0x41ee4 lr=0x26954 fp=0x404b3f00 sp=0x404b3efc argp=0x404b3f04 arglen=8
runtime.asmcgocall
pc=0x26954 lr=0x11629c fp=0x404b3f00 sp=0x404b3f00 argp=0x404b3f04 arglen=4 exitsyscall0
pc=0x11629c lr=0x11629c fp=0x404b3f14 sp=0x404b3f00 argp=0x404b3f18 arglen=16
github.com/spacemonkeygo/openssl._C2func_SSL_do_handshake
pc=0x11629c lr=0x404b3f01 fp=0x404b3f28 sp=0x404b3f14 
I'm unable to find a shorter reproduction still, but I think this happens when a
gentraceback is called on a goroutine that is returning from a C->Go callback in the
cgocallbackg function right before/after it hits entersyscall. I have no idea if this
theory is possible or makes sense.
I have also discovered that if the bio free does not call back into Go, the repro fails
to fail. Since this clearly doesn't fix the underlying cause, I still think there's a
bug on the Go side of things here (it doesn't look like there's stack corruption from
the C code, but I could be way wrong there too). This commit shows the change:
https://github.com/spacemonkeygo/openssl/commit/05477109fd870990155ef0806ab04e91312412f
Hopefully this information helps, and the offer to set up a debug environment or to
provide any information I can still stands. Any suggestions on where to look or
strategies used to debug this stuff I'd also gladly accept :)
mirtchovski commented 10 years ago

Comment 8:

I'm having the same issues. Program completes successfully on 1.2.2 but crashes with
above message on tip. This is OSX, 10.8.5...
I'm working towards a shorter reproducible case but can't promise much. if you have any
ideas how to help debug this please share.
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
fatal error: unknown caller pc
runtime stack:
runtime.throw(0x4aab451)
    /Users/aam/go/src/pkg/runtime/panic.c:523 +0x77 fp=0xb0621ba0 sp=0xb0621b88
runtime.gentraceback(0x473a8a3, 0x4ebcd40, 0x0, 0xc208064b40, 0x0, 0x0,
0xaa8280007fffffff, 0x473edc0, 0xb0621d78, 0xaa8280000596c000)
    /Users/aam/go/src/pkg/runtime/traceback_x86.c:180 +0xbb7 fp=0xb0621c88 sp=0xb0621ba0
addstackroots(0xc208064b40, 0xb0621d78)
    /Users/aam/go/src/pkg/runtime/mgc0.c:1658 +0x14a fp=0xb0621d08 sp=0xb0621c88
markroot(0xc208018480, 0xf)
    /Users/aam/go/src/pkg/runtime/mgc0.c:1296 +0xbe fp=0xb0621d88 sp=0xb0621d08
runtime.parfordo(0xc208018480)
    /Users/aam/go/src/pkg/runtime/parfor.c:103 +0x128 fp=0xb0621e00 sp=0xb0621d88
runtime.gchelper()
    /Users/aam/go/src/pkg/runtime/mgc0.c:2084 +0x46 fp=0xb0621e20 sp=0xb0621e00
stopm()
    /Users/aam/go/src/pkg/runtime/proc.c:961 +0x149 fp=0xb0621e38 sp=0xb0621e20
startlockedm(0xc208064b40)
    /Users/aam/go/src/pkg/runtime/proc.c:1104 +0xa8 fp=0xb0621e58 sp=0xb0621e38
schedule()
    /Users/aam/go/src/pkg/runtime/proc.c:1356 +0x127 fp=0xb0621e80 sp=0xb0621e58
park0(0xc208002120)
    /Users/aam/go/src/pkg/runtime/proc.c:1414 +0x133 fp=0xb0621ea0 sp=0xb0621e80
runtime.mcall(0x47333cd)
    /Users/aam/go/src/pkg/runtime/asm_amd64.s:183 +0x52 fp=0xb0621eb0 sp=0xb0621ea0
goroutine 16 [runnable]:
syscall.Syscall6(0x158, 0xa, 0xc20817e000, 0x1000, 0xc208176b68, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/aam/go/src/pkg/syscall/asm_darwin_amd64.s:44 +0x5 fp=0xc208176ac0 sp=0xc208176ab8
syscall.Getdirentries(0xa, 0xc20817e000, 0x1000, 0x1000, 0xc208176b68, 0x10, 0x0, 0x0)
    /Users/aam/go/src/pkg/syscall/zsyscall_darwin_amd64.go:560 +0x97 fp=0xc208176b28 sp=0xc208176ac0
syscall.ReadDirent(0xa, 0xc20817e000, 0x1000, 0x1000, 0xc208034680, 0x0, 0x0)
    /Users/aam/go/src/pkg/syscall/syscall_bsd.go:71 +0x78 fp=0xc208176b80 sp=0xc208176b28
os.(*File).readdirnames(0xc2081ea018, 0xfffffffffffffffc, 0xc208034680, 0x3, 0x64, 0x0,
0x0)
    /Users/aam/go/src/pkg/os/dir_unix.go:39 +0x1c9 fp=0xc208176c60 sp=0xc208176b80
os.(*File).Readdirnames(0xc2081ea018, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/aam/go/src/pkg/os/doc.go:134 +0xa5 fp=0xc208176ca0 sp=0xc208176c60
os.(*File).readdir(0xc2081ea018, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/aam/go/src/pkg/os/file_unix.go:158 +0xca fp=0xc208176da8 sp=0xc208176ca0
os.(*File).Readdir(0xc2081ea018, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/aam/go/src/pkg/os/doc.go:115 +0xa5 fp=0xc208176de8 sp=0xc208176da8
io/ioutil.ReadDir(0xc20818a200, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/aam/go/src/pkg/io/ioutil/ioutil.go:105 +0xed fp=0xc208176e70 sp=0xc208176de8
gopherbot commented 10 years ago

Comment 9 by hector@marcansoft.com:

We're seeing this too, on linux-amd64, again on cgo code with callbacks back into go (in
this case, a libjpeg wrapper). I don't have a good test case yet though. I've
abbreviated the dump; this happened during load testing of a server so there are lots of
goroutines alive.
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
fatal error: unknown caller pc
runtime stack:
runtime.throw(0xa695d1)
    /usr/local/go/src/pkg/runtime/panic.c:520 +0x69 fp=0x7fe4d5642b60 sp=0x7fe4d5642b48
runtime.gentraceback(0x406f0c, 0xc20821d7a8, 0x0, 0xc20833e6c0, 0x7fe400000000, 0x0,
0x7fffffff, 0x40b4c0, 0x7fe4d5642d48, 0x2ac0ffe4da1ce000)
    /usr/local/go/src/pkg/runtime/traceback_x86.c:180 +0xba9 fp=0x7fe4d5642c48 sp=0x7fe4d5642b60
addstackroots(0xc20833e6c0, 0x7fe4d5642d48)
    /usr/local/go/src/pkg/runtime/mgc0.c:1692 +0x168 fp=0x7fe4d5642cd8 sp=0x7fe4d5642c48
markroot(0xc208018480, 0x89)
    /usr/local/go/src/pkg/runtime/mgc0.c:1321 +0xbe fp=0x7fe4d5642d58 sp=0x7fe4d5642cd8
runtime.parfordo(0xc208018480)
    /usr/local/go/src/pkg/runtime/parfor.c:103 +0x128 fp=0x7fe4d5642dd0 sp=0x7fe4d5642d58
runtime.gchelper()
    /usr/local/go/src/pkg/runtime/mgc0.c:2118 +0x3f fp=0x7fe4d5642df0 sp=0x7fe4d5642dd0
stopm()
    /usr/local/go/src/pkg/runtime/proc.c:957 +0x111 fp=0x7fe4d5642e08 sp=0x7fe4d5642df0
gcstopm()
    /usr/local/go/src/pkg/runtime/proc.c:1122 +0xc4 fp=0x7fe4d5642e28 sp=0x7fe4d5642e08
schedule()
    /usr/local/go/src/pkg/runtime/proc.c:1321 +0x4a fp=0x7fe4d5642e50 sp=0x7fe4d5642e28
park0(0xc20833e6c0)
    /usr/local/go/src/pkg/runtime/proc.c:1410 +0xfe fp=0x7fe4d5642e70 sp=0x7fe4d5642e50
runtime.mcall(0x402a94)
    /usr/local/go/src/pkg/runtime/asm_amd64.s:181 +0x4b fp=0x7fe4d5642e80 sp=0x7fe4d5642e70
[...]
goroutine 9629 [runnable]:
runtime.cgocallbackg()
    /usr/local/go/src/pkg/runtime/cgocall.c:244 +0x4c fp=0xc2087117b8 sp=0xc2087117a8
runtime.cgocallback_gofunc(0x406d92, 0x402660, 0xc208711830)
    /usr/local/go/src/pkg/runtime/asm_amd64.s:785 +0x67 fp=0xc2087117c8 sp=0xc2087117b8
exitsyscall0(0x402660)
    /usr/local/go/src/pkg/runtime/proc.c:1674 fp=0xc2087117d0 sp=0xc2087117c8
runtime.cgocall(0x402660, 0xc208711830)
    /usr/local/go/src/pkg/runtime/cgocall.c:144 +0x102 fp=0xc208711818 sp=0xc2087117d0
github.com/pixiv/go-thumber/jpeg._Cfunc_jpeg_read_raw_data(0xc2080bf8c0, 0xc20800e720,
0x4, 0xc200000002)
    github.com/pixiv/go-thumber/jpeg/_obj/_cgo_defun.c:172 +0x31 fp=0xc208711830 sp=0xc208711818
[...]
goroutine 9665 [runnable]:
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
runtime.cgocallbackg()
    /usr/local/go/src/pkg/runtime/cgocall.c:244 +0x4c fp=0xc20821d7b8 sp=0xc20821d7a8
created by net/http.(*Server).Serve
    /usr/local/go/src/pkg/net/http/server.go:1721 +0x313
[...]
Note how a stacktrace of goroutine 9629 seems to work, while 9665 fails even though they
both seem to be in the same or a very similar code path (the low bits of SP match...).
gopherbot commented 10 years ago

Comment 10 by marcan@pixiv.co.jp:

I'm pretty sure I figured out what's going on. The tl;dr version is that
runtime.cgocallbackg doesn't play nice with runtime.entersyscall and
runtime.exitsyscall. They assume that the stack frame that called entersyscall is valid
until exitsyscall is called, which is not the case when returning from a C->Go
callback. entersyscall saves the caller's SP for use by the GC/traceback functions, but
by the time exitsyscall is called that frame is no longer valid.
This is a diagram of what I think happens:
* = g->syscallsp
g stack ========>                 g0 stack =======>
 Go [ReadJPEG()]
   runtime.cgocall(...) [C.jpeg_read_raw_data]
    *runtime.entersyscall()
    *runtime.asmcgocall(...)
    *  === STACK SWITCH ===
    *                             _cgo_C...()
    *                               c_func() [jpeg_read_raw_data()]
    *                                 GoF() [mgr->fill_input_buffer() --> sourceFill()]
    *                                   crosscall2(...)
    *                                     _cgoexp_GoF()
    *                                       runtime.cgocallback(...)
    *                                         runtime.cgocallback_gofunc(...)
    *                                           === STACK SWITCH ===
    *  runtime.cgocallbackg()
    *    runtime.exitsyscall()
         runtime.cgocallbackg1()
           runtime.newstackcall(...)
             === STACK SWITCH ===
                                                runtime.newstack(...)
                                                  runtime.gogo(...)
                                                    === STACK SWITCH ===
             +> stack segment boundary
             | pkg.GoF() [jpeg.sourceFill()]
             +<
           <-runtime.newstackcall
         <-runtime.cgocallbackg1
  /     *runtime.entersyscall()
 |     <*runtime.cgocallbackg
 |     =*= STACK SWITCH ===
 |      *                                     <-runtime.cgocallback_gofunc
 |      *                                   <-runtime.cgocallback
 |      *                                 <-_cgoexp_GoF
 |      *                               <-crosscall2
 |      *                             <-GoF
 |      *                           <-c_func [jpeg_read_raw_data]
 |      *                         <-_cgo_C...
 |      *                         === STACK SWITCH ===
 |   <-r*ntime.asmcgocall
 |\  run*ime.exitsyscall()  <- actual crash happens here (1)
 | <-runtime.cgocall()
 |
 +- danger area, g->syscallsp < SP
(1) exitsyscall calls runtime.mcall before setting syscallsp to nil.
    runtime.mcall yields. Eventually the GC gets called, tries to trace the
    stack starting at g->syscallsp, which is now garbage.
This is a raw stack dump that I got while reproducing the original issue with a patched
traceback function:
goroutine 55430 [runnable]:
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
pc0=0x4081fc sp0=0xc208855d70 lr0=0x0
gp->status=1
gp->syscallstack=0xc208859fa8
gp->syscallsp=0xc208855d70
gp->syscallpc=0x4081fc
gp->syscallguard=0xc208852100
gp->sched.sp=0xc208855d88
gp->sched.pc=0x417eec
gp->stackbase=0xc208859fa8
gp->stackguard=0xc208852100
gp->sched.ctxt=0x0
stack dump:
0xc208855cf0: 0x411f37 (runtime.futexsleep+0x0x47)
0xc208855cf8: 0xa84400
0xc208855d00: 0x200000000
0xc208855d08: 0x412002 (runtime.futexwakeup+0x0x52)
0xc208855d10: 0xa84400
0xc208855d18: 0x100000001
0xc208855d20: 0x0
0xc208855d28: 0x0
0xc208855d30: 0x0
0xc208855d38: 0xa84400
0xc208855d40: 0x408668 (runtime.unlock+0x0x68)
0xc208855d48: 0xa84400
0xc208855d50: 0x200000001
0xc208855d58: 0x200408483
0xc208855d60: 0x417fec (exitsyscallfast+0x0xbc)
0xc208855d68: 0xa84400
0xc208855d70: 0xc20806c00c <=== SP
0xc208855d78: 0x0
0xc208855d80: 0x417eec (runtime.exitsyscall+0x0x10c) <=== FP
0xc208855d88: 0x4180c0 (exitsyscall0+0x0x0)
0xc208855d90: 0x408082 (runtime.cgocall+0x0x102)
0xc208855d98: 0x402720
0xc208855da0: 0xc208855df8
0xc208855da8: 0x7f0100000000
0xc208855db0: 0xffffffffffffffff
0xc208855db8: 0x19000
0xc208855dc0: 0xa660d8
0xc208855dc8: 0xc2080bac30
0xc208855dd0: 0x42836b (runtime.new+0x0x3b)
0xc208855dd8: 0x566311
(github.com/pixiv/go-thumber/jpeg._Cfunc_jpeg_read_raw_data+0x0x31)
0xc208855de0: 0x402720
0xc208855de8: 0xc208855df8
0xc208855df0: 0x5635c0 (github.com/pixiv/go-thumber/jpeg.ReadJPEG+0x0x770)
Hence we see a traceback where the tracer thinks it's in cgocallbackg (which called
entersyscall), but the actual stack shows runtime.cgocall -> runtime.exitsyscall
-> runtime.mcall, and the traceback routine is very confused since it can't find the
LR in the right place. Note how gp->sched.sp > gp->syscallsp. I'm dumping some
stack before the SP too, but those are invalid/leftover stack frames, so take them with
a grain of salt.
I've attached a patch that adds a sanity check to exitsyscall that panics if this ever
happens, and a tentative fix to runtime.cgocallback that saves the syscallsp and
restores it after entersyscall. I'm not sure if this is sufficient, though; I suspect it
also needs to be restored in a defer in case of a panic. I didn't see any panic handling
code in entersyscall/exitsyscall themselves so I'm not sure (haven't checked the normal
syscall path yet). The sanity check will go off on every c->go callback if patched in
alone, so that's an easy way to catch the issue without having to wait for the stars to
align and the GC to get called in the right place and cause a crash.
(Note: I've just signed the Individual CLA in case you want to use the patch; I assume
it might take a while to get processed though.)

Attachments:

  1. cgocallback_syscall_fix.patch (1699 bytes)
gopherbot commented 10 years ago

Comment 11 by hector@marcansoft.com:

I'm pretty sure I figured out what's going on. The tl;dr version is that
runtime.cgocallbackg doesn't play nice with runtime.entersyscall and
runtime.exitsyscall. They assume that the stack frame that called entersyscall is valid
until exitsyscall is called, which is not the case when returning from a C->Go
callback. entersyscall saves the caller's SP for use by the GC/traceback functions, but
by the time exitsyscall is called that frame is no longer valid.
This is a diagram of what I think happens:
* = g->syscallsp
g stack ========>                 g0 stack =======>
 Go [ReadJPEG()]
   runtime.cgocall(...) [C.jpeg_read_raw_data]
    *runtime.entersyscall()
    *runtime.asmcgocall(...)
    *  === STACK SWITCH ===
    *                             _cgo_C...()
    *                               c_func() [jpeg_read_raw_data()]
    *                                 GoF() [mgr->fill_input_buffer() --> sourceFill()]
    *                                   crosscall2(...)
    *                                     _cgoexp_GoF()
    *                                       runtime.cgocallback(...)
    *                                         runtime.cgocallback_gofunc(...)
    *                                           === STACK SWITCH ===
    *  runtime.cgocallbackg()
    *    runtime.exitsyscall()
         runtime.cgocallbackg1()
           runtime.newstackcall(...)
             === STACK SWITCH ===
                                                runtime.newstack(...)
                                                  runtime.gogo(...)
                                                    === STACK SWITCH ===
             +> stack segment boundary
             | pkg.GoF() [jpeg.sourceFill()]
             +<
           <-runtime.newstackcall
         <-runtime.cgocallbackg1
  /     *runtime.entersyscall()
 |     <*runtime.cgocallbackg
 |     =*= STACK SWITCH ===
 |      *                                     <-runtime.cgocallback_gofunc
 |      *                                   <-runtime.cgocallback
 |      *                                 <-_cgoexp_GoF
 |      *                               <-crosscall2
 |      *                             <-GoF
 |      *                           <-c_func [jpeg_read_raw_data]
 |      *                         <-_cgo_C...
 |      *                         === STACK SWITCH ===
 |   <-r*ntime.asmcgocall
 |\  run*ime.exitsyscall()  <- actual crash happens here (1)
 | <-runtime.cgocall()
 |
 +- danger area, g->syscallsp < SP
(1) exitsyscall calls runtime.mcall before setting syscallsp to nil.
    runtime.mcall yields. Eventually the GC gets called, tries to trace the
    stack starting at g->syscallsp, which is now garbage.
This is a raw stack dump that I got while reproducing the original issue with a patched
traceback function:
goroutine 55430 [runnable]:
runtime: unexpected return pc for runtime.cgocallbackg called from 0x0
pc0=0x4081fc sp0=0xc208855d70 lr0=0x0
gp->status=1
gp->syscallstack=0xc208859fa8
gp->syscallsp=0xc208855d70
gp->syscallpc=0x4081fc
gp->syscallguard=0xc208852100
gp->sched.sp=0xc208855d88
gp->sched.pc=0x417eec
gp->stackbase=0xc208859fa8
gp->stackguard=0xc208852100
gp->sched.ctxt=0x0
stack dump:
0xc208855cf0: 0x411f37 (runtime.futexsleep+0x0x47)
0xc208855cf8: 0xa84400
0xc208855d00: 0x200000000
0xc208855d08: 0x412002 (runtime.futexwakeup+0x0x52)
0xc208855d10: 0xa84400
0xc208855d18: 0x100000001
0xc208855d20: 0x0
0xc208855d28: 0x0
0xc208855d30: 0x0
0xc208855d38: 0xa84400
0xc208855d40: 0x408668 (runtime.unlock+0x0x68)
0xc208855d48: 0xa84400
0xc208855d50: 0x200000001
0xc208855d58: 0x200408483
0xc208855d60: 0x417fec (exitsyscallfast+0x0xbc)
0xc208855d68: 0xa84400
0xc208855d70: 0xc20806c00c <=== SP
0xc208855d78: 0x0
0xc208855d80: 0x417eec (runtime.exitsyscall+0x0x10c) <=== FP
0xc208855d88: 0x4180c0 (exitsyscall0+0x0x0)
0xc208855d90: 0x408082 (runtime.cgocall+0x0x102)
0xc208855d98: 0x402720
0xc208855da0: 0xc208855df8
0xc208855da8: 0x7f0100000000
0xc208855db0: 0xffffffffffffffff
0xc208855db8: 0x19000
0xc208855dc0: 0xa660d8
0xc208855dc8: 0xc2080bac30
0xc208855dd0: 0x42836b (runtime.new+0x0x3b)
0xc208855dd8: 0x566311
(github.com/pixiv/go-thumber/jpeg._Cfunc_jpeg_read_raw_data+0x0x31)
0xc208855de0: 0x402720
0xc208855de8: 0xc208855df8
0xc208855df0: 0x5635c0 (github.com/pixiv/go-thumber/jpeg.ReadJPEG+0x0x770)
Hence we see a traceback where the tracer thinks it's in cgocallbackg (which called
entersyscall), but the actual stack shows runtime.cgocall -> runtime.exitsyscall
-> runtime.mcall, and the traceback routine is very confused since it can't find the
LR in the right place. Note how gp->sched.sp > gp->syscallsp. I'm dumping some
stack before the SP too, but those are invalid/leftover stack frames, so take them with
a grain of salt.
I've attached a patch that adds a sanity check to exitsyscall that panics if this ever
happens, and a tentative fix to runtime.cgocallback that saves the syscallsp and
restores it after entersyscall. I'm not sure if this is sufficient, though; I suspect it
also needs to be restored in a defer in case of a panic. I didn't see any panic handling
code in entersyscall/exitsyscall themselves so I'm not sure (haven't checked the normal
syscall path yet). The sanity check will go off on every c->go callback if patched in
alone, so that's an easy way to catch the issue without having to wait for the stars to
align and the GC to get called in the right place and cause a crash.
(Note: I've just signed the Individual CLA in case you want to use the patch; I assume
it might take a while to get processed though.)

Attachments:

  1. cgocallback_syscall_fix.patch (1699 bytes)
davecheney commented 10 years ago

Comment 12:

Thanks for the efforts to debug this issue. We cannot accept patches via the issue
tracker, please use the code review tool described in the
http://golang.org/doc/contribute.html

Status changed to Started.

gopherbot commented 10 years ago

Comment 13 by hector@marcansoft.com:

CL sent for review: https://golang.org/cl/131910043/
Upon thinking about it for a bit, I don't think any special handling for panics is
needed, since AFAICT those should only happen in go code that isn't "in a syscall" (and
should unwind skipping any C code/stack paths).
gopherbot commented 10 years ago

Comment 14:

CL https://golang.org/cl/131910043 mentions this issue.
dvyukov commented 10 years ago

Comment 15:

Russ, we need to fix it for 1.4. It renders cgo callbacks broken.
rsc commented 10 years ago

Comment 16:

This issue was closed by revision 7283e08cbf06bcd32a391183e26080cff301e7f.

Status changed to Fixed.

gopherbot commented 10 years ago

Comment 17:

CL https://golang.org/cl/152970043 mentions this issue.