golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.14k stars 17.56k forks source link

runtime: unlikely looking panic in append #5091

Closed rogpeppe closed 9 years ago

rogpeppe commented 11 years ago
I have seen this panic just once, after running
the same tests several times previously.

It looks like an indication of a deeper problem.
Unfortunately the program in question *does* use
cgo (although the code in question has been stable
for a long time) so it could be a problem related
to how the new GC is interacting with the cgo code.

The line of code that invoked the panicing append
is this:

    w.syncEvents = append(w.syncEvents, event{info.ch, key, revno})

GOMAXPROCS=5
unexpected fault address 0x0
fatal error: fault
[signal 0xb code=0x80 addr=0x0 pc=0x40bb67]

goroutine 1098 [running]:
[fp=0x7f6971ef28a8] runtime.throw(0xd96617)
    /home/rog/go/src/pkg/runtime/panic.c:473 +0x67
[fp=0x7f6971ef28c0] runtime.sigpanic()
    /home/rog/go/src/pkg/runtime/os_linux.c:239 +0xe7
[fp=0x7f6971ef2900] runtime.MCentral_AllocList(0x7f6971880470, 0x20, 0x7f6971ef2928)
    /home/rog/go/src/pkg/runtime/mcentral.c:64 +0xb7
[fp=0x7f6971ef2940] runtime.MCache_Alloc(0x7f697203afd0, 0xd, 0xc0, 0x1)
    /home/rog/go/src/pkg/runtime/mcache.c:24 +0x83
[fp=0x7f6971ef2998] runtime.mallocgc(0xc0, 0x100000000, 0xc200000001)
    /home/rog/go/src/pkg/runtime/zmalloc_linux_amd64.c:47 +0xf5
[fp=0x7f6971ef29d0] makeslice1(0x8189e0, 0x2, 0x4, 0x7f6971ef2a70)
    /home/rog/go/src/pkg/runtime/slice.c:63 +0xb6
[fp=0x7f6971ef2a00] growslice1(0x8189e0, 0xc2002171e0, 0x2, 0x2, 0x3, ...)
    /home/rog/go/src/pkg/runtime/slice.c:230 +0x56
[fp=0x7f6971ef2a48] runtime.growslice(0x8189e0, 0xc2002171e0, 0x2, 0x2, 0x1, ...)
    /home/rog/go/src/pkg/runtime/slice.c:197 +0x9f
[fp=0x7f6971ef2f18] launchpad.net/juju-core/state/watcher.(*Watcher).sync(0xc20032c160,
0xc200869d20, 0x7f6971ef2f70)
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:410 +0xcdb
[fp=0x7f6971ef2f90] launchpad.net/juju-core/state/watcher.(*Watcher).loop(0xc20032c160,
0x0, 0xc20031a028)
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:222 +0x16b
[fp=0x7f6971ef2fb8] launchpad.net/juju-core/state/watcher.func·001()
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:111 +0x2c
[fp=0x7f6971ef2fc0] runtime.goexit()
    /home/rog/go/src/pkg/runtime/proc.c:1214
created by launchpad.net/juju-core/state/watcher.New
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:113 +0x100

Go version:

ff779e477085 tip
rogpeppe commented 11 years ago

Comment 1:

I'm also seeing very occasional SIGSEGVs with no obvious currently running process.
Same revision of Go.
For example:
GOMAXPROCS=83
SIGSEGV: segmentation violation
PC=0x4090e2
goroutine 1 [chan receive]:
testing.RunTests(0xaae7a0, 0xd90710, 0x1, 0x1, 0xc2ffffff01, ...)
    /home/rog/go/src/pkg/testing/testing.go:427 +0x88e
testing.Main(0xaae7a0, 0xd90710, 0x1, 0x1, 0xd9ae80, ...)
    /home/rog/go/src/pkg/testing/testing.go:358 +0x8a
main.main()
    launchpad.net/juju-core/worker/uniter/_test/_testmain.go:43 +0x9a
goroutine 2 [syscall]:
goroutine 5 [chan receive]:
launchpad.net/juju-core/environs/dummy.func·001()
    /home/rog/src/go/src/launchpad.net/juju-core/environs/dummy/environs.go:160 +0x3a
created by launchpad.net/juju-core/environs/dummy.init·1
    /home/rog/src/go/src/launchpad.net/juju-core/environs/dummy/environs.go:162 +0xc1
goroutine 6 [chan receive]:
launchpad.net/gocheck.(*suiteRunner).runTest(0xc200249a10, 0xc200249540, 0xc2005b8c80)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:723 +0x4f
launchpad.net/gocheck.(*suiteRunner).run(0xc200249a10, 0xc200175240)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:564 +0x1c5
launchpad.net/gocheck.Run(0x9a48c0, 0xc200175240, 0xc20026c280, 0xc200805f80)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:65 +0x47
launchpad.net/gocheck.RunAll(0xc20026c280, 0xc20026c280)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:57 +0xc2
launchpad.net/gocheck.TestingT(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:45 +0x98
launchpad.net/juju-core/testing.MgoTestPackage(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:119 +0xcf
launchpad.net/juju-core/worker/uniter_test.TestPackage(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/juju-core/worker/uniter/uniter_test.go:32 +0x25
testing.tRunner(0xc2001753f0, 0xd90710)
    /home/rog/go/src/pkg/testing/testing.go:346 +0x8a
created by testing.RunTests
    /home/rog/go/src/pkg/testing/testing.go:426 +0x86b
goroutine 7 [syscall]:
syscall.Syscall()
    /home/rog/go/src/pkg/syscall/asm_linux_amd64.s:16 +0x5
syscall.read(0x4, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/syscall/zerrors_linux_amd64.go:2337 +0x70
syscall.Read(0x4, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/syscall/syscall_unix.go:132 +0x5a
os.(*File).read(0xc200000fc0, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/os/file_unix.go:174 +0x60
os.(*File).Read(0xc200000fc0, 0xc20028b000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/os/file.go:95 +0x96
bufio.(*Reader).fill(0xc2002e4000)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).ReadSlice(0xc2002e4000, 0xc1fff9630a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:262 +0x202
bufio.(*Reader).ReadBytes(0xc2002e4000, 0x48850a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:343 +0xb7
bufio.(*Reader).ReadString(0xc2002e4000, 0xa, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:383 +0x53
launchpad.net/juju-core/testing.readLines(0xc20015d840, 0xc200000fc0, 0x14,
0xc20015d840, 0xc200000fc0, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:136 +0xa4
launchpad.net/juju-core/testing.func·004()
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:80 +0x72
created by launchpad.net/juju-core/testing.startMgoServer
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:91 +0x4f8
goroutine 3345 [running]:
syscall.Syscall()
    /home/rog/go/src/pkg/syscall/asm_linux_amd64.s:16 +0x5
created by launchpad.net/gocheck.(*suiteRunner).forkCall
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:609 +0x23e
goroutine 17 [finalizer wait]:
goroutine 3347 [select]:
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200298000)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:314 +0x45b
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf
goroutine 883 [select]:
launchpad.net/gocheck.(*resultTracker)._loopRoutine(0xc200665d00)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:388 +0x2ab
created by launchpad.net/gocheck.(*resultTracker).start
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:368 +0x2c
goroutine 888 [select]:
net/http.(*persistConn).writeLoop(0xc200665d80)
    /home/rog/go/src/pkg/net/http/transport.go:763 +0x26f
created by net/http.(*Transport).dialConn
    /home/rog/go/src/pkg/net/http/transport.go:512 +0x58b
goroutine 887 [IO wait]:
net.runtime_pollWait(0x7fe8b3fbfd20, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006d6590, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2006d6510, 0xc2007be000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2001ac6a0, 0xc2007be000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
bufio.(*Reader).fill(0xc200512f00)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).Peek(0xc200512f00, 0x1, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:107 +0xc9
net/http.(*persistConn).readLoop(0xc200665d80)
    /home/rog/go/src/pkg/net/http/transport.go:670 +0xc4
created by net/http.(*Transport).dialConn
    /home/rog/go/src/pkg/net/http/transport.go:511 +0x574
goroutine 885 [IO wait]:
net.runtime_pollWait(0x7fe8ccdb9d20, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006d6500, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).accept(0xc2006d6480, 0xaae8a8, 0x0, 0xc20015d9f0, 0xb, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:382 +0x2c1
net.(*TCPListener).AcceptTCP(0xc2001ac630, 0x6bdbf6, 0x0, 0x0)
    /home/rog/go/src/pkg/net/tcpsock_posix.go:232 +0x57
net.(*TCPListener).Accept(0xc2001ac630, 0x0, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/net/tcpsock_posix.go:242 +0x49
net/http.(*Server).Serve(0xc2003b0230, 0xc20026c040, 0xc2001ac630, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/net/http/server.go:1319 +0x85
net/http.Serve(0xc20026c040, 0xc2001ac630, 0xc200396480, 0xc20024a720, 0x1, ...)
    /home/rog/go/src/pkg/net/http/server.go:1258 +0x68
created by launchpad.net/juju-core/testing.(*HTTPServer).Start
    /home/rog/src/go/src/launchpad.net/juju-core/testing/http.go:67 +0x20f
goroutine 889 [IO wait]:
net.runtime_pollWait(0x7fe8b3fbfc80, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc20031d620, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc20031d5a0, 0xc2007c0000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2004a4f20, 0xc2007c0000, 0x1000, 0x1000, 0x7fe8b3febae0, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
net/http.(*switchReader).Read(0xc20053fea8, 0xc2007c0000, 0x1000, 0x1000, 0x49475e, ...)
    /home/rog/go/src/pkg/net/http/server.go:192 +0x91
io.(*LimitedReader).Read(0xc2009bcf00, 0xc2007c0000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/io/io.go:382 +0xc0
bufio.(*Reader).fill(0xc2005bad20)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).ReadSlice(0xc2005bad20, 0x10000000a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:262 +0x202
bufio.(*Reader).ReadLine(0xc2005bad20, 0x0, 0x0, 0x0, 0xc2007b0800, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:293 +0x61
net/textproto.(*Reader).readLineSlice(0xc200bee030, 0xe00000bf3, 0x41fb4f, 0xc20079c5b0,
0x993aa0, ...)
    /home/rog/go/src/pkg/net/textproto/reader.go:55 +0x51
net/textproto.(*Reader).ReadLine(0xc200bee030, 0xc20079c5b0, 0xc2007c1000, 0x99f740,
0x58cf98, ...)
    /home/rog/go/src/pkg/net/textproto/reader.go:36 +0x25
net/http.ReadRequest(0xc2005bad20, 0xc20079c5b0, 0x0, 0x0)
    /home/rog/go/src/pkg/net/http/request.go:478 +0xa7
net/http.(*conn).readRequest(0xc20053fe80, 0x0, 0x0, 0x0)
    /home/rog/go/src/pkg/net/http/server.go:435 +0x1bc
net/http.(*conn).serve(0xc20053fe80)
    /home/rog/go/src/pkg/net/http/server.go:829 +0x398
created by net/http.(*Server).Serve
    /home/rog/go/src/pkg/net/http/server.go:1341 +0x266
goroutine 3886 [sleep]:
time.Sleep(0x1dcd6500)
    /home/rog/go/src/pkg/runtime/ztime_linux_amd64.c:19 +0x2f
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200a14ee0)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:291 +0x1ec
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf
goroutine 3889 [sleep]:
time.Sleep(0x1dcd6500)
    /home/rog/go/src/pkg/runtime/ztime_linux_amd64.c:19 +0x2f
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200a141c0)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:291 +0x1ec
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf
goroutine 3349 [IO wait]:
net.runtime_pollWait(0x7fe8ccdb9b40, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006f4c50, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2006f4bd0, 0xc200518800, 0x400, 0x400, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2003d8570, 0xc200518800, 0x400, 0x400, 0x4896dd, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
crypto/tls.(*block).readFromUntil(0xc200495930, 0xc20026a0c0, 0xc2003d8570, 0x5,
0xc2003d8570, ...)
    /home/rog/go/src/pkg/crypto/tls/conn.go:401 +0xbd
crypto/tls.(*Conn).readRecord(0xc200457c80, 0x17, 0x0, 0x4089f6)
    /home/rog/go/src/pkg/crypto/tls/conn.go:481 +0xfa
crypto/tls.(*Conn).Read(0xc200457c80, 0xc2009b10f0, 0x24, 0x24, 0x0, ...)
    /home/rog/go/src/pkg/crypto/tls/conn.go:796 +0x102
labix.org/v2/mgo.fill(0xc2001e8de0, 0xc200457c80, 0xc2009b10f0, 0x24, 0x24, ...)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:376 +0x59
labix.org/v2/mgo.(*mongoSocket).readLoop(0xc2004fd750)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:393 +0x113
created by labix.org/v2/mgo.newSocket
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:111 +0x24e
rax     0x0
rbx     0xc2007255a0
rcx     0xaa6fa9f0e9525c41
rdx     0x0
rdi     0x0
rsi     0x3
rbp     0x0
rsp     0x7fe8c77fd9c0
r8      0x7fe8c77fdab8
r9      0x7fe8c77fda98
r10     0xc20027e830
r11     0xc20027e770
r12     0x8ee800
r13     0x7d
r14     0xc2003d8268
r15     0xc20015d000
rip     0x4090e2
rflags  0x10286
cs      0x33
fs      0x0
gs      0x0
gopherbot commented 11 years ago

Comment 2:

In the second case: if you run "gdb ./program" and do "disass 0x4090e2", what function
do you see?
rogpeppe commented 11 years ago

Comment 3:

stupidly, i didn't save the binary. i rebuilt it with yesterday's revisions, and i
*think* i've managed to get the same thing, but no guarantees.
this is what i see:
(gdb) disass 0x4090e2
Dump of assembler code for function hash_gciter_next:
   0x0000000000408eb0 <+0>:   mov    %fs:0xfffffffffffffff0,%rcx
   0x0000000000408eb9 <+9>:   cmp    (%rcx),%rsp
   0x0000000000408ebc <+12>:  ja     0x408ec3 <hash_gciter_next+19>
   0x0000000000408ebe <+14>:  callq  0x423860 <runtime.morestack16>
   0x0000000000408ec3 <+19>:  sub    $0x48,%rsp
   0x0000000000408ec7 <+23>:  mov    0x58(%rsp),%rax
   0x0000000000408ecc <+28>:  mov    0x50(%rsp),%r8
   0x0000000000408ed1 <+33>:  mov    (%r8),%rbx
   0x0000000000408ed4 <+36>:  mov    0x10(%r8),%rdi
   0x0000000000408ed8 <+40>:  mov    0x18(%r8),%rbp
   0x0000000000408edc <+44>:  mov    0x20(%r8),%rsi
   0x0000000000408ee0 <+48>:  movq   $0x0,(%rax)
   0x0000000000408ee7 <+55>:  movq   $0x0,0x8(%rax)
   0x0000000000408eef <+63>:  movq   $0x0,0x10(%rax)
   0x0000000000408ef7 <+71>:  mov    %rax,%r9
   0x0000000000408efa <+74>:  movzbl 0x9(%rbx),%ecx
   0x0000000000408efe <+78>:  and    $0x1,%ecx
   0x0000000000408f01 <+81>:  cmp    $0x0,%ecx
   0x0000000000408f04 <+84>:  je     0x409243 <hash_gciter_next+915>
etc
davecheney commented 11 years ago

Comment 4:

There were two recent fixes related to hash maps and gc. Rog could you try rolling
forward to the absolute latest, or back before the new hash map impl.
I also wonder if this error is concurrent hash map mutation exposed in a new form.
davecheney commented 11 years ago

Comment 5:

Status changed to WaitingForReply.

gopherbot commented 11 years ago

Comment 6 by tylor@torbit.com:

I don't think the recent changes have fixed the issue. We were seeing lots of random
SIGSEGVs on 19c06a3356f8, so we updated to 457cc24872ec and the situation did not
improve.
Our actual stack dumps are just as uninteresting as Roger's above. Like Roger's dump we
have many routines waiting on /runtime/znetpoll_linux_amd64.c:118, but I think that is
just a popular place to block. Our application does not use cGo.
bradfitz commented 11 years ago

Comment 7:

Do you use "unsafe"?
dustin commented 11 years ago

Comment 8:

I've been getting a lot of crashes in cbfs, but only on FreeBSD (and seemingly only at
home, though that cluster gets more use:
http://cbugg.hq.couchbase.com/bug/bug-276
I think it's been fully crashing a bit more reliably lately.  It used to kind of "soft
crash" in that the process would stay alive, but important goroutines would fall apart,
causing the cluster to reject this node.
gopherbot commented 11 years ago

Comment 9 by tylor@torbit.com:

>>Do you use "unsafe"?
Yes we do :) although our unsafe code is pretty simple and we have used it frequently in
many applications. We also use "github.com/edsrzf/mmap-go"
I'm reviewing our unsafe and mmap code right now.
We will also try running a node on a pre-hash map changes build.
alberts commented 11 years ago

Comment 10:

Might be related to a crash I'm seeing with code in go.exp:
https://golang.org/issue/4979?c=25
randall77 commented 11 years ago

Comment 11:

Try syncing to pick up https://code.google.com/p/go/source/detail?r=af469280a34b and
trying again.  This patch fixed a bug in multithreaded map reads.
rogpeppe commented 11 years ago

Comment 12:

I haven't been able to reproduce this issue for a while now (which doesn't mean there's
not something still lurking, as the issue surfaced very rarely before). If anyone above
is still seeing similar issues, say so, otherwise I'll mark the issue as Fixed.
davecheney commented 11 years ago

Comment 13:

Status changed to Fixed.