I'm also seeing very occasional SIGSEGVs with no obvious currently running process. Same revision of Go. For example:

GOMAXPROCS=83
SIGSEGV: segmentation violation
PC=0x4090e2

goroutine 1 [chan receive]:
testing.RunTests(0xaae7a0, 0xd90710, 0x1, 0x1, 0xc2ffffff01, ...)
    /home/rog/go/src/pkg/testing/testing.go:427 +0x88e
testing.Main(0xaae7a0, 0xd90710, 0x1, 0x1, 0xd9ae80, ...)
    /home/rog/go/src/pkg/testing/testing.go:358 +0x8a
main.main()
    launchpad.net/juju-core/worker/uniter/_test/_testmain.go:43 +0x9a

goroutine 2 [syscall]:

goroutine 5 [chan receive]:
launchpad.net/juju-core/environs/dummy.func·001()
    /home/rog/src/go/src/launchpad.net/juju-core/environs/dummy/environs.go:160 +0x3a
created by launchpad.net/juju-core/environs/dummy.init·1
    /home/rog/src/go/src/launchpad.net/juju-core/environs/dummy/environs.go:162 +0xc1

goroutine 6 [chan receive]:
launchpad.net/gocheck.(*suiteRunner).runTest(0xc200249a10, 0xc200249540, 0xc2005b8c80)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:723 +0x4f
launchpad.net/gocheck.(*suiteRunner).run(0xc200249a10, 0xc200175240)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:564 +0x1c5
launchpad.net/gocheck.Run(0x9a48c0, 0xc200175240, 0xc20026c280, 0xc200805f80)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:65 +0x47
launchpad.net/gocheck.RunAll(0xc20026c280, 0xc20026c280)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:57 +0xc2
launchpad.net/gocheck.TestingT(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/gocheck/run.go:45 +0x98
launchpad.net/juju-core/testing.MgoTestPackage(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:119 +0xcf
launchpad.net/juju-core/worker/uniter_test.TestPackage(0xc2001753f0)
    /home/rog/src/go/src/launchpad.net/juju-core/worker/uniter/uniter_test.go:32 +0x25
testing.tRunner(0xc2001753f0, 0xd90710)
    /home/rog/go/src/pkg/testing/testing.go:346 +0x8a
created by testing.RunTests
    /home/rog/go/src/pkg/testing/testing.go:426 +0x86b

goroutine 7 [syscall]:
syscall.Syscall()
    /home/rog/go/src/pkg/syscall/asm_linux_amd64.s:16 +0x5
syscall.read(0x4, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/syscall/zerrors_linux_amd64.go:2337 +0x70
syscall.Read(0x4, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/syscall/syscall_unix.go:132 +0x5a
os.(*File).read(0xc200000fc0, 0xc20028b000, 0x1000, 0x1000, 0x55, ...)
    /home/rog/go/src/pkg/os/file_unix.go:174 +0x60
os.(*File).Read(0xc200000fc0, 0xc20028b000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/os/file.go:95 +0x96
bufio.(*Reader).fill(0xc2002e4000)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).ReadSlice(0xc2002e4000, 0xc1fff9630a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:262 +0x202
bufio.(*Reader).ReadBytes(0xc2002e4000, 0x48850a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:343 +0xb7
bufio.(*Reader).ReadString(0xc2002e4000, 0xa, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:383 +0x53
launchpad.net/juju-core/testing.readLines(0xc20015d840, 0xc200000fc0, 0x14, 0xc20015d840, 0xc200000fc0, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:136 +0xa4
launchpad.net/juju-core/testing.func·004()
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:80 +0x72
created by launchpad.net/juju-core/testing.startMgoServer
    /home/rog/src/go/src/launchpad.net/juju-core/testing/mgo.go:91 +0x4f8

goroutine 3345 [running]:
syscall.Syscall()
    /home/rog/go/src/pkg/syscall/asm_linux_amd64.s:16 +0x5
created by launchpad.net/gocheck.(*suiteRunner).forkCall
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:609 +0x23e

goroutine 17 [finalizer wait]:

goroutine 3347 [select]:
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200298000)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:314 +0x45b
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf

goroutine 883 [select]:
launchpad.net/gocheck.(*resultTracker)._loopRoutine(0xc200665d00)
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:388 +0x2ab
created by launchpad.net/gocheck.(*resultTracker).start
    /home/rog/src/go/src/launchpad.net/gocheck/gocheck.go:368 +0x2c

goroutine 888 [select]:
net/http.(*persistConn).writeLoop(0xc200665d80)
    /home/rog/go/src/pkg/net/http/transport.go:763 +0x26f
created by net/http.(*Transport).dialConn
    /home/rog/go/src/pkg/net/http/transport.go:512 +0x58b

goroutine 887 [IO wait]:
net.runtime_pollWait(0x7fe8b3fbfd20, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006d6590, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2006d6510, 0xc2007be000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2001ac6a0, 0xc2007be000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
bufio.(*Reader).fill(0xc200512f00)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).Peek(0xc200512f00, 0x1, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:107 +0xc9
net/http.(*persistConn).readLoop(0xc200665d80)
    /home/rog/go/src/pkg/net/http/transport.go:670 +0xc4
created by net/http.(*Transport).dialConn
    /home/rog/go/src/pkg/net/http/transport.go:511 +0x574

goroutine 885 [IO wait]:
net.runtime_pollWait(0x7fe8ccdb9d20, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006d6500, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).accept(0xc2006d6480, 0xaae8a8, 0x0, 0xc20015d9f0, 0xb, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:382 +0x2c1
net.(*TCPListener).AcceptTCP(0xc2001ac630, 0x6bdbf6, 0x0, 0x0)
    /home/rog/go/src/pkg/net/tcpsock_posix.go:232 +0x57
net.(*TCPListener).Accept(0xc2001ac630, 0x0, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/net/tcpsock_posix.go:242 +0x49
net/http.(*Server).Serve(0xc2003b0230, 0xc20026c040, 0xc2001ac630, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/net/http/server.go:1319 +0x85
net/http.Serve(0xc20026c040, 0xc2001ac630, 0xc200396480, 0xc20024a720, 0x1, ...)
    /home/rog/go/src/pkg/net/http/server.go:1258 +0x68
created by launchpad.net/juju-core/testing.(*HTTPServer).Start
    /home/rog/src/go/src/launchpad.net/juju-core/testing/http.go:67 +0x20f

goroutine 889 [IO wait]:
net.runtime_pollWait(0x7fe8b3fbfc80, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc20031d620, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc20031d5a0, 0xc2007c0000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2004a4f20, 0xc2007c0000, 0x1000, 0x1000, 0x7fe8b3febae0, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
net/http.(*switchReader).Read(0xc20053fea8, 0xc2007c0000, 0x1000, 0x1000, 0x49475e, ...)
    /home/rog/go/src/pkg/net/http/server.go:192 +0x91
io.(*LimitedReader).Read(0xc2009bcf00, 0xc2007c0000, 0x1000, 0x1000, 0x0, ...)
    /home/rog/go/src/pkg/io/io.go:382 +0xc0
bufio.(*Reader).fill(0xc2005bad20)
    /home/rog/go/src/pkg/bufio/bufio.go:79 +0x10c
bufio.(*Reader).ReadSlice(0xc2005bad20, 0x10000000a, 0x0, 0x0, 0x0, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:262 +0x202
bufio.(*Reader).ReadLine(0xc2005bad20, 0x0, 0x0, 0x0, 0xc2007b0800, ...)
    /home/rog/go/src/pkg/bufio/bufio.go:293 +0x61
net/textproto.(*Reader).readLineSlice(0xc200bee030, 0xe00000bf3, 0x41fb4f, 0xc20079c5b0, 0x993aa0, ...)
    /home/rog/go/src/pkg/net/textproto/reader.go:55 +0x51
net/textproto.(*Reader).ReadLine(0xc200bee030, 0xc20079c5b0, 0xc2007c1000, 0x99f740, 0x58cf98, ...)
    /home/rog/go/src/pkg/net/textproto/reader.go:36 +0x25
net/http.ReadRequest(0xc2005bad20, 0xc20079c5b0, 0x0, 0x0)
    /home/rog/go/src/pkg/net/http/request.go:478 +0xa7
net/http.(*conn).readRequest(0xc20053fe80, 0x0, 0x0, 0x0)
    /home/rog/go/src/pkg/net/http/server.go:435 +0x1bc
net/http.(*conn).serve(0xc20053fe80)
    /home/rog/go/src/pkg/net/http/server.go:829 +0x398
created by net/http.(*Server).Serve
    /home/rog/go/src/pkg/net/http/server.go:1341 +0x266

goroutine 3886 [sleep]:
time.Sleep(0x1dcd6500)
    /home/rog/go/src/pkg/runtime/ztime_linux_amd64.c:19 +0x2f
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200a14ee0)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:291 +0x1ec
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf

goroutine 3889 [sleep]:
time.Sleep(0x1dcd6500)
    /home/rog/go/src/pkg/runtime/ztime_linux_amd64.c:19 +0x2f
labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc200a141c0)
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:291 +0x1ec
created by labix.org/v2/mgo.newCluster
    /home/rog/src/go/src/labix.org/v2/mgo/cluster.go:66 +0xdf

goroutine 3349 [IO wait]:
net.runtime_pollWait(0x7fe8ccdb9b40, 0x72, 0x0)
    /home/rog/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
net.(*pollDesc).WaitRead(0xc2006f4c50, 0xb, 0xc20015d9f0)
    /home/rog/go/src/pkg/net/fd_poll_runtime.go:75 +0x31
net.(*netFD).Read(0xc2006f4bd0, 0xc200518800, 0x400, 0x400, 0x0, ...)
    /home/rog/go/src/pkg/net/fd_unix.go:192 +0x2b3
net.(*conn).Read(0xc2003d8570, 0xc200518800, 0x400, 0x400, 0x4896dd, ...)
    /home/rog/go/src/pkg/net/net.go:123 +0xc3
crypto/tls.(*block).readFromUntil(0xc200495930, 0xc20026a0c0, 0xc2003d8570, 0x5, 0xc2003d8570, ...)
    /home/rog/go/src/pkg/crypto/tls/conn.go:401 +0xbd
crypto/tls.(*Conn).readRecord(0xc200457c80, 0x17, 0x0, 0x4089f6)
    /home/rog/go/src/pkg/crypto/tls/conn.go:481 +0xfa
crypto/tls.(*Conn).Read(0xc200457c80, 0xc2009b10f0, 0x24, 0x24, 0x0, ...)
    /home/rog/go/src/pkg/crypto/tls/conn.go:796 +0x102
labix.org/v2/mgo.fill(0xc2001e8de0, 0xc200457c80, 0xc2009b10f0, 0x24, 0x24, ...)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:376 +0x59
labix.org/v2/mgo.(*mongoSocket).readLoop(0xc2004fd750)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:393 +0x113
created by labix.org/v2/mgo.newSocket
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:111 +0x24e

rax     0x0
rbx     0xc2007255a0
rcx     0xaa6fa9f0e9525c41
rdx     0x0
rdi     0x0
rsi     0x3
rbp     0x0
rsp     0x7fe8c77fd9c0
r8      0x7fe8c77fdab8
r9      0x7fe8c77fda98
r10     0xc20027e830
r11     0xc20027e770
r12     0x8ee800
r13     0x7d
r14     0xc2003d8268
r15     0xc20015d000
rip     0x4090e2
rflags  0x10286
cs      0x33
fs      0x0
gs      0x0
Stupidly, I didn't save the binary. I rebuilt it with yesterday's revisions, and I *think* I've managed to get the same thing, but no guarantees. This is what I see:

(gdb) disass 0x4090e2
Dump of assembler code for function hash_gciter_next:
   0x0000000000408eb0 <+0>:   mov    %fs:0xfffffffffffffff0,%rcx
   0x0000000000408eb9 <+9>:   cmp    (%rcx),%rsp
   0x0000000000408ebc <+12>:  ja     0x408ec3 <hash_gciter_next+19>
   0x0000000000408ebe <+14>:  callq  0x423860 <runtime.morestack16>
   0x0000000000408ec3 <+19>:  sub    $0x48,%rsp
   0x0000000000408ec7 <+23>:  mov    0x58(%rsp),%rax
   0x0000000000408ecc <+28>:  mov    0x50(%rsp),%r8
   0x0000000000408ed1 <+33>:  mov    (%r8),%rbx
   0x0000000000408ed4 <+36>:  mov    0x10(%r8),%rdi
   0x0000000000408ed8 <+40>:  mov    0x18(%r8),%rbp
   0x0000000000408edc <+44>:  mov    0x20(%r8),%rsi
   0x0000000000408ee0 <+48>:  movq   $0x0,(%rax)
   0x0000000000408ee7 <+55>:  movq   $0x0,0x8(%rax)
   0x0000000000408eef <+63>:  movq   $0x0,0x10(%rax)
   0x0000000000408ef7 <+71>:  mov    %rax,%r9
   0x0000000000408efa <+74>:  movzbl 0x9(%rbx),%ecx
   0x0000000000408efe <+78>:  and    $0x1,%ecx
   0x0000000000408f01 <+81>:  cmp    $0x0,%ecx
   0x0000000000408f04 <+84>:  je     0x409243 <hash_gciter_next+915>
   etc
Comment 6 by tylor@torbit.com:
I don't think the recent changes have fixed the issue. We were seeing lots of random SIGSEGVs on 19c06a3356f8, so we updated to 457cc24872ec, and the situation did not improve. Our actual stack dumps are just as uninteresting as Roger's above. Like Roger's dump, we have many goroutines waiting on /runtime/znetpoll_linux_amd64.c:118, but I think that is just a popular place to block. Our application does not use cgo.
I've been getting a lot of crashes in cbfs, but only on FreeBSD (and seemingly only at home, though that cluster gets more use): http://cbugg.hq.couchbase.com/bug/bug-276

I think it's been fully crashing a bit more reliably lately. It used to kind of "soft crash": the process would stay alive, but important goroutines would fall apart, causing the cluster to reject this node.
Comment 9 by tylor@torbit.com:
>> Do you use "unsafe"?

Yes we do :) although our unsafe code is pretty simple and we have used it frequently in many applications. We also use "github.com/edsrzf/mmap-go". I'm reviewing our unsafe and mmap code right now. We will also try running a node on a build that predates the hash map changes.
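For reference, the reason mmap code is worth auditing here is that a stale mapping is a classic source of SIGSEGVs that look like runtime crashes: any slice still referencing the mapping after Unmap faults on first touch. This is only a minimal sketch of the general pattern being reviewed (the file name and accesses are placeholders, not our actual code):

    package main

    import (
        "fmt"
        "os"

        mmap "github.com/edsrzf/mmap-go"
    )

    func main() {
        f, err := os.Open("data.bin") // placeholder file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Map returns a []byte backed directly by the file's pages.
        m, err := mmap.Map(f, mmap.RDONLY, 0)
        if err != nil {
            panic(err)
        }
        fmt.Println(len(m))

        // After Unmap, any retained reference to m (or a subslice of it)
        // faults with SIGSEGV on first touch, and the Go runtime reports
        // that fault much like a crash of its own.
        if err := m.Unmap(); err != nil {
            panic(err)
        }
        // _ = m[0] // would fault: the pages are gone
    }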
Might be related to a crash I'm seeing with code in go.exp: https://golang.org/issue/4979?c=25
Try syncing to pick up https://code.google.com/p/go/source/detail?r=af469280a34b and then try again. This patch fixed a bug in multithreaded map reads.
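The pattern that fix covers is concurrent read-only access to a map, which the language permits (many readers, no writers). A hypothetical stress sketch of that pattern, which should never fault on a fixed runtime, looks like this:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        m := make(map[int]int)
        for i := 0; i < 1<<16; i++ {
            m[i] = i // all writes happen before any concurrent reads
        }

        var wg sync.WaitGroup
        for g := 0; g < 8; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                sum := 0
                for k, v := range m { // read-only iteration from many goroutines
                    sum += k + v
                }
                fmt.Println(sum)
            }()
        }
        wg.Wait()
    }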