golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.08k stars 17.68k forks source link

all: dubious looking panic #5193

Closed rogpeppe closed 9 years ago

rogpeppe commented 11 years ago
I just saw this panic in 672069179b80 tip.

It doesn't look possible, which makes me suspect
a compiler or runtime bug.

The line it's panicking on is the second line of:

    for width := 0; len(s) > 0; s = s[width:] {
        r := rune(s[0])

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x1 pc=0x49c163]

goroutine 2633 [running]:
strconv.quoteWith(0x1, 0x8, 0x22, 0x0, 0x9344a0, ...)
    /home/rog/go/src/pkg/strconv/quote.go:18 +0x163
strconv.Quote(0x1, 0x8, 0x8, 0x1)
    /home/rog/go/src/pkg/strconv/quote.go:91 +0x39
fmt.(*fmt).fmt_q(0xc2006ae6d8, 0x1, 0x8)
    /home/rog/go/src/pkg/fmt/format.go:322 +0x126
fmt.(*pp).fmtString(0xc2006ae680, 0x1, 0x8, 0x100000076)
    /home/rog/go/src/pkg/fmt/print.go:531 +0xd9
fmt.(*pp).printReflectValue(0xc2006ae680, 0x83fa60, 0xc200472360, 0x183, 0x10000000076,
...)
    /home/rog/go/src/pkg/fmt/print.go:879 +0x2e32
strconv.quoteWith(0x1, 0x8, 0x22, 0x0, 0x9344a0, ...)
    /home/rog/go/src/pkg/strconv/quote.go:18 +0x163
strconv.Quote(0x1, 0x8, 0x8, 0x1)
    /home/rog/go/src/pkg/strconv/quote.go:91 +0x39
fmt.(*fmt).fmt_q(0xc2006ae6d8, 0x1, 0x8)
    /home/rog/go/src/pkg/fmt/format.go:322 +0x126
fmt.(*pp).fmtString(0xc2006ae680, 0x1, 0x8, 0x100000076)
    /home/rog/go/src/pkg/fmt/print.go:531 +0xd9
fmt.(*pp).printReflectValue(0xc2006ae680, 0x83fa60, 0xc200472360, 0x183, 0x10000000076,
...)
    /home/rog/go/src/pkg/fmt/print.go:879 +0x2e32
fmt.(*pp).printValue(0xc2006ae680, 0x83fa60, 0xc200472360, 0x183, 0x10000000076, ...)
    /home/rog/go/src/pkg/fmt/print.go:850 +0x2f5
fmt.(*pp).printReflectValue(0xc2006ae680, 0x96c640, 0xc200472360, 0x193, 0x10000000076,
...)
    /home/rog/go/src/pkg/fmt/print.go:930 +0x325b
fmt.(*pp).printValue(0xc2006ae680, 0x96c640, 0xc200472360, 0x193, 0x10000000076, ...)
    /home/rog/go/src/pkg/fmt/print.go:850 +0x2f5
fmt.(*pp).printReflectValue(0xc2006ae680, 0x938480, 0xc200472360, 0x192, 0x10000000076,
...)
    /home/rog/go/src/pkg/fmt/print.go:930 +0x325b
fmt.(*pp).printField(0xc2006ae680, 0x938480, 0xc200472360, 0x10000000076, 0x0, ...)
    /home/rog/go/src/pkg/fmt/print.go:812 +0x3d1
fmt.(*pp).doPrintf(0xc2006ae680, 0xc200472390, 0x25, 0x7ff0f6599c68, 0x1, ...)
    /home/rog/go/src/pkg/fmt/print.go:1110 +0x1270
fmt.Sprintf(0xc200472390, 0x25, 0x7ff0f6599c68, 0x1, 0x1, ...)
    /home/rog/go/src/pkg/fmt/print.go:228 +0x65
launchpad.net/juju-core/log.logf(0xc200472390, 0x25, 0x7ff0f6599c68, 0x1, 0x1, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/log/log.go:40 +0x74
launchpad.net/juju-core/log.Debugf(0xa67a30, 0x1f, 0x7ff0f6599c68, 0x1, 0x1, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/log/log.go:68 +0xaa
launchpad.net/juju-core/state/watcher.(*Watcher).handle(0xc200724dc0, 0x938480,
0xc200472360)
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:278 +0xaa
launchpad.net/juju-core/state/watcher.(*Watcher).loop(0xc200724dc0, 0xc20012f000,
0xc2004046d0)
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:231 +0x242
launchpad.net/juju-core/state/watcher.func·001()
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:111 +0x2c
created by launchpad.net/juju-core/state/watcher.New
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher/watcher.go:113 +0x100
rogpeppe commented 11 years ago

Comment 1:

I'm seeing quite a few other similar panics and errors. It looks like some kind of
memory corruption. Here's another example. I haven't narrowed down the example yet, but
this code has been stable for a while, and it passes in other Go revisions. I will try
to bisect tomorrow.
Here's another example panic.
goroutine 1478 [running]:
sync/atomic.CompareAndSwapUint32()
    /home/rog/go/src/pkg/sync/atomic/asm_amd64.s:14 +0xd
sync.(*Mutex).Lock(0xa)
    /home/rog/go/src/pkg/sync/mutex.go:43 +0x35
labix.org/v2/mgo.(*mongoSocket).Acquire(0xa, 0x0)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:151 +0x2a
labix.org/v2/mgo.(*Session).acquireSocket(0xc2006fa870, 0xc200730a01, 0x0, 0x0, 0x0, ...)
    /home/rog/src/go/src/labix.org/v2/mgo/session.go:2903 +0xce
labix.org/v2/mgo.(*Query).One(0xc20052cd90, 0x830400, 0xc2005d5de0, 0x0, 0x0, ...)
    /home/rog/src/go/src/labix.org/v2/mgo/session.go:1961 +0xae
launchpad.net/juju-core/state.(*Settings).Read(0xc2005d5d50, 0x0, 0xc2005d5db0)
    /home/rog/src/go/src/launchpad.net/juju-core/state/settings.go:193 +0xb2
launchpad.net/juju-core/state.readSettings(0xc200389460, 0xc200290a00, 0x19,
0x17d9f5f40, 0x3000421850, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/state/settings.go:212 +0xa8
launchpad.net/juju-core/state.(*settingsWatcher).loop(0xc20039a700, 0xc200290a00, 0x19,
0x0, 0x0, ...)
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:979 +0x7d
launchpad.net/juju-core/state.func·022()
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:965 +0x8e
created by launchpad.net/juju-core/state.newSettingsWatcher
    /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:966 +0xd3
bradfitz commented 11 years ago

Comment 2:

Is the code or a minimal repro available somewhere so we can help you bisect?

Labels changed: added go1.1, removed priority-triage.

Status changed to Accepted.

gopherbot commented 11 years ago

Comment 3 by tylor@torbit.com:

Could this be the same issue as 5091 ("runtime: unlikely looking panic in append")?
https://golang.org/issue/5091
I've been getting weird SIGSEGVs in inconsistent places that I have been unsuccessful in
reproducing on demand. Here is a recent example:
https://groups.google.com/forum/?fromgroups=#!topic/golang-nuts/uMUmCjMCBkU
ianlancetaylor commented 11 years ago

Comment 4:

What system types is this problem occurring on?

Labels changed: added priority-asap.

gopherbot commented 11 years ago

Comment 5 by tylor@torbit.com:

The issues I have seen are with long running apps running Go tip/Linux 3.4/amd64.
bradfitz commented 11 years ago

Comment 6:

Tylor, right now we have no theories as to the cause.  If you have repro code we could
run or if you could bisect to find a good->bad range would be useful.  To clarify, which
is the last revision that crashed for you?  (Tip has been moving so fast the past few
weeks that saying "tip" is nearly useless for debugging this)
Thanks!
davecheney commented 11 years ago

Comment 7:

responding to #1
goroutine 1478 [running]:
sync/atomic.CompareAndSwapUint32()
    /home/rog/go/src/pkg/sync/atomic/asm_amd64.s:14 +0xd
sync.(*Mutex).Lock(0xa)
    /home/rog/go/src/pkg/sync/mutex.go:43 +0x35
labix.org/v2/mgo.(*mongoSocket).Acquire(0xa, 0x0)
    /home/rog/src/go/src/labix.org/v2/mgo/socket.go:151 +0x2a
labix.org/v2/mgo.(*Session).acquireSocket(0xc2006fa870, 0xc200730a01, 0x0, 0x0, 0x0, ...)
This looks like *mongoSocket was nil.
dvyukov commented 11 years ago

Comment 8:

The 0x1 is what GC writes into second word of freed blocks. We've seen such crashes
before due to bugs in GC.
But I don't know why that 0x1 is on the stack...
gopherbot commented 11 years ago

Comment 9 by voidlogic7:

Brad, various tip builds as recent as 3bc1ecb89e30.
I have some changes that pending that once deployed should allow me to better debugging,
but I have attached some panics in hope that they help. Until then I eagerly await
Rogers bisect. Thanks!

Attachments:

  1. panic1.txt (2982 bytes)
  2. panic2.txt (3770 bytes)
  3. panic3.txt (860 bytes)
rogpeppe commented 11 years ago

Comment 10:

Progress report:
I've been bisecting away and unfortunately the best I've done so far is that the problem
(well, a problem anyway, as the symptom is variable) was introduced between revisions
15910 and 16052. In between those revisions I can't build our code because of the bug
that was introduced in 15910.
I will try to bisect the revisions in between by updating and then manually applying a
16053.
I suspect one of the GC-related changes, possibly 15945:37d391191e3f
alberts commented 11 years ago

Comment 11:

Maybe this helps, but maybe you know this already: if your build was only broken for a
few revisions in that range, you can exit the bisect script with value 125 to tell
Mercurial to just skip over it.
rogpeppe commented 11 years ago

Comment 12:

it was broken for all the revisions in that range. it's ok, applying the patch is
working, so the bisect continues...
rogpeppe commented 11 years ago

Comment 13:

i've narrowed it down to revision 16008 (the new scheduler). it's quite possible that
this particular bisection has found a bug in our own code, not the go runtime.
i will try bisecting a different revision of our own code to try to verify that.
remyoudompheng commented 11 years ago

Comment 14:

Did you run the code under the race detector?
rogpeppe commented 11 years ago

Comment 15:

thanks remy for that good suggestion, which is something i should have done to start
with! we do indeed have a data race going on; i will fix it and see if we still have a
problem.
rogpeppe commented 11 years ago

Comment 16:

the problem was a race in a dependency. it had been fixed but i had not pulled it. i can
no longer reproduce the symptoms, so i am marking this issue Invalid.
sorry for the noise folks.
some day we'll have a good way to get reproducible builds, i hope.

Status changed to Invalid.

rogpeppe commented 11 years ago

Comment 17:

hmm, I'm not so sure any more. at the exact moment i clicked "Save Changes" then, my
looping test runner died after several successful runs with this stack trace.
http://paste.ubuntu.com/5676399/
here is the disassembly of scanblock, the function it died inside:
http://paste.ubuntu.com/5676404/
re-opening until i can decide that this is not an issue.

Status changed to Accepted.

rogpeppe commented 11 years ago

Comment 18:

here's another one: http://paste.ubuntu.com/5676861/
go version e77430da3316
gopherbot commented 11 years ago

Comment 19 by tylor@torbit.com:

Here is an example of a common crash I am seeing. The GOB encoder object in question is
local to the crashing goroutine, so something like a race condition seems very unlikely.
This crash has occurred with tip: 3bc1ecb89e30, bb8b458a479c and probably others. I have
been unable to reproduce it with simple programs. The problem appears to be on line 35
of encoding/gob/encode.go and involve the free list:
    29  func (enc *Encoder) newEncoderState(b *bytes.Buffer) *encoderState {
    30      e := enc.freeList
    31      if e == nil {
    32          e = new(encoderState)
    33          e.enc = enc
    34      } else {
    35          enc.freeList = e.next
    36      }
    37      e.sendZero = false
    38      e.fieldnum = 0
    39      e.b = b
    40      return e
    41  }
Here is the actual stack trace:
 panic: runtime error: invalid memory address or nil pointer dereference [recovered]
       panic: runtime error: invalid memory address or nil pointer dereference
 [signal 0xb code=0x1 addr=0x40 pc=0x5a5764]
 goroutine 99545 [running]:
 encoding/gob.catchError(0xc2004453a8)
       /home/user/go/src/pkg/encoding/gob/error.go:38 +0x8d
 encoding/gob.(*Encoder).newEncoderState(0xc2000d20f0, 0xc200445338, 0x18)
       /home/user/go/src/pkg/encoding/gob/encode.go:35 +0x64
 encoding/gob.(*Encoder).encodeArray(0xc2000d20f0, 0xc200445338, 0xc200625700, 0x8267e0, 0x10, ...)
       /home/user/go/src/pkg/encoding/gob/encode.go:377 +0x34
 encoding/gob.func·013(0x0, 0xc200617910, 0xc2004a80a0)
       /home/user/go/src/pkg/encoding/gob/encode.go:585 +0xf2
 encoding/gob.encodeReflectValue(0xc200617910, 0x682280, 0xc2004a8080, 0x172, 0xc20014ad00, ...)
       /home/user/go/src/pkg/encoding/gob/encode.go:404 +0x105
 encoding/gob.(*Encoder).encodeMap(0xc2006b7a80, 0xc200445338, 0x762260, 0xc200aba450, 0x156, ...)
       /home/user/go/src/pkg/encoding/gob/encode.go:418 +0x1e0
 encoding/gob.func·015(0xc2001d92a0, 0xc2006b7940, 0xc200aba450)
       /home/user/go/src/pkg/encoding/gob/encode.go:609 +0x18e
 encoding/gob.(*Encoder).encodeStruct(0xc200445300, 0xc200445338, 0xc20014acc0, 0xc200aba410)
       /home/user/go/src/pkg/encoding/gob/encode.go:370 +0x118
 encoding/gob.func·016(0xc2001cc700, 0xc2006b7900, 0xc200aba410)
       /home/user/go/src/pkg/encoding/gob/encode.go:618 +0x6d
 encoding/gob.(*Encoder).encodeStruct(0xc200445300, 0xc200445338, 0xc20014ac80, 0xc200aba410)
       /home/user/go/src/pkg/encoding/gob/encode.go:370 +0x118
 encoding/gob.(*Encoder).encode(0xc200445300, 0xc200445338, 0x736320, 0xc200aba410, 0x196, ...)
       /home/user/go/src/pkg/encoding/gob/encode.go:746 +0x182
 encoding/gob.(*Encoder).EncodeValue(0xc200445300, 0x70ce40, 0xc200aba410, 0x160, 0x0, ...)
       /home/user/go/src/pkg/encoding/gob/encoder.go:252 +0x387
 encoding/gob.(*Encoder).Encode(0xc200445300, 0x70ce40, 0xc200aba410, 0x1, 0x0, ...)
       /home/user/go/src/pkg/encoding/gob/encoder.go:177 +0x5b
 xxx/yyy.(*GobEndpointClient).requestStrmWorker(0xc2005a71e0)
       /home/user/code/go/src/xxx/yyy/GobEndpointClient.go:101 +0x141
 created by xxx/yyy.NewGobEndpointClient
       /home/user/code/go/src/xxx/yyy/GobEndpointClient.go:46 +0xf1
.....
Please let me know if there is any additional information I can supply or steps I should
take. (Or if this should be a separate issue) Thanks!
bradfitz commented 11 years ago

Comment 20:

rog's one from comment #18 starts with a GC panic:
panic: invalid memory address or nil pointer dereference
fatal error: panic during gc
[signal 0xb code=0x1 addr=0x2f pc=0x42a693]
goroutine 3442 [running]:
[fp=0xc2009be5c8] runtime.throw(0xfa3b9c)
    /home/rog/go-release/src/pkg/runtime/panic.c:473 +0x67
[fp=0xc2009be608] runtime.panicstring(0xfa3908)
    /home/rog/go-release/src/pkg/runtime/panic.c:484 +0x51
[fp=0xc2009be620] runtime.sigpanic()
    /home/rog/go-release/src/pkg/runtime/os_linux.c:236 +0x168
[fp=0xc2009be9c0] scanblock(0x7f1e1a678000, 0x7f1e1a6781f8, 0x14, 0xc2009be900)
    /home/rog/go-release/src/pkg/runtime/mgc0.c:793 +0xa93
[fp=0xc2009bea10] markroot(0xc20010b480, 0x1000000010)
    /home/rog/go-release/src/pkg/runtime/mgc0.c:1199 +0xab
[fp=0xc2009bea88] runtime.parfordo(0xc20010b480)
    /home/rog/go-release/src/pkg/runtime/parfor.c:105 +0x9b
[fp=0xc2009bebb8] gc(0x7f1e18e4b354)
    /home/rog/go-release/src/pkg/runtime/mgc0.c:1930 +0x29d
----- stack segment boundary -----
[fp=0x7f1e18e4b368] runtime.gc(0xc200000000)
    /home/rog/go-release/src/pkg/runtime/mgc0.c:1857 +0x11b
[fp=0x7f1e18e4b3c0] runtime.mallocgc(0x1000, 0x100000001, 0x600000001)
    /home/rog/go-release/src/pkg/runtime/zmalloc_linux_amd64.c:101 +0x1e4
[fp=0x7f1e18e4b3f8] makeslice1(0x82a780, 0x1000, 0x1000, 0x7f1e18e4b440)
    /home/rog/go-release/src/pkg/runtime/slice.c:61 +0x89
[fp=0x7f1e18e4b428] runtime.makeslice(0x82a780, 0x1000, 0x1000, 0xc2004304b0, 0x1000,
...)
    /home/rog/go-release/src/pkg/runtime/slice.c:34 +0x9a
[fp=0x7f1e18e4b4d8] os.(*File).readdirnames(0xc200389a68, 0x1, 0x0, 0x0, 0x0, ...)
    /home/rog/go-release/src/pkg/os/dir_unix.go:23 +0xb8
rogpeppe commented 11 years ago

Comment 21:

I spent some time changing the code so that it no longer uses unsafe (it calls the old
unsafe code through an rpc interface), and I managed to trigger a panic again.
I saw the same panic as in reply #1 above.
@dave cheney: the mongoSocket isn't nil, it really is 0xa.
I put this check at the start of the function:
    if socket == nil {
        panic("nil socket")
    }
    if uintptr(unsafe.Pointer(socket)) < 0x100 {
        panic(fmt.Errorf("socket is %p", socket))
    }
and the second panic was triggered, not the first.
rogpeppe commented 11 years ago

Comment 23:

some more panics for the collection, all in code with cgo excised, GOMAXPROCS=1, go
version e77430da3316, ubuntu 12.04, amd64. The tests pass with the race detector enabled.
http://paste.ubuntu.com/5679089/
http://paste.ubuntu.com/5679077/ (these are the same as the panic in comments #1 and #21
above)
http://paste.ubuntu.com/5679072/
http://paste.ubuntu.com/5679084/
http://paste.ubuntu.com/5679086/
http://paste.ubuntu.com/5679090/
rogpeppe commented 11 years ago

Comment 24:

And another one I haven't seen before:
http://paste.ubuntu.com/5679671/
rogpeppe commented 11 years ago

Comment 25:

Unfortunately I've no more time to spend on this issue.
If anyone else wants to try to reproduce the symptoms, you'll need to be using Ubuntu on
amd64. You can get the Go code with "go get launchpad.net/juju-core/worker/uniter".
The tests will not run unless you have an SSL-capable mongod binary in your path.
You can grab one from
http://juju-dist.s3.amazonaws.com/tools/mongo-2.2.0-precise-amd64.tgz . The sha1sum of
that tar archive is 608f4a5b84b8a4cfab8484aa90976f051382b957.
Once you've done that, "go test launchpad.net/juju-core/worker/uniter" should reproduce
the problem about one time in four. If you want to run it with the race detector, you'll
need to raise the test timeout (I used 2h though that's overkill)
I'm sorry I haven't managed to reduce the amount of code involved.
For now, I'm reverting to Go revision 15909.
DanielMorsing commented 11 years ago

Comment 26:

Do you have a patch for excising cgo from the code? I'm trying to inspect the binary
under gdb, and the host linking mode is messing up the python script.
rogpeppe commented 11 years ago

Comment 27:

@daniel to excise cgo from the code, this sequence should work (assuming a
single-element GOPATH):
cd $GOPATH/launchpad.net/juju-core/worker/uniter
bzr pull --overwrite lp:~rogpeppe/juju-core/268-no-cgo
mv $GOPATH/launchpad.net/goyaml $GOPATH/launchpad.net/goyaml.orig
cd $GOPATH/launchpad.net
bzr branch lp:~rogpeppe/+junk/yamlserver goyaml
rm -r $GOPATH/pkg     # stop the go tool from seeing the old goyaml archives
cd goyaml
go run server.go &
go test
please let me know if you have problems - i haven't tested the above.
remember to move goyaml.orig back to goyaml when you're done :-)
rogpeppe commented 11 years ago

Comment 28:

for the record, i saw all the tracebacks in comment #23 above with no cgo and with the
potential race condition in mgo fixed as Dmitry suggested elsewhere.
DanielMorsing commented 11 years ago

Comment 29:

With the cgo excised binary and GOTRACEBACK=crash, i've managed to snag a core dump of
the failing process.
interestingly, there are a couple of spots where the args have been marked 0xdead and
they all happen to be strings.
DanielMorsing commented 11 years ago

Comment 30:

Here's a little gdb session that should shed some light on the problem.
(gdb) bt
#0  runtime.raise (sig=void)
    at /home/daniel/src/go/src/pkg/runtime/sys_linux_amd64.s:85
 /* uninteresting frames */
#8  0x0000000000497d89 in fmt.(*pp).fmtString (p=0xc20043e4e0, v=
    <error reading variable: Address 0xdeaddeaddeaddead out of bounds>, verb=
    118, goSyntax=true) at /home/daniel/src/go/src/pkg/fmt/print.go:531
#9  0x000000000049d052 in fmt.(*pp).printReflectValue (p= []uint8 = {...}, 
    value=..., verb=118, plus=false, goSyntax=true, depth=2, wasString=false)
    at /home/daniel/src/go/src/pkg/fmt/print.go:879
 /* a bunch of uninteresting frames */
(gdb) f 9
#9  0x000000000049d052 in fmt.(*pp).printReflectValue (p= []uint8 = {...}, 
    value=..., verb=118, plus=false, goSyntax=true, depth=2, wasString=false)
    at /home/daniel/src/go/src/pkg/fmt/print.go:879
879         p.fmtString(f.String(), verb, goSyntax)
(gdb) whatis f
type = reflect.StructField
/* actual type is reflect.Value */
(gdb) info locals f
Python Exception <type 'exceptions.OverflowError'> signed integer is greater than
maximum: 
Python Exception <type 'exceptions.OverflowError'> signed integer is greater than
maximum: 
f = {Name = , PkgPath = , Type = {tab = 0xc20043e4f0, data = 0xa1}, Tag = {
    str = 0x17 <Address 0x17 out of bounds>, len = 2}, Offset = 833223655425, 
  Index =  []int, Anonymous = false}
(gdb) x/2xg &f
0x7fa5700fcb08: 0x0000000000818c00  0x000000c2005730c0
(gdb) x/2xg 0x000000c2005730c0
0xc2005730c0:   0xdeaddeaddeaddead  0x0000000000000008
(gdb)
Seems like the pointer in the interface isn't seen as a pointer by the GC and the block
gets collected.
gopherbot commented 11 years ago

Comment 31:

I created a patch that may (or may not) fix this issue:
https://golang.org/cl/8363045/
Please test.
The patch hasn't been submitted for review to golang-dev yet.
DanielMorsing commented 11 years ago

Comment 32:

Applied the patch. Still getting crashes
DanielMorsing commented 11 years ago

Comment 33:

Here's another gdb session
(gdb) bt
 /* panic frames that happen because of dereferencing 0xa
#7  0x000000000048ae35 in sync.(*Mutex).Lock (m=0xa)
    at /home/daniel/src/go/src/pkg/sync/mutex.go:43
#8  0x00000000005f25fa in labix.org/v2/mgo.(*mongoSocket).Acquire (socket=
    0xa, isMaster=false)
    at /home/daniel/src/go_prj/src/labix.org/v2/mgo/socket.go:151
#9  0x00000000005f168e in labix.org/v2/mgo.(*Session).acquireSocket (s=
    0xc20023f120, slaveOk=true, ~anon1=0x0, ~anon2=...)
    at /home/daniel/src/go_prj/src/labix.org/v2/mgo/session.go:2902
 /* uninteresting frames */
(gdb) f 9
#9  0x00000000005f168e in labix.org/v2/mgo.(*Session).acquireSocket (s=
    0xc20023f120, slaveOk=true, ~anon1=0x0, ~anon2=...)
    at /home/daniel/src/go_prj/src/labix.org/v2/mgo/session.go:2902
2902            s.slaveSocket.Acquire()
(gdb) whatis s
type = struct labix.org/v2/mgo.Session *
(gdb) p s
$1 = (struct labix.org/v2/mgo.Session *) 0xc20023f120
(gdb) x/20xg 0xc20023f120
0xc20023f120:   0x000000c20023f150  0xdeaddeaddeaddead
/* first two words here is an embedded mutex.*/
0xc20023f130:   0x0000000000000005  0x000000c2002d74b0
/* mongocluster pointer and slaveSocket pointer */
0xc20023f140:   0x000000000000000a  0x0000000000000000
/* rest of memory dump for context */
0xc20023f150:   0x000000c20023f180  0xdeaddeaddeaddead
0xc20023f160:   0x0000000000000008  0x000000c20021e540
0xc20023f170:   0x0000000000000011  0x0000000000000000
0xc20023f180:   0x000000c20023f2a0  0xdeaddeaddeaddead
0xc20023f190:   0x0000000000000008  0x000000c2002d74c0
0xc20023f1a0:   0x000000000000000d  0x0000000000000000
0xc20023f1b0:   0x000000c20021e2e0  0x00000000009cf3d0
(gdb) 
Some very weird values happening here.
DanielMorsing commented 11 years ago

Comment 34:

Messed up my analysis of that memory dump. The first 3 quadwords are an RWMutex, 4th is
pointer to mongocluster, 5th is pointer to slavesocket.
The thing I just realized is that the dump means that regular pointers on the stack are
being collected, which points the search away from a potential bug in the GC bytecode.
AFAIK, the GC still treats values on the stack as arbritary pointers, so the only path
through the GC bytecode is going to be defaultProg.
gopherbot commented 11 years ago

Comment 35:

Responding to #34:
There is no relation between the contents of the top of stack at the time of a crash
(accessing an invalid memory address) and the top of stack as seen by GC, unless the
crash happens in runtime·gc() or in a function called from runtime·gc().
DanielMorsing commented 11 years ago

Comment 36:

That's true, but I have a hard time figuring out how else a pointer on the stack gets
pointed to a GC freed MLink block.
I'll try and get some dumps of a crash in the GC.
DanielMorsing commented 11 years ago

Comment 37:

Another annotated gdb session, this time with the fault happening while GCing.
Loading Go Runtime support.
(gdb) bt
#0  runtime.raise (sig=void)
    at /home/daniel/src/go/src/pkg/runtime/sys_linux_amd64.s:85
/* A couple of frames for the dereference panic */
#5  0x000000000040ea1f in scanblock (wbuf=void, wp=void, nobj=void, 
    keepworking=void) at /home/daniel/src/go/src/pkg/runtime/mgc0.c:763
#6  0x0000000000410ad5 in gc (args=void)
    at /home/daniel/src/go/src/pkg/runtime/mgc0.c:1931
/* reflect.Call through here, can't go any further */
(gdb) f 5
#5  0x000000000040ea1f in scanblock (wbuf=void, wp=void, nobj=void, 
    keepworking=void) at /home/daniel/src/go/src/pkg/runtime/mgc0.c:763
763             if(eface->data >= arena_start && eface->data < arena_used) {
/* this isn't the line it fails on. it fails when trying to access the type pointer on
the next line */ 
(gdb) p &eface
/* address of eface pointer on the stack */
$2 = ( *) 0xc2003c47d8
(gdb) x/1gx 0xc2003c47d8
/* address of the eface */
0xc2003c47d8:   0x000000c200649ac0
(gdb) x/2gx 0x000000c200649ac0
/* memory that makes up the eface. the first word is the type pointer that causes the
panic */
0xc200649ac0:   0x0000000000006d00  0x000000c2002bb378
(gdb) 
If anyone has some bright ideas for things to look for, I'd be happy to investigate them.
gopherbot commented 11 years ago

Comment 38:

It may be useful to know which part of the program allocated the memory block
0xc2003c47d8. The start address of this memory block is stored in the variable 'b' in
scanblock().
DanielMorsing commented 11 years ago

Comment 39:

0xc2003c47d8 would be the stack segment running GC. The location of the Eface is
0x000000c200649ac0.
Anyway, the value of b is 0x000000c200649a20. Doing some poking around, and I can't find
anything out of place there.
If anyone has a place to store a 345 mb core file, I'd be happy to upload it.
alberts commented 11 years ago

Comment 40:

Uploading to and sharing it via Google Docs/Drive works pretty well.
DanielMorsing commented 11 years ago

Comment 41:

https://docs.google.com/file/d/0B2YDcoUQXXrUcVdLVlR1LWg2NE0/edit?usp=sharing The core
file
https://docs.google.com/file/d/0B2YDcoUQXXrUR05sS0RTMDFwQ0k/edit?usp=sharing The binary
Compiled with go version devel +1d49ee511d95 Fri Apr 05 15:23:03 2013 -0700 linux/amd64
Not including atoms patch.
DanielMorsing commented 11 years ago

Comment 42:

After Dmitrys patches have landed, I haven't seen a fault after running the test 20
times.
I'm inclined to mark this bug fixed.
adg commented 11 years ago

Comment 43:

We can always re-open it if more failures appear.

Status changed to Fixed.

bradfitz commented 11 years ago

Comment 44:

Issue #5208 has been merged into this issue.

gopherbot commented 11 years ago

Comment 45 by tylor@torbit.com:

The strange panics I was seeing, such as the ones I presented above, have stop
occurring. Dmitry's patches appear to have rectified the issue. Thanks!
daaku commented 11 years ago

Comment 46:

My panics mention in issue #5208 have also stopped occurring. Thanks!