Closed rogpeppe closed 9 years ago
I'm seeing quite a few other similar panics and errors. It looks like some kind of memory corruption. Here's another example. I haven't narrowed down the example yet, but this code has been stable for a while, and it passes in other Go revisions. I will try to bisect tomorrow. Here's another example panic. goroutine 1478 [running]: sync/atomic.CompareAndSwapUint32() /home/rog/go/src/pkg/sync/atomic/asm_amd64.s:14 +0xd sync.(*Mutex).Lock(0xa) /home/rog/go/src/pkg/sync/mutex.go:43 +0x35 labix.org/v2/mgo.(*mongoSocket).Acquire(0xa, 0x0) /home/rog/src/go/src/labix.org/v2/mgo/socket.go:151 +0x2a labix.org/v2/mgo.(*Session).acquireSocket(0xc2006fa870, 0xc200730a01, 0x0, 0x0, 0x0, ...) /home/rog/src/go/src/labix.org/v2/mgo/session.go:2903 +0xce labix.org/v2/mgo.(*Query).One(0xc20052cd90, 0x830400, 0xc2005d5de0, 0x0, 0x0, ...) /home/rog/src/go/src/labix.org/v2/mgo/session.go:1961 +0xae launchpad.net/juju-core/state.(*Settings).Read(0xc2005d5d50, 0x0, 0xc2005d5db0) /home/rog/src/go/src/launchpad.net/juju-core/state/settings.go:193 +0xb2 launchpad.net/juju-core/state.readSettings(0xc200389460, 0xc200290a00, 0x19, 0x17d9f5f40, 0x3000421850, ...) /home/rog/src/go/src/launchpad.net/juju-core/state/settings.go:212 +0xa8 launchpad.net/juju-core/state.(*settingsWatcher).loop(0xc20039a700, 0xc200290a00, 0x19, 0x0, 0x0, ...) /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:979 +0x7d launchpad.net/juju-core/state.func·022() /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:965 +0x8e created by launchpad.net/juju-core/state.newSettingsWatcher /home/rog/src/go/src/launchpad.net/juju-core/state/watcher.go:966 +0xd3
Comment 3 by tylor@torbit.com:
Could this be the same issue as 5091 ("runtime: unlikely looking panic in append")? https://golang.org/issue/5091 I've been getting weird SIGSEGVs in inconsistent places that I have been unsuccessful in reproducing on demand. Here is a recent example: https://groups.google.com/forum/?fromgroups=#!topic/golang-nuts/uMUmCjMCBkU
Tylor, right now we have no theories as to the cause. If you have repro code we could run or if you could bisect to find a good->bad range would be useful. To clarify, which is the last revision that crashed for you? (Tip has been moving so fast the past few weeks that saying "tip" is nearly useless for debugging this) Thanks!
responding to #1 goroutine 1478 [running]: sync/atomic.CompareAndSwapUint32() /home/rog/go/src/pkg/sync/atomic/asm_amd64.s:14 +0xd sync.(*Mutex).Lock(0xa) /home/rog/go/src/pkg/sync/mutex.go:43 +0x35 labix.org/v2/mgo.(*mongoSocket).Acquire(0xa, 0x0) /home/rog/src/go/src/labix.org/v2/mgo/socket.go:151 +0x2a labix.org/v2/mgo.(*Session).acquireSocket(0xc2006fa870, 0xc200730a01, 0x0, 0x0, 0x0, ...) This looks like *mongoSocket was nil.
Brad, various tip builds as recent as 3bc1ecb89e30. I have some changes that pending that once deployed should allow me to better debugging, but I have attached some panics in hope that they help. Until then I eagerly await Rogers bisect. Thanks!
Attachments:
Progress report: I've been bisecting away and unfortunately the best I've done so far is that the problem (well, a problem anyway, as the symptom is variable) was introduced between revisions 15910 and 16052. In between those revisions I can't build our code because of the bug that was introduced in 15910. I will try to bisect the revisions in between by updating and then manually applying a 16053. I suspect one of the GC-related changes, possibly 15945:37d391191e3f
hmm, I'm not so sure any more. at the exact moment i clicked "Save Changes" then, my looping test runner died after several successful runs with this stack trace. http://paste.ubuntu.com/5676399/ here is the disassembly of scanblock, the function it died inside: http://paste.ubuntu.com/5676404/ re-opening until i can decide that this is not an issue.
Status changed to Accepted.
here's another one: http://paste.ubuntu.com/5676861/ go version e77430da3316
Comment 19 by tylor@torbit.com:
Here is an example of a common crash I am seeing. The GOB encoder object in question is local to the crashing goroutine, so something like a race condition seems very unlikely. This crash has occurred with tip: 3bc1ecb89e30, bb8b458a479c and probably others. I have been unable to reproduce it with simple programs. The problem appears to be on line 35 of encoding/gob/encode.go and involve the free list: 29 func (enc *Encoder) newEncoderState(b *bytes.Buffer) *encoderState { 30 e := enc.freeList 31 if e == nil { 32 e = new(encoderState) 33 e.enc = enc 34 } else { 35 enc.freeList = e.next 36 } 37 e.sendZero = false 38 e.fieldnum = 0 39 e.b = b 40 return e 41 } Here is the actual stack trace: panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal 0xb code=0x1 addr=0x40 pc=0x5a5764] goroutine 99545 [running]: encoding/gob.catchError(0xc2004453a8) /home/user/go/src/pkg/encoding/gob/error.go:38 +0x8d encoding/gob.(*Encoder).newEncoderState(0xc2000d20f0, 0xc200445338, 0x18) /home/user/go/src/pkg/encoding/gob/encode.go:35 +0x64 encoding/gob.(*Encoder).encodeArray(0xc2000d20f0, 0xc200445338, 0xc200625700, 0x8267e0, 0x10, ...) /home/user/go/src/pkg/encoding/gob/encode.go:377 +0x34 encoding/gob.func·013(0x0, 0xc200617910, 0xc2004a80a0) /home/user/go/src/pkg/encoding/gob/encode.go:585 +0xf2 encoding/gob.encodeReflectValue(0xc200617910, 0x682280, 0xc2004a8080, 0x172, 0xc20014ad00, ...) /home/user/go/src/pkg/encoding/gob/encode.go:404 +0x105 encoding/gob.(*Encoder).encodeMap(0xc2006b7a80, 0xc200445338, 0x762260, 0xc200aba450, 0x156, ...) /home/user/go/src/pkg/encoding/gob/encode.go:418 +0x1e0 encoding/gob.func·015(0xc2001d92a0, 0xc2006b7940, 0xc200aba450) /home/user/go/src/pkg/encoding/gob/encode.go:609 +0x18e encoding/gob.(*Encoder).encodeStruct(0xc200445300, 0xc200445338, 0xc20014acc0, 0xc200aba410) /home/user/go/src/pkg/encoding/gob/encode.go:370 +0x118 encoding/gob.func·016(0xc2001cc700, 0xc2006b7900, 0xc200aba410) /home/user/go/src/pkg/encoding/gob/encode.go:618 +0x6d encoding/gob.(*Encoder).encodeStruct(0xc200445300, 0xc200445338, 0xc20014ac80, 0xc200aba410) /home/user/go/src/pkg/encoding/gob/encode.go:370 +0x118 encoding/gob.(*Encoder).encode(0xc200445300, 0xc200445338, 0x736320, 0xc200aba410, 0x196, ...) /home/user/go/src/pkg/encoding/gob/encode.go:746 +0x182 encoding/gob.(*Encoder).EncodeValue(0xc200445300, 0x70ce40, 0xc200aba410, 0x160, 0x0, ...) /home/user/go/src/pkg/encoding/gob/encoder.go:252 +0x387 encoding/gob.(*Encoder).Encode(0xc200445300, 0x70ce40, 0xc200aba410, 0x1, 0x0, ...) /home/user/go/src/pkg/encoding/gob/encoder.go:177 +0x5b xxx/yyy.(*GobEndpointClient).requestStrmWorker(0xc2005a71e0) /home/user/code/go/src/xxx/yyy/GobEndpointClient.go:101 +0x141 created by xxx/yyy.NewGobEndpointClient /home/user/code/go/src/xxx/yyy/GobEndpointClient.go:46 +0xf1 ..... Please let me know if there is any additional information I can supply or steps I should take. (Or if this should be a separate issue) Thanks!
rog's one from comment #18 starts with a GC panic: panic: invalid memory address or nil pointer dereference fatal error: panic during gc [signal 0xb code=0x1 addr=0x2f pc=0x42a693] goroutine 3442 [running]: [fp=0xc2009be5c8] runtime.throw(0xfa3b9c) /home/rog/go-release/src/pkg/runtime/panic.c:473 +0x67 [fp=0xc2009be608] runtime.panicstring(0xfa3908) /home/rog/go-release/src/pkg/runtime/panic.c:484 +0x51 [fp=0xc2009be620] runtime.sigpanic() /home/rog/go-release/src/pkg/runtime/os_linux.c:236 +0x168 [fp=0xc2009be9c0] scanblock(0x7f1e1a678000, 0x7f1e1a6781f8, 0x14, 0xc2009be900) /home/rog/go-release/src/pkg/runtime/mgc0.c:793 +0xa93 [fp=0xc2009bea10] markroot(0xc20010b480, 0x1000000010) /home/rog/go-release/src/pkg/runtime/mgc0.c:1199 +0xab [fp=0xc2009bea88] runtime.parfordo(0xc20010b480) /home/rog/go-release/src/pkg/runtime/parfor.c:105 +0x9b [fp=0xc2009bebb8] gc(0x7f1e18e4b354) /home/rog/go-release/src/pkg/runtime/mgc0.c:1930 +0x29d ----- stack segment boundary ----- [fp=0x7f1e18e4b368] runtime.gc(0xc200000000) /home/rog/go-release/src/pkg/runtime/mgc0.c:1857 +0x11b [fp=0x7f1e18e4b3c0] runtime.mallocgc(0x1000, 0x100000001, 0x600000001) /home/rog/go-release/src/pkg/runtime/zmalloc_linux_amd64.c:101 +0x1e4 [fp=0x7f1e18e4b3f8] makeslice1(0x82a780, 0x1000, 0x1000, 0x7f1e18e4b440) /home/rog/go-release/src/pkg/runtime/slice.c:61 +0x89 [fp=0x7f1e18e4b428] runtime.makeslice(0x82a780, 0x1000, 0x1000, 0xc2004304b0, 0x1000, ...) /home/rog/go-release/src/pkg/runtime/slice.c:34 +0x9a [fp=0x7f1e18e4b4d8] os.(*File).readdirnames(0xc200389a68, 0x1, 0x0, 0x0, 0x0, ...) /home/rog/go-release/src/pkg/os/dir_unix.go:23 +0xb8
I spent some time changing the code so that it no longer uses unsafe (it calls the old unsafe code through an rpc interface), and I managed to trigger a panic again. I saw the same panic as in reply #1 above. @dave cheney: the mongoSocket isn't nil, it really is 0xa. I put this check at the start of the function: if socket == nil { panic("nil socket") } if uintptr(unsafe.Pointer(socket)) < 0x100 { panic(fmt.Errorf("socket is %p", socket)) } and the second panic was triggered, not the first.
some more panics for the collection, all in code with cgo excised, GOMAXPROCS=1, go version e77430da3316, ubuntu 12.04, amd64. The tests pass with the race detector enabled. http://paste.ubuntu.com/5679089/ http://paste.ubuntu.com/5679077/ (these are the same as the panic in comments #1 and #21 above) http://paste.ubuntu.com/5679072/ http://paste.ubuntu.com/5679084/ http://paste.ubuntu.com/5679086/ http://paste.ubuntu.com/5679090/
And another one I haven't seen before: http://paste.ubuntu.com/5679671/
Unfortunately I've no more time to spend on this issue. If anyone else wants to try to reproduce the symptoms, you'll need to be using Ubuntu on amd64. You can get the Go code with "go get launchpad.net/juju-core/worker/uniter". The tests will not run unless you have an SSL-capable mongod binary in your path. You can grab one from http://juju-dist.s3.amazonaws.com/tools/mongo-2.2.0-precise-amd64.tgz . The sha1sum of that tar archive is 608f4a5b84b8a4cfab8484aa90976f051382b957. Once you've done that, "go test launchpad.net/juju-core/worker/uniter" should reproduce the problem about one time in four. If you want to run it with the race detector, you'll need to raise the test timeout (I used 2h though that's overkill) I'm sorry I haven't managed to reduce the amount of code involved. For now, I'm reverting to Go revision 15909.
@daniel to excise cgo from the code, this sequence should work (assuming a single-element GOPATH): cd $GOPATH/launchpad.net/juju-core/worker/uniter bzr pull --overwrite lp:~rogpeppe/juju-core/268-no-cgo mv $GOPATH/launchpad.net/goyaml $GOPATH/launchpad.net/goyaml.orig cd $GOPATH/launchpad.net bzr branch lp:~rogpeppe/+junk/yamlserver goyaml rm -r $GOPATH/pkg # stop the go tool from seeing the old goyaml archives cd goyaml go run server.go & go test please let me know if you have problems - i haven't tested the above. remember to move goyaml.orig back to goyaml when you're done :-)
Here's a little gdb session that should shed some light on the problem. (gdb) bt #0 runtime.raise (sig=void) at /home/daniel/src/go/src/pkg/runtime/sys_linux_amd64.s:85 /* uninteresting frames */ #8 0x0000000000497d89 in fmt.(*pp).fmtString (p=0xc20043e4e0, v= <error reading variable: Address 0xdeaddeaddeaddead out of bounds>, verb= 118, goSyntax=true) at /home/daniel/src/go/src/pkg/fmt/print.go:531 #9 0x000000000049d052 in fmt.(*pp).printReflectValue (p= []uint8 = {...}, value=..., verb=118, plus=false, goSyntax=true, depth=2, wasString=false) at /home/daniel/src/go/src/pkg/fmt/print.go:879 /* a bunch of uninteresting frames */ (gdb) f 9 #9 0x000000000049d052 in fmt.(*pp).printReflectValue (p= []uint8 = {...}, value=..., verb=118, plus=false, goSyntax=true, depth=2, wasString=false) at /home/daniel/src/go/src/pkg/fmt/print.go:879 879 p.fmtString(f.String(), verb, goSyntax) (gdb) whatis f type = reflect.StructField /* actual type is reflect.Value */ (gdb) info locals f Python Exception <type 'exceptions.OverflowError'> signed integer is greater than maximum: Python Exception <type 'exceptions.OverflowError'> signed integer is greater than maximum: f = {Name = , PkgPath = , Type = {tab = 0xc20043e4f0, data = 0xa1}, Tag = { str = 0x17 <Address 0x17 out of bounds>, len = 2}, Offset = 833223655425, Index = []int, Anonymous = false} (gdb) x/2xg &f 0x7fa5700fcb08: 0x0000000000818c00 0x000000c2005730c0 (gdb) x/2xg 0x000000c2005730c0 0xc2005730c0: 0xdeaddeaddeaddead 0x0000000000000008 (gdb) Seems like the pointer in the interface isn't seen as a pointer by the GC and the block gets collected.
I created a patch that may (or may not) fix this issue: https://golang.org/cl/8363045/ Please test. The patch hasn't been submitted for review to golang-dev yet.
Here's another gdb session (gdb) bt /* panic frames that happen because of dereferencing 0xa #7 0x000000000048ae35 in sync.(*Mutex).Lock (m=0xa) at /home/daniel/src/go/src/pkg/sync/mutex.go:43 #8 0x00000000005f25fa in labix.org/v2/mgo.(*mongoSocket).Acquire (socket= 0xa, isMaster=false) at /home/daniel/src/go_prj/src/labix.org/v2/mgo/socket.go:151 #9 0x00000000005f168e in labix.org/v2/mgo.(*Session).acquireSocket (s= 0xc20023f120, slaveOk=true, ~anon1=0x0, ~anon2=...) at /home/daniel/src/go_prj/src/labix.org/v2/mgo/session.go:2902 /* uninteresting frames */ (gdb) f 9 #9 0x00000000005f168e in labix.org/v2/mgo.(*Session).acquireSocket (s= 0xc20023f120, slaveOk=true, ~anon1=0x0, ~anon2=...) at /home/daniel/src/go_prj/src/labix.org/v2/mgo/session.go:2902 2902 s.slaveSocket.Acquire() (gdb) whatis s type = struct labix.org/v2/mgo.Session * (gdb) p s $1 = (struct labix.org/v2/mgo.Session *) 0xc20023f120 (gdb) x/20xg 0xc20023f120 0xc20023f120: 0x000000c20023f150 0xdeaddeaddeaddead /* first two words here is an embedded mutex.*/ 0xc20023f130: 0x0000000000000005 0x000000c2002d74b0 /* mongocluster pointer and slaveSocket pointer */ 0xc20023f140: 0x000000000000000a 0x0000000000000000 /* rest of memory dump for context */ 0xc20023f150: 0x000000c20023f180 0xdeaddeaddeaddead 0xc20023f160: 0x0000000000000008 0x000000c20021e540 0xc20023f170: 0x0000000000000011 0x0000000000000000 0xc20023f180: 0x000000c20023f2a0 0xdeaddeaddeaddead 0xc20023f190: 0x0000000000000008 0x000000c2002d74c0 0xc20023f1a0: 0x000000000000000d 0x0000000000000000 0xc20023f1b0: 0x000000c20021e2e0 0x00000000009cf3d0 (gdb) Some very weird values happening here.
Messed up my analysis of that memory dump. The first 3 quadwords are an RWMutex, 4th is pointer to mongocluster, 5th is pointer to slavesocket. The thing I just realized is that the dump means that regular pointers on the stack are being collected, which points the search away from a potential bug in the GC bytecode. AFAIK, the GC still treats values on the stack as arbritary pointers, so the only path through the GC bytecode is going to be defaultProg.
Another annotated gdb session, this time with the fault happening while GCing. Loading Go Runtime support. (gdb) bt #0 runtime.raise (sig=void) at /home/daniel/src/go/src/pkg/runtime/sys_linux_amd64.s:85 /* A couple of frames for the dereference panic */ #5 0x000000000040ea1f in scanblock (wbuf=void, wp=void, nobj=void, keepworking=void) at /home/daniel/src/go/src/pkg/runtime/mgc0.c:763 #6 0x0000000000410ad5 in gc (args=void) at /home/daniel/src/go/src/pkg/runtime/mgc0.c:1931 /* reflect.Call through here, can't go any further */ (gdb) f 5 #5 0x000000000040ea1f in scanblock (wbuf=void, wp=void, nobj=void, keepworking=void) at /home/daniel/src/go/src/pkg/runtime/mgc0.c:763 763 if(eface->data >= arena_start && eface->data < arena_used) { /* this isn't the line it fails on. it fails when trying to access the type pointer on the next line */ (gdb) p &eface /* address of eface pointer on the stack */ $2 = ( *) 0xc2003c47d8 (gdb) x/1gx 0xc2003c47d8 /* address of the eface */ 0xc2003c47d8: 0x000000c200649ac0 (gdb) x/2gx 0x000000c200649ac0 /* memory that makes up the eface. the first word is the type pointer that causes the panic */ 0xc200649ac0: 0x0000000000006d00 0x000000c2002bb378 (gdb) If anyone has some bright ideas for things to look for, I'd be happy to investigate them.
0xc2003c47d8 would be the stack segment running GC. The location of the Eface is 0x000000c200649ac0. Anyway, the value of b is 0x000000c200649a20. Doing some poking around, and I can't find anything out of place there. If anyone has a place to store a 345 mb core file, I'd be happy to upload it.
https://docs.google.com/file/d/0B2YDcoUQXXrUcVdLVlR1LWg2NE0/edit?usp=sharing The core file https://docs.google.com/file/d/0B2YDcoUQXXrUR05sS0RTMDFwQ0k/edit?usp=sharing The binary Compiled with go version devel +1d49ee511d95 Fri Apr 05 15:23:03 2013 -0700 linux/amd64 Not including atoms patch.
My panics mention in issue #5208 have also stopped occurring. Thanks!