dvic closed this issue 6 years ago.
Could it be a problem with the `-race` flag? We removed the `-race` flag and so far the tests have stopped failing.
I got a similar (but not identical) deadlock & backtrace when running `TestConnectCloseConcurrency`. I think the main sources of this problem are these two stacks:
```
goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175
```
and
```
goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3
```
As near as I can tell:

- Goroutine 11 runs `syncServersLoop`, which loops every few hundred ms and checks the topology of the cluster.
- `syncServersLoop` calls `syncServersIteration` to do its actual work on every pump of the loop.
- `syncServersIteration` spawns a new goroutine (goroutine 30 above) and blocks goroutine 11 waiting for it on a `sync.WaitGroup`.
- In that new goroutine, `syncServersIteration` calls `cluster.syncServer()` to probe the server and add it to the `cluster.masters` and `cluster.servers` slices.
- `cluster.syncServer` explicitly opens a socket to this particular server with a call to `server.AcquireSocket` (as opposed to opening a socket to any server in the cluster).
- `cluster.syncServer` calls `server.isMaster()` with this socket, to ask if the server is a replset master.
- `isMaster` creates a new session and explicitly assigns the passed-in socket to it. It prepares a command and then attempts to execute it with `session.Run`.
- That goes through `Database.Run()`, which calls `session.acquireSocket()`.
- `acquireSocket()` should be a no-op, since the `isMaster` call a few frames above explicitly set the socket via `s.setSocket`. However, it apparently fails both the `s.masterSocket != nil && s.masterSocket.dead == nil` check and the `s.slaveSocket != nil && s.slaveSocket.dead == nil && s.slaveOk && slaveOk && (s.masterSocket == nil || s.consistency != PrimaryPreferred && s.consistency != Monotonic)` check, and thus falls into `s.cluster().AcquireSocket()`. THIS, I believe, is the bug: the code higher up the stack is trying to call `isMaster` on a particular server, but this is going to get a connection to any arbitrary server matching the tags.
- `AcquireSocket` looks for a server in its understanding of the topology by checking `cluster.masters.Len()` and `cluster.servers.Len()`. However, cluster discovery hasn't actually run yet - `syncServersIteration` (further up our call stack in this goroutine) is supposed to populate those collections with a call to `cluster.addServer()`, but it needs to finish its call to `syncServer`/`isMaster` first.
- `AcquireSocket` attempts to poke the `syncServers` loop on goroutine 11 by calling `cluster.syncServers`, which just writes to a channel. This is actually a total no-op because both sides of the channel are read/written nonblocking and the data is just a signal, but that is a different bug and not the actual issue.
- `AcquireSocket` then waits on the condition variable `cluster.serverSynced.Wait()`. Nothing is going to signal it:
  - `syncServersLoop` is not iterating at the moment, because goroutine 11 is blocked on the waitgroup in `syncServersIteration`;
  - `addServer` and `syncServer` are only called from `syncServersIteration`, which is exactly what we are blocking on goroutine 30.

Phew. That was fun. (A minimal sketch of this blocking pattern follows below.)
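To make the shape of that cycle concrete, here is a standalone toy (not mgo code - all names are stand-ins) showing how a goroutine blocked on a `sync.WaitGroup` and a worker parked on a `sync.Cond` that only the first goroutine would ever signal lock each other out:

```go
// Toy reproduction of the blocking pattern described above - not mgo code.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	synced := sync.NewCond(&mu) // stands in for cluster.serverSynced

	var wg sync.WaitGroup
	wg.Add(1)

	// "Goroutine 30": the per-server sync worker.
	go func() {
		defer wg.Done()
		mu.Lock()
		// Waits for a broadcast that only the "sync loop" below would send.
		synced.Wait()
		mu.Unlock()
	}()

	// "Goroutine 11": the sync loop. It blocks on wg.Wait() forever, so it
	// never reaches the Broadcast - a circular wait, i.e. a deadlock.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		mu.Lock()
		synced.Broadcast()
		mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("finished (does not happen)")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: worker waits on the cond, loop waits on the WaitGroup")
	}
}
```

As the walkthrough notes, the "poke" from `AcquireSocket` can't break this cycle either: it is a non-blocking channel send, so if nobody happens to be receiving at that moment the signal is simply dropped.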
I'm pretty sure the bug is that `isMaster` is using `session.setSocket` to ensure the command issued with `Run` goes to the right server, but if something is wrong with that socket, instead of passing an error up to `isMaster`, `Run` calls `acquireSocket`, which just attempts to make a new socket to any random server in the cluster. The deadlock is not a code path that should ever be made to work, I think.
Thoughts?
Hi @dvic and @KJTsanaktsidis
First off - @dvic thanks for the solid report, and @KJTsanaktsidis thanks for diving deeper into mgo than is good for your sanity!
We'll take a look at this - we've never seen any deadlocks ourselves, but the possibility is definitely there; there's an amazing amount of interplay with the locks (as @KJTsanaktsidis can clearly attest!). Do either of you have any reproducing code we can look at?
Dom
I’ll have a look and see if I can find a solid reproduction next week - maybe a “mongo” server that accepts then closes all connections might trigger this code path?
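Something along these lines might do it - a throwaway stub that accepts and immediately drops every connection (the listen address here is arbitrary):

```go
// Sketch of a stub "mongod" that accepts TCP connections and immediately
// closes them, to provoke the sync/isMaster failure path. Address is arbitrary.
package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:40001")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		// Close straight away so the driver sees the socket die mid-handshake.
		conn.Close()
	}
}
```

Pointing the driver's dial at an address like that should force it down the sync failure path without needing a real mongod.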
@domodwyer I think I've managed to provide a repro in https://github.com/globalsign/mgo/pull/121 - the test in the first commit fails about 20% of the time when I run it with `go test -check.v -check.f "S.TestNoDeadlockOnClose" -timeout 25s` on my machine.
Hi @dvic
We're going to merge #121 into development ASAP (thanks to @KJTsanaktsidis!) and cut a hotfix to master once it's tested. In the meantime, would you be able to run your tests against the development mgo branch to check whether it resolves this issue?
Dom
Hi @domodwyer, sure no problem. Thanks! Will try it now and get back to you.
Hey @dvic
It's not merged just yet - I'll post here when it's done 👍
Dom
No problem - for now I just used https://github.com/zendesk/mgo/tree/fix_dial_deadlock directly; TravisCI is running... 🤞
Good news: I've run the test suite three times now and each run passed without problems 👍 I'll keep them running just to be sure, and I can also run it a few times on the dev branch once you're ready.
@domodwyer Tests keep passing, #121 definitely seems to solve the problem (for me at least). Let me know if you want me to perform additional test runs on the dev branch.
This is great news - thanks @dvic for reporting and @KJTsanaktsidis for such a comprehensive analysis and fix! Open source communities are alive and well! 👍
I will close this after the hotfix - thanks a lot!
Dom
Really happy to help - having this library be actively maintained helps everyone!
Hi @dvic, @KJTsanaktsidis
Sorry for disappearing, I was out of the country! It looks like this has been fixed (thanks!), but with a direct push to development, so this didn't auto-close (I'll also find out how that happened - it should be PR only). Closing now.
I will cut a hotfix release after a test run - thanks again!
Dom
Hi,
Every once in a while our Mongo test suite gets killed on TravisCI. We run Go 1.10 and use Docker for our test suites. Our Postgres and Neo4j test suites run just fine with this setup, but with mgo and MongoDB we're having these issues.
Stack trace information can be found below. Any idea why this is happening?