dvic closed this issue 6 years ago.
Could it be a problem with the `-race` flag? We removed the `-race` flag and so far the tests have stopped failing.
I got a similar (but not identical) deadlock & backtrace when running `TestConnectCloseConcurrency`. I think the main sources of this problem are these two stacks:
```
goroutine 30 [semacquire]:
sync.runtime_notifyListWait(0xc42023a6e8, 0xc400000001)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:510 +0x11a
sync.(*Cond).Wait(0xc42023a6d8)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/cond.go:56 +0x8e
github.com/globalsign/mgo.(*mongoCluster).AcquireSocket(0xc42023a6c0, 0x1, 0xc420240b01, 0x2540be400, 0x2540be400, 0x0, 0x0, 0x0, 0x1000, 0xc420082700, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:644 +0xff
github.com/globalsign/mgo.(*Session).acquireSocket(0xc420240b60, 0xc5f001, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:4853 +0x271
github.com/globalsign/mgo.(*Database).Run(0xc4200779b8, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:799 +0x5e
github.com/globalsign/mgo.(*Session).Run(0xc420240b60, 0xc5f0c0, 0xc42000d200, 0xc10ec0, 0xc420232630, 0x0, 0x1)
/home/travis/gopath/src/github.com/globalsign/mgo/session.go:2270 +0xba
github.com/globalsign/mgo.(*mongoCluster).isMaster(0xc42023a6c0, 0xc4202c20f0, 0xc420232630, 0xc4202c20f0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:182 +0x258
github.com/globalsign/mgo.(*mongoCluster).syncServer(0xc42023a6c0, 0xc4202c00e0, 0xd, 0xc42001ed20, 0xc4202c00e0, 0xc42023a6c0, 0xc440000000, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:231 +0x434
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1.1(0xc420292060, 0xc420026d2a, 0xd, 0xc420292070, 0xc420026d00, 0xc4202867b0, 0xc42023a6c0, 0xc4202867e0, 0xc420286810, 0x0, ...)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:553 +0x1fb
created by github.com/globalsign/mgo.(*mongoCluster).syncServersIteration.func1
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:525 +0x175
```
and
```
goroutine 11 [semacquire]:
sync.runtime_Semacquire(0xc42029206c)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc420292060)
/home/travis/.gimme/versions/go1.10.linux.amd64/src/sync/waitgroup.go:129 +0xb3
github.com/globalsign/mgo.(*mongoCluster).syncServersIteration(0xc42023a6c0, 0x0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:582 +0x4c5
github.com/globalsign/mgo.(*mongoCluster).syncServersLoop(0xc42023a6c0)
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:390 +0x17c
created by github.com/globalsign/mgo.newCluster
/home/travis/gopath/src/github.com/globalsign/mgo/cluster.go:81 +0x2e3
```
As near as I can tell:

- Goroutine 11 runs `syncServersLoop`, which loops every few hundred ms and checks the topology of the cluster.
- `syncServersLoop` calls `syncServersIteration` to do its actual work on every pump of the loop.
- `syncServersIteration` spawns a new goroutine (goroutine 30 above) and blocks goroutine 11 waiting for it on a `sync.WaitGroup`.
- In that new goroutine, `syncServersIteration` calls `cluster.syncServer()` to probe the server and add it to the `cluster.masters` and `cluster.servers` slices.
- `cluster.syncServer` explicitly opens a socket to this particular server with a call to `server.AcquireSocket` (as opposed to opening a socket to any server in the cluster).
- `cluster.syncServer` calls `server.isMaster()` with this socket, to ask if the server is a replset master.
- `isMaster` creates a new session and explicitly assigns the passed-in socket to it. It prepares a command and then attempts to execute it with `session.Run`.
- That goes through `Database.Run()`, which calls `session.acquireSocket()`.
- `acquireSocket()` should be a no-op, since the `isMaster` call a few frames above explicitly set the socket via `s.setSocket`. However, it apparently fails both the `s.masterSocket != nil && s.masterSocket.dead == nil` check and the `s.slaveSocket != nil && s.slaveSocket.dead == nil && s.slaveOk && slaveOk && (s.masterSocket == nil || s.consistency != PrimaryPreferred && s.consistency != Monotonic)` check, and thus falls into `s.cluster().AcquireSocket()`. THIS, I believe, is the bug: the code higher up the stack is trying to call `isMaster` on a particular server, but this is going to get a connection to any arbitrary server matching the tags.
- `AcquireSocket` looks for a server in its understanding of the topology by checking `cluster.masters.Len()` and `cluster.servers.Len()`. However, cluster discovery hasn't actually run yet - `syncServersIteration` (further up our call stack in this goroutine) is supposed to populate those collections with a call to `cluster.addServer()`, but it needs to finish its call to `syncServer`/`isMaster` first.
- `AcquireSocket` attempts to poke the `syncServers` loop on goroutine 11 by calling `cluster.syncServers`, which just writes to a channel. This is actually a total no-op because both sides of the channel are read/written nonblocking and the data is just a signal, but that is a different bug and not the actual issue.
- `AcquireSocket` then waits on the condition variable `cluster.serverSynced.Wait()`. Nothing is going to signal it:
  - `syncServersLoop` is not iterating at the moment, because goroutine 11 is blocked on the waitgroup in `syncServersIteration`;
  - `addServer` and `syncServer` are only called from `syncServersIteration`, which is exactly what we are blocking on goroutine 30.

Phew. That was fun. (A minimal sketch of this blocking pattern follows below.)
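To make the shape of that cycle concrete, here is a standalone toy (not mgo code - all names are stand-ins) showing how a goroutine blocked on a `sync.WaitGroup` and a worker parked on a `sync.Cond` that only the first goroutine would ever signal lock each other out:

```go
// Toy reproduction of the blocking pattern described above - not mgo code.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	synced := sync.NewCond(&mu) // stands in for cluster.serverSynced

	var wg sync.WaitGroup
	wg.Add(1)

	// "Goroutine 30": the per-server sync worker.
	go func() {
		defer wg.Done()
		mu.Lock()
		// Waits for a broadcast that only the "sync loop" below would send.
		synced.Wait()
		mu.Unlock()
	}()

	// "Goroutine 11": the sync loop. It blocks on wg.Wait() forever, so it
	// never reaches the Broadcast - a circular wait, i.e. a deadlock.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		mu.Lock()
		synced.Broadcast()
		mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("finished (does not happen)")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: worker waits on the cond, loop waits on the WaitGroup")
	}
}
```

As the walkthrough notes, the "poke" from `AcquireSocket` can't break this cycle either: it is a non-blocking channel send, so if nobody happens to be receiving at that moment the signal is simply dropped.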
I'm pretty sure the bug is that `isMaster` is using `session.setSocket` to ensure the command issued with `Run` goes to the right server, but if something is wrong with that socket, instead of passing an error up to `isMaster`, `Run` calls `acquireSocket`, which just attempts to make a new socket to any random server in the cluster. The deadlock is not a code path that should ever be made to work, I think.
Thoughts?
Hi @dvic and @KJTsanaktsidis
First off - @dvic thanks for the solid report, and @KJTsanaktsidis thanks for diving deeper into mgo than is good for your sanity!
We'll take a look at this - we've never seen any deadlocks ourselves, but the possibility is definitely there; there's an amazing amount of interplay with the locks (as @KJTsanaktsidis can clearly attest!). Do either of you have any reproducing code we can look at?
Dom
I’ll have a look and see if I can find a solid reproduction next week - maybe a “mongo” server that accepts then closes all connections might trigger this code path?
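Something along these lines might do it - a throwaway stub that accepts and immediately drops every connection (the listen address here is arbitrary):

```go
// Sketch of a stub "mongod" that accepts TCP connections and immediately
// closes them, to provoke the sync/isMaster failure path. Address is arbitrary.
package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:40001")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		// Close straight away so the driver sees the socket die mid-handshake.
		conn.Close()
	}
}
```

Pointing the driver's dial at an address like that should force it down the sync failure path without needing a real mongod.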
@domodwyer I think I've managed to provide a repro in https://github.com/globalsign/mgo/pull/121 - the test in the first commit fails about 20% of the time when I run it with `go test -check.v -check.f "S.TestNoDeadlockOnClose" -timeout 25s` on my machine.
Hi @dvic
We're going to merge #121 into development ASAP (thanks to @KJTsanaktsidis!) and cut a hotfix to master once it's tested. In the meantime, would you be able to run your tests against the development mgo branch to check whether it resolves this issue?
Dom
Hi @domodwyer, sure no problem. Thanks! Will try it now and get back to you.
Hey @dvic
It's not merged just yet - I'll post here when it's done 👍
Dom
No problem - for now I just used https://github.com/zendesk/mgo/tree/fix_dial_deadlock directly; TravisCI is running... 🤞
Good news: I've run the test suite three times now and each run passed without problems 👍 I'll keep them running just to be sure, and I can also run it a few times on the dev branch once you're ready.
@domodwyer Tests keep passing, #121 definitely seems to solve the problem (for me at least). Let me know if you want me to perform additional test runs on the dev branch.
This is great news - thanks @dvic for reporting and @KJTsanaktsidis for such a comprehensive analysis and fix! Open source communities are alive and well! 👍
I will close this after the hotfix - thanks a lot!
Dom
Really happy to help - having this library be actively maintained helps everyone!
Hi @dvic, @KJTsanaktsidis
Sorry for disappearing, I was out of the country! It looks like this has been fixed (thanks!), but with a direct push to development, so this didn't auto-close (I'll also find out how that happened - it should be PR only). Closing now.
I will cut a hotfix release after a test run - thanks again!
Dom
Hi,
Every once in a while our Mongo test suite gets killed on TravisCI. We run Go 1.10 and use Docker for our test suites. Our Postgres and Neo4j test suites run just fine with this setup, but with mgo and MongoDB we're having these issues.
Stack trace information can be found below. Any idea why this is happening?