Open taherv opened 5 years ago
Hi @taherv
There's a lot of interesting information here, but it seems highly unlikely that this is a driver issue - one crashed process doesn't have much power over the other process at the driver level. It is possible this is a MongoDB bug as it is the only component that persists any state after the first process crashes.
Are the first steps required to reproduce? Specifically, does the MongoDB instance have to be a replica set? Does it happen with multiple replicas? It'd be helpful to narrow the test case down to a minimal reproducer.
When you say that running Find() with mgo fails after the above bug is triggered, is this also the case for the mongo shell? Or is it specific to mgo?
Looking forward to the test code, thanks for the report!
Dom
Specifically does the MongoDB instance have to be a replica set?
Yes, it needs to be a replica set because change streams are fed from the oplog.
What are the semantics if an mgo program is blocked in .Next() in one goroutine and calls Close() on the session in another goroutine?
a. Does the .Next() call terminate with an error?
b. Does Session.Close() block until all cursors have no outstanding activity?
c. Is the database supposed to handle its own cleanup on unclean network disconnects?
If the mgo driver side of this is documented, where would I find it? I would like to understand the design here.
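To make options (a) and (b) concrete, here is a minimal, driver-independent sketch of one possible design in which Close() unblocks a pending Next() with an error and does not wait for it. This is not mgo code and says nothing about what mgo actually does: the `stream` type, `ErrClosed`, and `nextAfterClose` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrClosed is a hypothetical sentinel error; it is not part of mgo.
var ErrClosed = errors.New("stream closed")

// stream is a hypothetical stand-in for a cursor or change stream.
type stream struct {
	events chan string
	done   chan struct{}
	once   sync.Once
}

func newStream() *stream {
	return &stream{events: make(chan string), done: make(chan struct{})}
}

// Next blocks until an event arrives or Close is called.
func (s *stream) Next() (string, error) {
	select {
	case ev := <-s.events:
		return ev, nil
	case <-s.done:
		return "", ErrClosed
	}
}

// Close unblocks every pending Next call and returns immediately.
// It is safe to call more than once.
func (s *stream) Close() {
	s.once.Do(func() { close(s.done) })
}

// nextAfterClose calls Next in one goroutine, closes the stream from
// the calling goroutine, and reports the error that Next returned.
func nextAfterClose() error {
	s := newStream()
	errc := make(chan error, 1)
	go func() {
		_, err := s.Next()
		errc <- err
	}()
	s.Close()
	return <-errc
}

func main() {
	fmt.Println(nextAfterClose()) // prints "stream closed"
}
```

In this design the answer to (a) would be "yes" and to (b) "no". Whether mgo behaves this way is exactly the open question.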
https://github.com/globalsign/mgo/blob/eeefdecb41b842af6dc652aaea4026e8403e62df/session.go#L2056
// Close terminates the session. It's a runtime error to use a session
// after it has been closed.
func (s *Session) Close() {
	s.m.Lock()
	if s.mgoCluster != nil {
		debugf("Closing session %p", s)
		s.unsetSocket()
		s.mgoCluster.Release()
		s.mgoCluster = nil
	}
	s.m.Unlock()
}
The session doesn't explicitly try to clean up outstanding operations... hmmm.
unsetSocket also isn't doing anything fancy https://github.com/globalsign/mgo/blob/eeefdecb41b842af6dc652aaea4026e8403e62df/session.go#L5191
What version of MongoDB are you using (mongod --version)?
What version of Go are you using (go version)?
What operating system and processor architecture are you using (go env)?
What did you do?
BAD Things now start happening:
Things on the database side are not very good: it still shows one open connection. Even after waiting up to 10 minutes, the connection count does not drop to 0.
And now for the really BAD part. If you run some other go program using the globalsign driver that does straightforward things like Find() on the same collection above, the thread just hangs.
At this point I declare the database UNHEALTHY. Other points :
It will take me a while to write a test program that does the bare minimum. In the meantime, ask me questions to diagnose further.
I confirm that if I change MaxAwaitTimeMS to something low (like 1 second), then the connection count in the db drops to 0, and I cannot reproduce the subsequent hangs.
I understand Go clients are supposed to close Sessions and ChangeStreams. But we can't guarantee that will happen if the Go program crashes, panics, or is simply killed!
Please help !!
Can you reproduce the issue on the latest development branch?