Queries hang after crashed ChangeStream consumer

taherv commented 5 years ago

What version of MongoDB are you using (`mongod --version`)?

3.6.5

What version of Go are you using (`go version`)?

1.9.2

What operating system and processor architecture are you using (`go env`)?

<go env here>

What did you do?

Start a mongodb server (docker run -ti --net=host mongo --replSet rs0).
Log in to mongo using the shell, and convert it into a single-member replica set (docker run -ti --net=host mongo mongo --eval "rs.initiate()")
Now start a Go program that does the following using the globalsign driver: open a session, create an empty collection and initiate a ChangeStream. IMPORTANT: use a very high MaxAwaitTimeMS: cs, err := collection.Watch(pipeline, mgo.ChangeStreamOptions{ FullDocument: "updateLookup", MaxAwaitTimeMS: 360 * time.Hour})
The go program will wait for changestreamevents, and is blocked in a call to Next() on the changestream.
Kill this program while it is blocked in Next(). (e.g. using "kill -9 ")

BAD Things now start happening:

Things on the database side are not very good: it still shows one open connection. Even after waiting up to 10 minutes, the connection count does not drop to 0.
And now for the really BAD part. If you run some other go program using the globalsign driver that does straightforward things like Find() on the same collection above, the thread just hangs.

At this point I declare the database UNHEALTHY. Other points :

I have not tried running the second program on a different collection, or a different database.

It will take me a while to write a test program that does the bare minimum, ask me questions to diagnose further.

I confirm that if I change MaxAwaitTimeMS to something low (like 1 second), then the connections in the db drop down to 0, and I cannot reproduce the subsequent hangs.

I understand Go clients are supposed to close Sessions and ChangeStreams. But we can't guarantee that will happen if the Go program crashes, panics or is just killed!
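To make the limits of client-side cleanup concrete, here is a minimal sketch using only the Go standard library. The `resource` type and `runWithCleanup` helper are hypothetical stand-ins for an mgo Session or ChangeStream, not part of the driver. It shows that a deferred Close() still runs when the work panics, and that SIGINT and SIGTERM can be intercepted. SIGKILL (kill -9) cannot be caught, so no cleanup of this kind runs in that case.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// resource is a hypothetical stand-in for an mgo Session or ChangeStream.
type resource struct{ closed bool }

// Close marks the resource as released.
func (r *resource) Close() { r.closed = true }

// runWithCleanup runs work and always closes r afterwards, even if work
// panics. A panic is recovered and reported as an error.
func runWithCleanup(r *resource, work func()) (err error) {
	defer r.Close() // runs last, on both normal return and panic
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("recovered: %v", p)
		}
	}()
	work()
	return nil
}

func main() {
	// SIGINT and SIGTERM can be intercepted so that cleanup gets a chance
	// to run. SIGKILL cannot be intercepted by any process.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
	defer signal.Stop(sigs)

	r := &resource{}
	err := runWithCleanup(r, func() { panic("consumer crashed") })
	fmt.Println("error:", err, "closed:", r.closed)
}
```

This only covers panics and catchable signals. For a hard kill, the server has to notice the dead connection on its own, which is the part that appears not to happen here while the long await is outstanding.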

Please help !!

Can you reproduce the issue on the latest `development` branch?

domodwyer commented 5 years ago

Hi @taherv

There's a lot of interesting information here, but it seems highly unlikely that this is a driver issue - one crashed process doesn't have much power over the other process at the driver level. It is possible this is a MongoDB bug as it is the only component that persists any state after the first process crashes.

Are the first steps required to reproduce? Specifically does the MongoDB instance have to be a replica set? Does it happen with multiple replicas? It'd be helpful to narrow down the test case to the minimal reproducer.

When you say that running Find() with mgo fails after the above bug is triggered, is this also the case for the mongo shell? Or is it specific to mgo?

Looking forward to the test code, thanks for the report!

Dom

taherv commented 5 years ago

> Specifically does the MongoDB instance have to be a replica set?

Yes, you need it to be a replicaSet because changestreams feed from the oplog.

What are the semantics if an mgo program is blocked in .Next() in one goroutine and calls Close() on the session in another goroutine?
a. Does the .Next() call terminate with an error?
b. Does Session.Close() block until all cursors have no outstanding activity?
c. Is the database supposed to handle its own cleanup on unclean network disconnects?

If the mgo driver side aspect is documented, where would I find it ? I would like to understand the design, here.
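For reference, this is what option (a) looks like at the plain socket level, as a sketch using only the standard library. It is an analogy, not a statement about what mgo does: `blockedReadAfterClose` is a hypothetical helper, and `net.Pipe` stands in for the driver's connection. A Read that is blocked in one goroutine returns with an error as soon as the same connection is closed from another goroutine.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// blockedReadAfterClose starts a Read that blocks because nothing is ever
// written, closes the same connection from the calling goroutine, and
// returns the error that the Read reported.
func blockedReadAfterClose() error {
	client, server := net.Pipe()
	defer server.Close()

	errc := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := client.Read(buf) // blocks: the peer never writes
		errc <- err
	}()

	// Give the reader a moment to block, then close from this goroutine.
	time.Sleep(50 * time.Millisecond)
	client.Close()

	return <-errc
}

func main() {
	fmt.Println("blocked Read returned:", blockedReadAfterClose())
}
```

Whether a blocked ChangeStream .Next() behaves the same way when Session.Close() is called is exactly the open question, since Close() only releases the socket back to the cluster.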

taherv commented 5 years ago

https://github.com/globalsign/mgo/blob/eeefdecb41b842af6dc652aaea4026e8403e62df/session.go#L2056

// Close terminates the session.  It's a runtime error to use a session
// after it has been closed.
func (s *Session) Close() {
    s.m.Lock()
    if s.mgoCluster != nil {
        debugf("Closing session %p", s)
        s.unsetSocket()
        s.mgoCluster.Release()
        s.mgoCluster = nil
    }
    s.m.Unlock()
}

The session doesn't explicitly try to clean up in-flight operations... hmmm.

unsetSocket also isn't doing anything fancy https://github.com/globalsign/mgo/blob/eeefdecb41b842af6dc652aaea4026e8403e62df/session.go#L5191
