stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License

Cursor Timeout #82

Closed: mark-roggenkamp-snapav closed this issue 9 years ago

mark-roggenkamp-snapav commented 9 years ago

I'm getting the following error after mosql runs for a while. I'm using mosql 0.4.2 against Mongo 2.4.

    /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/networking.rb:221:in `check_response_flags': Query response returned CURSOR_NOT_FOUND. Either an invalid cursor was specified, or the cursor may have timed out on the server. (Mongo::OperationFailure)
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/networking.rb:214:in `receive_response_header'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/networking.rb:181:in `receive'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/networking.rb:144:in `receive_message'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:596:in `send_get_more'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:526:in `refresh'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:513:in `num_remaining'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:187:in `has_next?'
    from /var/lib/gems/1.9.1/gems/mongoriver-0.4.0/lib/mongoriver/tailer.rb:154:in `stream'
    from /var/lib/gems/1.9.1/gems/mongoriver-0.4.0/lib/mongoriver/abstract_persistent_tailer.rb:32:in `stream'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:175:in `optail'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:170:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:16:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/bin/mosql:5:in `<top (required)>'
    from /usr/local/bin/mosql:19:in `load'
    from /usr/local/bin/mosql:19:in `<main>'
nelhage commented 9 years ago

Hm. Did your cluster experience a failover or any other such event? Is there anything suspicious in the mongod logs?

I'll dig a bit, but I'll note that if this only happens intermittently, it should be mostly safe to ignore as long as you automatically restart mosql -- it's designed to handle crashes and recover automatically and safely.
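
Something as simple as a wrapper that re-runs mosql whenever it exits will do the job; monit or systemd are just more robust versions of the same idea. A minimal sketch in Ruby follows -- the flag values and connection strings below are placeholders, so substitute your own collection map and URIs:

    #!/usr/bin/env ruby
    # Bare-bones supervisor sketch: re-run mosql whenever it exits.
    # The --collections/--mongo/--sql values are placeholders for your own setup.
    loop do
      ok = system('mosql',
                  '--collections', 'collections.yml',
                  '--mongo', 'mongodb://my-mongo.example.com:27017',
                  '--sql',   'postgres://mosql@my-postgres.example.com/mosql')
      warn "mosql exited (success=#{ok.inspect}); restarting in 5 seconds"
      sleep 5
    end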

mark-roggenkamp-snapav commented 9 years ago

No failover or anything similar that I've found thus far, but I'll keep looking. I'll drop a monit config in to keep it going.

I'm pulling from two different replica sets, and only one has experienced the issue.

The docs say, "By default, the server will automatically close the cursor after 10 minutes of inactivity or if client has exhausted the cursor." It's a pretty busy replica set, so I can't imagine inactivity coming into play under normal circumstances.

I'll keep digging. Thanks.

mark-roggenkamp-snapav commented 9 years ago

I think this may be the result of mosql falling behind and eventually running past the oldest entry in the oplog.

I'm going to close this. I'll re-open it if I find anything that points to something code-related.

kungfoox commented 9 years ago

Hey there. This has been happening quite often for me. We have never been able to resume properly without doing a full reimport: either we get the same error over and over again, or the data from the time it went down until the time it resumed is missing. Let me know if there is anything we can do to help it resume properly.

mark-roggenkamp-snapav commented 9 years ago

In our case it was because mosql was unable to process and store the data fast enough, so it slowly fell behind. Once it's trying to process data older than the oldest entry the oplog still holds, you get this error, I believe. One option might be to increase the oplog size if it's a temporary high-traffic condition that mosql could catch up from after the storm is over (the sketch below shows one way to check how much oplog window you currently have). Other than that, I imagine it's mostly about scaling up the mosql server and/or Postgres. The only way to scale mosql out, I think, would be to split each collection's processing into a separate mosql process; I don't believe there are any options for parallelizing the processing otherwise, though I'd love for someone maintaining the project to verify that.
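
Here's a rough sketch of that oplog-window check using the 1.x Ruby driver (the same driver family mosql 0.4.x pulls in). The hostname is a placeholder, and running rs.printReplicationInfo() in the mongo shell gives you the same number:

    require 'mongo'  # mongo gem 1.x

    # Estimate the oplog window: time between the oldest and newest oplog entries
    # on one member. :slave_ok is there in case you point this at a secondary.
    client = Mongo::MongoClient.new('my-secondary.example.com', 27017, :slave_ok => true)
    oplog  = client['local']['oplog.rs']

    oldest = oplog.find({}, :sort => [['$natural',  1]], :limit => 1).to_a.first
    newest = oplog.find({}, :sort => [['$natural', -1]], :limit => 1).to_a.first

    window_seconds = newest['ts'].seconds - oldest['ts'].seconds
    puts "oplog window: ~#{window_seconds / 60} minutes"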

kungfoox commented 9 years ago

Thanks for the reply. I think that's exactly what's been happening. We only had about 20 minutes of oplog during our peak hours, which isn't much time to recover. We ended up increasing the oplog size on the secondaries, so we'll see how much benefit that gives us. The other problem was connecting to the secondaries through readPreference; for some reason that seems to fail a lot. Instead, we now point directly at the secondaries and use slaveOk. Fingers crossed that holds up.
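
For anyone hitting the same thing, the two connection styles look roughly like this. Hosts and the replica set name are placeholders, and exact option support depends on your driver version, so treat these as illustration only:

    # Going through the replica set and letting the driver pick a secondary (what kept failing for us):
    mongodb://mongo1.example.com,mongo2.example.com/?replicaSet=rs0&readPreference=secondaryPreferred

    # Pointing straight at one secondary and allowing reads from it (what we do now):
    mongodb://mongo2.example.com:27017/?slaveOk=true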