mongodb-partners / mongo-rocks

MongoDB storage integration layer for the Rocks storage engine
402 stars 100 forks source link

Random failures of 'CheckReplOplog' in *_passthrough tests suites #108

Open igorsol opened 7 years ago

igorsol commented 7 years ago

Hello,

This issue affects all passthrough test suites. Here is the list (probably incomplete):

This issue also affects both 3.4 and master branches. You can see example failures here (see the red lines in the left column on the pages):

Our investigation has shown that this issue is caused by this commit: https://github.com/mongodb-partners/mongo-rocks/commit/422336 by @AdallomRoy

We also have a possible fix here: https://github.com/mongodb-partners/mongo-rocks/pull/107 We test this fix now and so far our tests are good but it would be great to have more opinions and reviews.

AdallomRoy commented 7 years ago

Hi Igor, Great catch! Doesn't this patch basically remove the optimization for backward cursors? or am I missing something?

Roy.

igorcanadi commented 7 years ago

@AdallomRoy Yes, I think this just removes the optimization for the backward cursors. However, master branch of mongo reimplements oplog visibility rules significantly, so this code-path will require big rewrite soon anyway. See https://github.com/mongodb-partners/mongo-rocks/issues/106

AdallomRoy commented 7 years ago

Got it. Probably the better fix would be that the backward cursor would check if the record is currently visible in terms of oplog visibility... but not sure it's critical. Can you also merge it to v3.2? Thanks!

igorcanadi commented 7 years ago

Done

igorsol commented 7 years ago

Hi Roy and Igor,

You are right, my fix disabled optimisation for backward cursors. But since we still want this optimisation for v3.4 and v3.2 I did additional debugging and found what is going wrong with backward cursors:

During the tests sometimes save()/restore() sequence is called on just created backward cursor BEFORE first call to next(). This causes lost of _skipNextAdvance state which should be true in case of oplog first record retrieval hack. Thus when next() is called redundant advance happens and cursor loses one document.

The fix for this is here: https://github.com/igorsol/mongo-rocks/commit/031488eca37f8e4e3e47fa3f2a860478d28d2e87 It re-enables first record retrieval hack for backward cursors and fixes CheckReplOplog issues. I will create pull request after additional tests.

igorcanadi commented 7 years ago

Good find, thanks @igorsol !

igorsol commented 7 years ago

Our testing has finally finished. Pull request is here: https://github.com/mongodb-partners/mongo-rocks/pull/111