stripe-archive / mosql

MongoDB → PostgreSQL streaming replication
MIT License

Batch inserts #68

Open macobo opened 9 years ago

macobo commented 9 years ago

This pull request adds support for batching sequential INSERTs while tailing. This speeds up tailing under certain conditions without being slower than the current behavior. See also issue #47.

r? @nelhage cc @snoble

Basic strategy is to batch consecutive inserts together per namespace. Batch gets saved whenever:
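To make the per-namespace batching idea concrete, here is a minimal sketch in Python (mosql itself is Ruby, so this is illustrative only). The flush conditions in the sketch (a non-insert op, an insert for a different namespace, or a size limit) are assumptions and are not taken from this pull request; `InsertBatcher`, `write_rows`, and `max_size` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class InsertBatcher:
    """Buffers consecutive oplog inserts for one namespace at a time."""

    def __init__(self, write_rows: Callable[[str, List[Row]], None],
                 max_size: int = 1000) -> None:
        self.write_rows = write_rows  # hypothetical bulk writer, e.g. one multi-row INSERT
        self.max_size = max_size      # assumed size limit, not taken from the PR
        self.namespace: Optional[str] = None
        self.rows: List[Row] = []

    def handle(self, op: dict) -> None:
        is_insert = op["op"] == "i"
        # A non-insert, or an insert for another namespace, ends the current run.
        if self.rows and (not is_insert or op["ns"] != self.namespace):
            self.flush()
        if is_insert:
            self.namespace = op["ns"]
            self.rows.append(op["o"])
            if len(self.rows) >= self.max_size:
                self.flush()
        # Non-insert ops would be applied individually here (omitted in this sketch).

    def flush(self) -> None:
        # Write the buffered rows in one call, then start a fresh buffer.
        if self.rows:
            self.write_rows(self.namespace, self.rows)
            self.rows = []
```

Because only consecutive inserts are grouped, the relative order of inserts, updates, and deletes is preserved.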

Some handwavy measurements for tailing 20000 oplog entries:


Notes on potential future work (that I may or may not work on soonish):

The next "low hanging" performance fruit after this would be optimizing updates, though that would not have as large an effect.

Some ideas on how this could be done: $set entries in the oplog can be translated directly into postgres queries that update only the columns mentioned. Updates without $set can replace the current row in postgres with the data in the oplog entry. The tricky part is figuring out whether and how this applies to tokumx even after mongoriver does its oplog entry translation (if they support any other $ operations), and how to handle unset.
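As a rough illustration of the $set idea, the following Python sketch turns an update entry into a parameterized UPDATE that touches only the mentioned columns. It assumes the classic oplog update shape (`o` holding `$set`, `o2` holding `_id`), a hypothetical field-to-column mapping, a hypothetical key column named `id`, and `%s` placeholders as used by DB-API drivers such as psycopg2. None of this is taken from mosql's code.

```python
from typing import Dict, List, Tuple


def update_sql_from_set(op: dict, table: str, columns: Dict[str, str],
                        key_column: str = "id") -> Tuple[str, List[object]]:
    """Build a parameterized UPDATE touching only the columns named in $set."""
    changes = op["o"].get("$set")
    if not changes:
        # No $set present: the caller would replace the whole row instead.
        raise ValueError("not a $set update")
    # Keep only fields that map to a column; sort for a deterministic statement.
    fields = sorted(f for f in changes if f in columns)
    if not fields:
        raise ValueError("no mapped columns in $set")
    assignments = ", ".join(f'"{columns[f]}" = %s' for f in fields)
    sql = f'UPDATE "{table}" SET {assignments} WHERE "{key_column}" = %s'
    # Values are passed separately so the driver handles quoting.
    params = [changes[f] for f in fields] + [op["o2"]["_id"]]
    return sql, params
```

Fields without a column mapping are simply skipped here; a real implementation would need a policy for them, as well as for unset and any other $ operators.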

Another performance improvement would be to run multiple tailers, in either separate threads or processes, split by namespace. This would, however, require keeping multiple tailing states in the database (one per namespace), and I'm not quite sure what the performance implications are for mongo when the same oplog is queried (with filters?) from multiple processes.

nelhage commented 9 years ago

Modulo the concerns around making sure we don't update timestamps too early, I think this lgtm.
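One reading of the timestamp concern, stated here as an assumption: with batching, the saved tailing timestamp must not advance past ops that are still sitting in an unflushed batch, or a crash would skip them on resume. A minimal Python sketch of that ordering, with hypothetical `write_rows` and `save_timestamp` hooks:

```python
from typing import Callable, List, Optional


class CheckpointedBatch:
    """Buffers rows and persists the tailing timestamp only after a successful write."""

    def __init__(self, write_rows: Callable[[List[dict]], None],
                 save_timestamp: Callable[[int], None]) -> None:
        self.write_rows = write_rows          # hypothetical bulk writer
        self.save_timestamp = save_timestamp  # hypothetical persistence hook
        self.rows: List[dict] = []
        self.pending_ts: Optional[int] = None

    def add(self, row: dict, ts: int) -> None:
        self.rows.append(row)
        self.pending_ts = ts  # remembered, but not persisted yet

    def flush(self) -> None:
        if not self.rows:
            return
        self.write_rows(self.rows)            # if this raises, the checkpoint stays put
        self.save_timestamp(self.pending_ts)  # everything up to pending_ts is now written
        self.rows, self.pending_ts = [], None
```

If the write fails, the saved timestamp still points before the batch, so the same entries are tailed again after a restart.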

barretod commented 8 years ago

Did you figure out how to address the concerns around timestamps? We really need this optimization in our environment.