Closed bmpalatiello closed 6 years ago
definitely a question for @jamesblackburn
@jamesblackburn great.
bump! @jamesblackburn
@jamesblackburn
Hi @bmpalatiello - we make an effort with arctic to be reasonably robust with respect to network and other failures. In VersionStore
we write the chunks, check that the chunks are available, and only then write the version document which references them. We've not experienced cases where the version document has been published but chunks are missing, other than cases caused by bugs in arctic. We retry operations fairly aggressively in the face of network (and other) errors, since from the client's point of view these failures tend to be indistinguishable from one another.
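To make the ordering concrete, here is a minimal in-memory sketch of the write path as I understand the description above (plain dicts standing in for MongoDB collections; the names `chunk_store`, `version_store`, and `write_version` are illustrative and are not arctic's actual schema or API):

```python
import hashlib

chunk_store = {}    # sha -> chunk bytes (stand-in for the chunks collection)
version_store = {}  # (symbol, version) -> list of chunk shas (stand-in for version docs)

def write_version(symbol, version, chunks):
    """Write chunks first, verify them, and publish the version document last."""
    shas = []
    for data in chunks:
        sha = hashlib.sha1(data).hexdigest()
        chunk_store[sha] = data          # step 1: write the data chunks
        shas.append(sha)

    missing = [s for s in shas if s not in chunk_store]  # step 2: verify availability
    if missing:
        raise IOError("chunks missing; version not published")

    # step 3: only now publish the version document referencing the chunks,
    # so readers never see a version whose data is incomplete
    version_store[(symbol, version)] = shas

def read_version(symbol, version):
    """Resolve a version document back to its chunk data."""
    return [chunk_store[s] for s in version_store[(symbol, version)]]
```

The key property is that the version document is written last: a crash between steps 1 and 3 leaves orphaned chunks (which can be cleaned up later) rather than a dangling version that references missing data.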
A number of the Jepsen torture tests look at data loss and rollbacks in the presence of network failures, system pauses and other torturing of the MongoDB cluster. There are almost certainly failure modes we don't handle, or ways to break arctic we haven't thought of. However, I suspect similar issues would arise against our relational databases, and those databases aren't resilient to machine and network failure either - we tend to assume the database is always present and handle failover much more manually there...
The reality is that we haven't faced corruption or other major issues in the 4+ years we have been using arctic on top of Mongo. Our system has hundreds of reads and writes per second 24x7.
If you do encounter issues, or find bugs in the code, we are always keen to know!
While not directly an arctic-specific issue: since MongoDB is used in production at AHL, has AHL encountered or dealt with the data errors described in this article? The underlying problem appears to have been fixed, but since AHL depends heavily on calculating the right signals from correct data, I'm curious whether this issue has been encountered or is otherwise handled.
Thank you.