Closed bmpalatiello closed 6 years ago
definitely a question for @jamesblackburn
@jamesblackburn great.
bump! @jamesblackburn
@jamesblackburn
Hi @bmpalatiello - we make an effort with arctic to be reasonably robust with respect to network and other failures. In VersionStore
we write the chunks, check that the chunks are available, and only then write the version document which references them. We've not experienced cases where the version document has been published but chunks are missing, other than cases caused by bugs in arctic. We retry operations fairly aggressively in the face of network (and other) errors, since from the client's point of view these failures tend to be indistinguishable from one another.
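To make the ordering concrete, here is a minimal in-memory sketch of the write path as I understand the description above (plain dicts standing in for MongoDB collections; the names `chunk_store`, `version_store`, and `write_version` are illustrative and are not arctic's actual schema or API):

```python
import hashlib

chunk_store = {}    # sha -> chunk bytes (stand-in for the chunks collection)
version_store = {}  # (symbol, version) -> list of chunk shas (stand-in for version docs)

def write_version(symbol, version, chunks):
    """Write chunks first, verify them, and publish the version document last."""
    shas = []
    for data in chunks:
        sha = hashlib.sha1(data).hexdigest()
        chunk_store[sha] = data          # step 1: write the data chunks
        shas.append(sha)

    missing = [s for s in shas if s not in chunk_store]  # step 2: verify availability
    if missing:
        raise IOError("chunks missing; version not published")

    # step 3: only now publish the version document referencing the chunks,
    # so readers never see a version whose data is incomplete
    version_store[(symbol, version)] = shas

def read_version(symbol, version):
    """Resolve a version document back to its chunk data."""
    return [chunk_store[s] for s in version_store[(symbol, version)]]
```

The key property is that the version document is written last: a crash between steps 1 and 3 leaves orphaned chunks (which can be cleaned up later) rather than a dangling version that references missing data.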
A number of the Jepsen torture tests look at data loss and rollbacks in the presence of network failures, system pauses and other torturing of the MongoDB cluster. There are almost certainly failure modes we don't handle, or ways to break arctic we haven't thought of. However, I suspect similar issues would arise against our relational databases, and those databases aren't resilient to machine and network failure either - we tend to assume the database is always present and handle failover much more manually there...
The reality is that we haven't faced corruption or other major issues in the 4+ years we have been using arctic on top of Mongo. Our system has hundreds of reads and writes per second 24x7.
If you do encounter issues, or find bugs in the code, we are always keen to know!
While not directly an arctic-specific issue: since MongoDB is used in production at AHL, has AHL encountered or dealt with the data errors described in this article? The underlying problem appears to have been fixed, but since AHL depends heavily on calculating the right signals from correct data, I'm curious whether this issue has been encountered or is otherwise handled.
Thank you.