Add support for more different kinds of faults

symbiont-stevan-andjelkovic commented 3 years ago

The faults we currently support are:

- Message omissions (which is enough to simulate partitions);
- Fail-stop failures, i.e. permanent crashes of nodes.

These are the same faults as used in the lineage-driven fault injection papers.
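
To make the two fault kinds concrete, here is a minimal sketch of how they could be modelled. The names `Omission`, `Crash`, `is_delivered` and `partition` are hypothetical and are not detsys-testkit's actual types; the point is only that a partition can be expressed as a set of omissions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Omission:
    """Drop the message sent from `sender` to `receiver` at logical time `at`."""
    sender: str
    receiver: str
    at: int


@dataclass(frozen=True)
class Crash:
    """Permanently stop `node` from logical time `at` onwards (fail-stop)."""
    node: str
    at: int


Fault = Union[Omission, Crash]


def is_delivered(sender: str, receiver: str, at: int, faults: set) -> bool:
    """Return False if any fault prevents this message from being delivered."""
    for fault in faults:
        if isinstance(fault, Omission):
            if (fault.sender, fault.receiver, fault.at) == (sender, receiver, at):
                return False
        elif isinstance(fault, Crash):
            # A crashed node neither sends nor receives from `at` onwards.
            if fault.node in (sender, receiver) and at >= fault.at:
                return False
    return True


def partition(side_a, side_b, times):
    """A partition is just omissions in both directions between the two sides."""
    return {
        Omission(s, r, t)
        for t in times
        for a in side_a
        for b in side_b
        for (s, r) in ((a, b), (b, a))
    }
```

For example, `partition(["n1"], ["n2", "n3"], range(2, 4))` cuts `n1` off from the other two nodes during logical times 2 and 3.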

We also partially support:

- Messages being delayed (latency) or reordered.

This follows from the scheduler choosing the arrival times of messages. Note, however, that these faults are introduced randomly via the seed of the run and are not subject to lineage-driven optimisations.
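
A rough sketch of why a seeded scheduler gives delays and reorderings for free. The function `schedule` and its `max_latency` parameter are assumptions made for illustration, not the real scheduler's interface.

```python
import heapq
import random


def schedule(messages, seed, max_latency=5):
    """Give each (sent_at, payload) a random arrival time and return
    (arrival, seq, payload) tuples in arrival order.

    The same seed always yields the same ordering, so a run can be replayed.
    """
    rng = random.Random(seed)
    queue = []
    for seq, (sent_at, payload) in enumerate(messages):
        arrival = sent_at + rng.randint(1, max_latency)  # random latency
        # `seq` breaks ties deterministically between equal arrival times.
        heapq.heappush(queue, (arrival, seq, payload))
    return [heapq.heappop(queue) for _ in range(len(queue))]
```

Because latencies are drawn independently per message, a message sent later can arrive earlier, which is the reordering; nothing here is guided by lineage, only by the seed.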

List of other crash faults we'd like to support:

- [ ] Crash-recovery failures, i.e. restarting nodes (which involves both a window of message loss and the loss of ephemeral state);
  - [ ] crash-recovery without losing the disk (only resets the heap/in-memory data);
  - [ ] crash-recovery with losing the disk (nukes the database);
  - [ ] crash-recovery where there's a delay between the crash and the recovery; currently the two happen in the same discrete time step (technically this is partially covered because a pause fault can happen right after the crash).
- [x] Pausing nodes (simulating long garbage collection or I/O pauses);
- [x] Time skews;
- [ ] Message duplication;
- [ ] Network topology change (nodes joining and leaving, arguably not a fault per se);
- [ ] Restricted bandwidth;
- [ ] Filesystem failures (fsyncs not happening before crash/restart), cf.:
  - https://github.com/ligurio/unreliablefs ;
  - https://www.usenix.org/conference/atc20/presentation/rebello and https://github.com/WiscADSL/cuttlefs ;
  - https://www.usenix.org/conference/fast18/presentation/alagappan (for blogpost summary see: https://blog.acolyer.org/2018/02/27/protocol-aware-recovery-for-consensus-based-storage/) ;
  - gray failures (latency spikes): https://www.microsoft.com/en-us/research/publication/gray-failure-achilles-heel-cloud-scale-systems/ ;
  - https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/ (see crash testing section).
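
To illustrate the three crash-recovery variants, here is a minimal sketch. The `Node` class, its fields and the `downtime`/`lose_disk` parameters are hypothetical and only meant to show how the variants differ, not how the test kit would implement them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    memory: dict = field(default_factory=dict)  # ephemeral state (heap)
    disk: dict = field(default_factory=dict)    # persistent state (database)
    down_until: int = 0                         # logical time at which the node is back up

    def crash_and_recover(self, now: int, downtime: int = 0, lose_disk: bool = False) -> None:
        """Restart the node: always wipe memory, optionally wipe the disk too."""
        self.memory.clear()
        if lose_disk:
            self.disk.clear()
        # downtime=0 models crash and recovery happening in the same time step.
        self.down_until = now + downtime

    def receive(self, now: int, key: str, value: str, persist: bool = False) -> bool:
        """Handle a message; return False if it is lost because the node is down."""
        if now < self.down_until:
            return False
        self.memory[key] = value
        if persist:
            self.disk[key] = value
        return True
```

With `downtime=0` this matches the current behaviour where the crash and the recovery share a time step; a positive `downtime` gives the window of message loss.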

There are also many Byzantine faults one can think of, which basically boil down to:

- Arbitrarily change the state of a node at any time;
- Arbitrarily change a message between two nodes while it's in transit.
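
Introduced randomly, both could look roughly like the sketch below. The helpers `corrupt_in_transit` and `corrupt_state` are made up for illustration, and flipping a single bit or overwriting a single key is just one of many possible "arbitrary" changes.

```python
import random


def corrupt_in_transit(message: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of a non-empty message."""
    data = bytearray(message)
    index = rng.randrange(len(data))
    data[index] ^= 1 << rng.randrange(8)
    return bytes(data)


def corrupt_state(state: dict, rng: random.Random, values) -> dict:
    """Return a copy of `state` with one randomly chosen key set to an arbitrary value."""
    mutated = dict(state)
    # Sort the keys so the choice depends only on the seed, not on dict order.
    key = rng.choice(sorted(mutated))
    mutated[key] = rng.choice(values)
    return mutated
```

Passing in a seeded `random.Random` keeps the corruption reproducible from the seed of the run.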

For most of the faults above we know how to introduce them in a random fashion. The tricky part is figuring out how they interact with the lineage-driven optimisation.

symbiont-tom-tantillo commented 3 years ago

💯

symbiont-wayne-collier commented 3 years ago

Thanks for this summary, Stevan!

platonic-io / detsys-testkit

Add support for more different kinds of faults #154