platonic-io / detsys-testkit

A test kit for fast and deterministic system tests.

Estimate the search space #175

Open symbiont-stevan-andjelkovic opened 3 years ago

symbiont-stevan-andjelkovic commented 3 years ago

In order to display progress and to gauge how well our lineage-driven optimisations are working we should try to estimate the search space.

For a first draft, that also includes notes about the tricky parts, see #174.

symbiont-stevan-andjelkovic commented 3 years ago

In PR #184 we added an explanation of how to estimate the random search fault space given a failure specification and one concrete test run.
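To make that baseline concrete, here is a minimal sketch of such an estimate. It assumes the failure specification only bounds the number of message omissions, and that faults are chosen from the events observed in the concrete run; the function name is hypothetical, not part of the test kit:

```python
from math import comb

def random_search_fault_space(n_events: int, max_faults: int) -> int:
    """Number of distinct fault sets random search can draw, assuming each
    fault is an omission of one of the n_events observed in the concrete
    run and the failure specification allows at most max_faults of them."""
    return sum(comb(n_events, i) for i in range(max_faults + 1))

# e.g. 20 observed messages and at most 3 omissions:
# C(20,0) + C(20,1) + C(20,2) + C(20,3) = 1 + 20 + 190 + 1140 = 1351
print(random_search_fault_space(20, 3))  # => 1351
```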

There are several ways to improve on this.

For starters we need to figure out how many interesting concrete programs there are to begin with. Property-based testing typically focuses on this aspect, i.e. generating random test cases, but in distributed systems even a very small test case, e.g. a write followed by a read, contains so much non-determinism that generating programs matters less than exploring the non-determinism of each one.
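A back-of-the-envelope count shows why. Assuming, hypothetically, that a small test case generates around 10 internal messages, the scheduler can deliver them in up to 10! orders, before we have injected a single fault:

```python
from math import factorial

# Hypothetical numbers: a write followed by a read against a 3-node
# replicated register might generate on the order of 10 internal messages.
m = 10
print(factorial(m))  # 3628800 possible delivery orders (before pruning
                     # causally impossible ones), for just one tiny program
```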

It would also be interesting to estimate what the fault space is if the random search is augmented with information about the network trace (i.e. which events actually happened in the concrete run) and with memory of which faults it already tried in previous runs. This would bring it closer to what lineage-driven fault injection (LDFI) does, but with very simple constraints (don't retry what you've already tried, as opposed to eliminating possibilities based on what we know). The tricky part is that as we introduce faults, they might trigger retries which spawn new messages that were not in the original network trace. We could perhaps make an estimate assuming that no new messages are spawned at the start of the test, and then additionally calculate how many new messages are spawned per run on average and how much they enlarge the fault space as we run the test. A sketch of such a sampler follows below.
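This is a minimal sketch of random search with memory, assuming faults are omissions drawn from the events seen in the trace so far, and that each run reports any newly spawned messages, which grow the candidate pool (all names here are hypothetical):

```python
import random
from itertools import combinations

def untried_fault_sets(events, max_faults, tried):
    """All fault sets over the known events that have not been tried yet."""
    for i in range(max_faults + 1):
        for fs in combinations(sorted(events), i):
            if frozenset(fs) not in tried:
                yield frozenset(fs)

def search(initial_trace, max_faults, run, max_runs):
    events, tried = set(initial_trace), set()
    for _ in range(max_runs):
        candidates = list(untried_fault_sets(events, max_faults, tried))
        if not candidates:
            return None  # fault space exhausted without finding a bug
        faults = random.choice(candidates)
        tried.add(faults)
        new_messages, failed = run(faults)
        if failed:
            return faults
        events |= set(new_messages)  # retries enlarge the fault space
    return None
```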

To make the estimate closer to what actually happens during LDFI we would need to take all the constraints it produces into account, which seems hard. A possible alternative is to get statistics from the SAT solver itself about its estimated search space. This would let us compare how efficient LDFI is versus random search, or even one version of LDFI versus another with slightly tweaked constraints.

It would also be interesting, but hard, to take symmetries or equivalence classes of states into account, e.g. does it really matter which of the 4 follower nodes we choose to crash?
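A small worked example of how much a symmetry can save, assuming all follower nodes are interchangeable: without the symmetry we count every subset of followers to crash, with it only the number crashed matters:

```python
from math import comb

def crash_choices(n_followers: int, max_crashes: int) -> tuple[int, int]:
    without_symmetry = sum(comb(n_followers, i) for i in range(max_crashes + 1))
    with_symmetry = max_crashes + 1  # only "how many" matters, 0..max_crashes
    return without_symmetry, with_symmetry

# 4 interchangeable followers, up to 2 crashes: 11 subsets collapse to 3 classes
print(crash_choices(4, 2))  # => (11, 3)
```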

A simpler problem would be to run a sweep search where we start with a small failure specification and iteratively enlarge it until we hit some timeout, and see how far we get as a measure of how efficient the search is for a bug-free program. By analogy, this would be closer to a benchmark, while properly estimating the search space is closer to calculating the complexity of the algorithm.
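A sketch of such a sweep, assuming a hypothetical run_all(spec) that explores everything one failure specification allows, and a grow(spec) that enlarges it, e.g. by permitting one more omission or crash:

```python
import time

def sweep(initial_spec, run_all, grow, timeout_seconds):
    """Enlarge the failure specification until the time budget runs out;
    the last fully explored spec is a rough measure of search efficiency."""
    deadline = time.monotonic() + timeout_seconds
    spec, completed = initial_spec, None
    while time.monotonic() < deadline:
        run_all(spec)      # explore everything this spec allows
        completed = spec
        spec = grow(spec)  # e.g. allow one more fault
    return completed
```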

symbiont-stevan-andjelkovic commented 3 years ago

For an explanation of how LDFI cuts down the fault space and how to estimate the number of runs needed see #196.

symbiont-stevan-andjelkovic commented 3 years ago

For starters we need to figure out how many interesting concrete programs there are to begin with.

One possible way to compute this would be to identify test case equivalence classes up to a certain depth, e.g. write(k1, v1) || write(k2, v2); ... is equivalent to write(k2, v2) || write(k1, v1); .... This could perhaps be done by generating a new random test case, running it to produce a history and a model, hashing the response part of the history together with the model, and repeating; all test cases with the same hash are in the same equivalence class. A sketch of this procedure is given below.
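A minimal sketch of that procedure, assuming hypothetical gen_test() and run() hooks that produce a random test case and execute it into a (history, model) pair, and history events with a kind field marking responses:

```python
import hashlib
from collections import defaultdict

def equivalence_classes(gen_test, run, n_samples):
    """Bucket randomly generated test cases by a hash of the response part
    of their history together with the resulting model."""
    classes = defaultdict(list)
    for _ in range(n_samples):
        test_case = gen_test()
        history, model = run(test_case)
        responses = [e for e in history if e.kind == "response"]
        key = hashlib.sha256(repr((responses, model)).encode()).hexdigest()
        classes[key].append(test_case)
    return classes  # len(classes) estimates how many interesting programs exist
```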