informalsystems / tendermint-rs

Client libraries for Tendermint/CometBFT in Rust!

Conformance tests organization #586

Open andrey-kuprianov opened 3 years ago

andrey-kuprianov commented 3 years ago

With model-based tests starting to replace standard static tests for Tendermint-rs, we need to decide how to organize conformance / integration tests for this repository in particular, and conformance tests for Tendermint protocols in general.

To that end, I am going to write down my initial thoughts here (maybe later also as an RFC), and hope for a broader discussion with anyone interested.

In short, as I see it, we have three ways to approach conformance testing:

Static tests

This is what we have at the moment. Tests are JSON files that can be fed into some test driver (example: tests, driver).

+ simplistic structure, easy to understand at first sight
+ no non-determinism in the test -> conformance across protocol implementations / test instantiations
+ fast test execution
- hard to generate: the generator is in Go, each test requires manual coding, and the generator code is fragile wrt. protocol changes
- hard to understand what a particular test does
- hard to maintain: as the v0.33 -> v0.34 transition shows, sometimes all tests need to be reworked manually, which is a lot of pain (see #563)
- to cover all aspects of functionality we will need lots of them (hundreds), which multiplies with the previous point
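For illustration, here is a minimal sketch of what such a JSON-driven test driver could look like in Rust (the struct fields and the file name are made up for this example, not the actual tendermint-rs test schema):

```rust
// Hypothetical sketch of a static-test driver; fields and paths are illustrative.
use serde::Deserialize;
use std::fs;

#[derive(Deserialize)]
struct LightClientTest {
    description: String,
    initial: serde_json::Value,     // trusted state (placeholder type)
    input: Vec<serde_json::Value>,  // blocks to verify (placeholder type)
    expected_output: String,        // e.g. "verified" or "error"
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load one static test case from disk and hand it to the (omitted) verifier.
    let raw = fs::read_to_string("tests/support/single_step_sequential.json")?;
    let test: LightClientTest = serde_json::from_str(&raw)?;
    println!("running static test: {}", test.description);
    // ... feed `test.initial` and `test.input` into the verifier here ...
    assert_eq!(test.expected_output, "verified");
    Ok(())
}
```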

Fully dynamic model-based tests

Here the test is a simple TLA+ assertion (examples). A TLA+ test is fed into the Apalache model checker and then transformed into a static test similar to those above; all of this happens completely automatically (auto-generated tests, driver).

+ very simple tests
+ easy to understand the test intention
+ relatively easy to maintain: in most cases no changes to the tests are needed when the datastructures change
+ inherent non-determinism: a single TLA+ test can result in different static tests each time, due to the non-determinism in the underlying SMT solver. This allows testing various concrete scenarios from the same abstract test
- inherent non-determinism: no conformance across protocol implementations / test instantiations
- slow test execution (involves running the model checker)
- abstract TLA+ tests still do not cover all necessary scenarios: e.g., they can't be used to test implementations wrt. malformed datastructures

As can be observed above, non-determinism is both good and bad: we need a lot of artificially introduced non-determinism to cover all possible scenarios, but at the same time we need no non-determinism in the final conformance tests. To introduce additional non-determinism at the level of datastructures, we are currently working towards a fuzzing approach (#564). In the Testgen Fuzzer we will use controlled (seedable) non-determinism to fuzz (mutate) datastructures obtained from the model-based tests, in order to trigger failure handling in the implementation.
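To make the controlled (seedable) non-determinism idea concrete, here is a minimal sketch of a seeded mutation; the Header type and the mutations are invented for this example and are not the actual Testgen Fuzzer API:

```rust
// Illustrative sketch of seedable fuzzing; types and mutations are made up.
use rand::{rngs::StdRng, Rng, SeedableRng};

#[derive(Debug, Clone)]
struct Header {
    height: u64,
    chain_id: String,
}

// With the same seed the mutation is reproducible, so the fuzzed test stays
// deterministic across runs and across implementations consuming the seed.
fn mutate_header(header: &Header, seed: u64) -> Header {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut mutated = header.clone();
    match rng.gen_range(0..2) {
        0 => mutated.height = rng.gen(),        // corrupt the height
        _ => mutated.chain_id = String::new(),  // drop the chain id
    }
    mutated
}

fn main() {
    let header = Header { height: 10, chain_id: "test-chain".to_string() };
    // The same seed always yields the same mutated header.
    assert_eq!(
        format!("{:?}", mutate_header(&header, 42)),
        format!("{:?}", mutate_header(&header, 42)),
    );
    println!("{:?}", mutate_header(&header, 42));
}
```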

For conformance tests the non-determinism needs to be brought to zero: each conformance test needs human approval to make sure that the test indeed conveys the desired scenario.

Hybrid tests

Here, we resolve all non-determinism resulting from the model checker beforehand, and use the generated TLA+ executions (example). Each such semi-abstract test will then be combined with a set of random values (seeds), each seed describing the particular faults that will be introduced into the semi-abstract test when translating it into a concrete one. The semi-abstract tests, together with the seed bags, can then be baked into each Tendermint Testgen release, and the release binary will be able to generate static conformance tests on the fly.

+ simple tests; can be traced back to the original TLA+ assertions
+ easy to understand the test intention
+ relatively easy to maintain: in most cases no changes to the tests are needed when the datastructures change
+ the non-determinism from the model checker is already resolved: multiple semi-abstract tests derived from the same abstract one will examine various concrete scenarios
+ the additional non-determinism from fuzzing will be resolved by seed bags: conformance across protocol implementations / test instantiations
+ fast test execution
+ the conformance tests get versioned: each Testgen release will bring the next version of the conformance tests, so we would have, e.g., separate conformance tests for Tendermint versions 0.33 and 0.34
+- the number of tests will be higher than the number of abstract tests (several dozens vs. dozens), but still much smaller than the resulting set of concrete tests (many hundreds)
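To make the seed-bag idea more concrete, here is a minimal sketch of how a release binary could expand a semi-abstract test plus a seed bag into concrete tests; all names and types here are hypothetical, not the actual Testgen code:

```rust
// Hypothetical expansion of a semi-abstract test and a "seed bag" into concrete tests.
#[derive(Debug, Clone)]
struct SemiAbstractTest {
    name: String,
    tla_trace: Vec<String>, // resolved TLA+ execution steps, kept as opaque strings here
}

#[derive(Debug)]
struct ConcreteTest {
    name: String,
    seed: u64,
    steps: Vec<String>,
}

// Expanding the same (test, seed) pair always yields the same concrete test,
// so conformance holds across implementations without shipping the much
// larger concrete tests themselves.
fn expand(test: &SemiAbstractTest, seeds: &[u64]) -> Vec<ConcreteTest> {
    seeds
        .iter()
        .map(|&seed| ConcreteTest {
            name: format!("{}_seed_{}", test.name, seed),
            seed,
            // A real expansion would apply the seeded fuzzer mutations here;
            // this sketch just copies the trace.
            steps: test.tla_trace.clone(),
        })
        .collect()
}

fn main() {
    let test = SemiAbstractTest {
        name: "two_thirds_faulty".to_string(),
        tla_trace: vec!["init".to_string(), "verify".to_string()],
    };
    for concrete in expand(&test, &[1, 7, 42]) {
        println!("{:?}", concrete);
    }
}
```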

Looking forward to your comments!

melekes commented 3 years ago

Sounds like the hybrid approach is the way to go, although I don't understand why one can't generate a set of tests per release and publish them for Rust and Go implementations to use.

the release binary will be able to generate static conformance tests on the fly.

Why generate static tests "on the fly"? Can't you include them with the release? I guess I don't understand the need to generate conformance tests more than once per release.

For conformance tests the non-determinism needs to be brought to zero: each conformance test needs a human approval to be sure that the test indeed conveys the desired scenario.

Are you saying that each of "many hundreds" tests will need to be reviewed?

andrey-kuprianov commented 3 years ago

@melekes thank you for the comments!

Sounds like the hybrid approach is the way to go, although I don't understand why one can't generate a set of tests per release and publish them for Rust and Go implementations to use. Why generate static tests "on the fly"? Can't you include them with the release? I guess I don't understand the need to generate conformance tests more than once per release.

We definitely could; it's only the size that will differ substantially. Each basic test is around 20 KB; fuzzing will mean duplicating each of those many times with some mutations, so we are already talking about MBs for each basic test with fuzzing. Storing the basic tests won't be a problem, but storing the fuzzed ones will be. Also, storing a basic test + a set of "fuzz vectors", or seeds, sounds like a cleaner solution somehow.
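As a rough illustration (the counts below are just assumptions, only the ~20 KB per test comes from the above): with, say, 100 basic tests at ~20 KB each, the basic set is ~2 MB; with ~100 fuzzed variants per basic test, the concrete set would already be on the order of 200 MB, whereas shipping the same basic tests plus seed bags stays in the low-MB range.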

Are you saying that each of "many hundreds" tests will need to be reviewed?

The difference here is again between the basic auto-generated MBT tests and the fuzzed ones. The basic MBT tests don't need manual inspection. The fuzzed ones, I am afraid, in many cases will -- currently I don't see a clear way to separate a mutated test that should be treated as correct from one that should be treated as incorrect. It is probably possible, but only if we manually fix the "fuzz vectors" (those will describe which datastructures are mutated and in which way). Some additional research is required here.

melekes commented 3 years ago

Also storing a basic test + a set of "fuzz vectors", or seeds, sounds like a cleaner solution somehow.

agree