Such tests are meant to verify how the application behaves over a period of time under various chaos conditions. As such, there will be a number of recurring tasks that schedule chaos tests and verify that the application was able to recover.
make cluster reusable
The cluster setup must be able to recover the cluster configuration from an already deployed cluster.
Tests should also not use hardcoded layer numbers; all expectations must be set relative to the start of the test (e.g. +X layers from the current layer, not from genesis).
endpoints
recover them from the deployed pods
configuration
recover it from the spacemesh API
private keys with funds
persist them in a configmap and recover them from it (see the sketch after this list)
nonce
recover the nonce from the API
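A minimal sketch of the configmap round-trip for the funded private keys, using k8s.io/client-go. The namespace, configmap name and hex encoding are assumptions for illustration, not the existing systest layout.

```go
package longevity

import (
	"context"
	"encoding/hex"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	namespace     = "longevity"       // assumed namespace of the deployed cluster
	keysConfigMap = "funded-accounts" // assumed configmap name
)

// recoverOrPersistKeys returns the keys stored in the configmap, or persists
// the freshly generated ones if the configmap does not exist yet.
func recoverOrPersistKeys(ctx context.Context, client kubernetes.Interface, fresh map[string][]byte) (map[string][]byte, error) {
	cms := client.CoreV1().ConfigMaps(namespace)
	cm, err := cms.Get(ctx, keysConfigMap, metav1.GetOptions{})
	if err == nil {
		// configmap exists: recover the keys from it
		keys := map[string][]byte{}
		for name, encoded := range cm.Data {
			raw, err := hex.DecodeString(encoded)
			if err != nil {
				return nil, err
			}
			keys[name] = raw
		}
		return keys, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// first run: persist the generated keys so later runs can reuse them
	cm = &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: keysConfigMap}, Data: map[string]string{}}
	for name, raw := range fresh {
		cm.Data[name] = hex.EncodeToString(raw)
	}
	if _, err := cms.Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		return nil, err
	}
	return fresh, nil
}
```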
tests scheduling
Longevity tests will be implemented as a sequence (graph) of tasks, some of which will run concurrently. This sequence should repeat on some schedule (e.g. cron).
We have the option to implement such a scheduler ourselves (e.g. by writing additional golang code, as in the sketch below) or to reuse something native to k8s. Doing it ourselves would potentially give us more flexibility, but it may not be required.
candidates: pingcap framework, argo-workflows
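A minimal sketch of the "implement it ourselves" option, assuming github.com/robfig/cron/v3 for the schedule; the task names, bodies and cron spec are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/robfig/cron/v3"
)

// Task is a unit of setup, chaos or verification logic (hypothetical type).
type Task func(ctx context.Context) error

// runSequence runs the tasks in order and stops at the first failure.
func runSequence(ctx context.Context, tasks ...Task) {
	for _, task := range tasks {
		if err := task(ctx); err != nil {
			log.Printf("task failed: %v", err)
			return
		}
	}
}

func main() {
	partition := Task(func(ctx context.Context) error { /* enable chaos */ return nil })
	verify := Task(func(ctx context.Context) error { /* verify recovery */ return nil })

	c := cron.New()
	// run the partition-then-verify sequence every 6 hours
	if _, err := c.AddFunc("0 */6 * * *", func() {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
		defer cancel()
		runSequence(ctx, partition, verify)
	}); err != nil {
		log.Fatal(err)
	}
	c.Run() // blocks, executing scheduled sequences
}
```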
tasks
A task is a piece of code that implements setup, chaos or verification logic. In the existing systests all 3 types are merged into the test; for the benefit of code reuse we may want to split them.
In that case a test would be composed of at least 3 concurrent tasks. However, separating chaos from the testing logic will lead to a lack of information that is required for correct recovery or verification. For example, in the case of a majority/minority partition chaos, verification and recovery will differ depending on which side of the partition a node ended up on.
So for practical reasons it is allowed to keep all 3 parts (setup, chaos, verification) in the same golang test, and to use each golang test as a task (see the sketch below).
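A minimal sketch of one golang test that keeps setup, chaos and verification together and is treated as a task. The partition and progress helpers are placeholders; the point is that the verification knows which nodes ended up on which side of the partition because the split is decided in the same test.

```go
package longevity

import (
	"context"
	"testing"
	"time"

	"golang.org/x/sync/errgroup"
)

// placeholder helpers; real implementations would talk to the deployed cluster.
func startPartition(ctx context.Context, d time.Duration) error        { return nil }
func expectLayerProgress(ctx context.Context, nodes []string) error    { return nil }

func TestMajorityMinorityPartition(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	// setup: decide the split here so that verification knows which nodes
	// are on the majority side and which are on the minority side.
	majority := []string{"smesher-0", "smesher-1", "smesher-2"}
	minority := []string{"smesher-3", "smesher-4"}

	eg, ctx := errgroup.WithContext(ctx)
	// chaos: keep the partition up for 10 minutes
	eg.Go(func() error { return startPartition(ctx, 10*time.Minute) })
	// verification: the majority side must keep making progress during the chaos
	eg.Go(func() error { return expectLayerProgress(ctx, majority) })
	if err := eg.Wait(); err != nil {
		t.Fatal(err)
	}
	// after the partition heals, the minority side must catch up as well
	if err := expectLayerProgress(ctx, minority); err != nil {
		t.Fatal(err)
	}
}
```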
verification
sanity
Test that blocks are produced and transactions are executed. Additionally, we may test that different parts of the protocol work as expected by exposing more information over grpc streams (one example is the tortoise beacon).
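A minimal sketch of a sanity check that expects layers to advance relative to the layer observed at the start of the test, not relative to genesis. The LayerClient interface is a hypothetical stand-in for whatever grpc client we end up using; the real spacemesh API types are not shown.

```go
package longevity

import (
	"context"
	"fmt"
	"time"
)

// LayerClient is a hypothetical wrapper around the node's grpc API.
type LayerClient interface {
	CurrentLayer(ctx context.Context) (uint32, error)
}

// ExpectLayersAdvance waits until the node has produced `delta` more layers
// than it had when the check started, or fails when the context expires.
func ExpectLayersAdvance(ctx context.Context, c LayerClient, delta uint32) error {
	start, err := c.CurrentLayer(ctx)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("layers did not advance by %d: %w", delta, ctx.Err())
		case <-ticker.C:
			current, err := c.CurrentLayer(ctx)
			if err != nil {
				return err
			}
			if current >= start+delta {
				return nil
			}
		}
	}
}
```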
correctness
We should make use of porcupine and adapt it to test linearizability of the consensus protocol under chaos; basically we will have our own setup for jepsen-style tests (see the sketch below).
TODO: explore elle (go-elle)
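A minimal sketch of wiring porcupine in, assuming the Model/Operation API from github.com/anishathalye/porcupine. The single-register model is only a placeholder; a real model of the consensus state (e.g. account balances under transactions) would replace it.

```go
package longevity

import (
	"github.com/anishathalye/porcupine"
)

// registerInput is the recorded input of one operation: a write of Value,
// or a read when Write is false.
type registerInput struct {
	Write bool
	Value int
}

var registerModel = porcupine.Model{
	Init: func() interface{} { return 0 },
	Step: func(state, input, output interface{}) (bool, interface{}) {
		in := input.(registerInput)
		if in.Write {
			// writes always succeed and set the new state
			return true, in.Value
		}
		// a read is linearizable only if it observed the current state
		return output.(int) == state.(int), state
	},
	Equal: func(a, b interface{}) bool { return a == b },
}

// CheckHistory returns true if the operations recorded during chaos
// (porcupine.Operation values with call/return timestamps) are linearizable.
func CheckHistory(ops []porcupine.Operation) bool {
	return porcupine.CheckOperations(registerModel, ops)
}
```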
performance and stability
Verify that metrics stay within the expected bounds. TBD, as this depends on observability, which is mostly lacking.
TODO
NOTE(dshulyak) maybe create separate issues