Such tests are meant to verify how the application behaves over a period of time under various chaos conditions. As such, there will be a number of recurring tasks that schedule chaos tests and verify that the application was able to recover.
make cluster reusable
The cluster setup must be able to recover the cluster configuration from an already deployed cluster.
Tests should also not use hardcoded layer numbers; all expectations must be set relative to the start of the test (e.g. +X layers from the current layer, not from genesis).
endpoints
recover them from the deployed pods
configuration
recover it from the spacemesh API
private keys with funds
persist them in a configmap and recover them from it (see the sketch after this list)
nonce
recover the nonce from the API
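A minimal sketch of the configmap round-trip for the funded private keys, using k8s.io/client-go. The namespace, configmap name and hex encoding are assumptions for illustration, not the existing systest layout.

```go
package longevity

import (
	"context"
	"encoding/hex"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	namespace     = "longevity"       // assumed namespace of the deployed cluster
	keysConfigMap = "funded-accounts" // assumed configmap name
)

// recoverOrPersistKeys returns the keys stored in the configmap, or persists
// the freshly generated ones if the configmap does not exist yet.
func recoverOrPersistKeys(ctx context.Context, client kubernetes.Interface, fresh map[string][]byte) (map[string][]byte, error) {
	cms := client.CoreV1().ConfigMaps(namespace)
	cm, err := cms.Get(ctx, keysConfigMap, metav1.GetOptions{})
	if err == nil {
		// configmap exists: recover the keys from it
		keys := map[string][]byte{}
		for name, encoded := range cm.Data {
			raw, err := hex.DecodeString(encoded)
			if err != nil {
				return nil, err
			}
			keys[name] = raw
		}
		return keys, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// first run: persist the generated keys so later runs can reuse them
	cm = &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: keysConfigMap}, Data: map[string]string{}}
	for name, raw := range fresh {
		cm.Data[name] = hex.EncodeToString(raw)
	}
	if _, err := cms.Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		return nil, err
	}
	return fresh, nil
}
```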
tests scheduling
Longevity tests will be implemented as a sequence (graph) of tasks, some of which will run concurrently. This sequence should repeat on some schedule (e.g. cron).
We have the option to implement such a scheduler ourselves (e.g. by writing additional golang code, as in the sketch below) or to reuse something native to k8s. Doing it ourselves would potentially give us more flexibility, but it may not be required.
candidates: pingcap framework, argo-workflows
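A minimal sketch of the "implement it ourselves" option, assuming github.com/robfig/cron/v3 for the schedule; the task names, bodies and cron spec are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/robfig/cron/v3"
)

// Task is a unit of setup, chaos or verification logic (hypothetical type).
type Task func(ctx context.Context) error

// runSequence runs the tasks in order and stops at the first failure.
func runSequence(ctx context.Context, tasks ...Task) {
	for _, task := range tasks {
		if err := task(ctx); err != nil {
			log.Printf("task failed: %v", err)
			return
		}
	}
}

func main() {
	partition := Task(func(ctx context.Context) error { /* enable chaos */ return nil })
	verify := Task(func(ctx context.Context) error { /* verify recovery */ return nil })

	c := cron.New()
	// run the partition-then-verify sequence every 6 hours
	if _, err := c.AddFunc("0 */6 * * *", func() {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
		defer cancel()
		runSequence(ctx, partition, verify)
	}); err != nil {
		log.Fatal(err)
	}
	c.Run() // blocks, executing scheduled sequences
}
```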
tasks
A task is a piece of code that implements setup, chaos or verification logic. In the existing systests all 3 types are merged into the test; for the benefit of code reuse we may want to split them.
In that case a test would be composed of at least 3 concurrent tasks. However, separating chaos from the testing logic will lead to a lack of information that is required for correct recovery or verification. For example, in the case of a majority/minority partition chaos, verification and recovery will differ depending on which side of the partition a node ended up on.
So for practical reasons it is allowed to keep all 3 parts (setup, chaos, verification) in the same golang test, and to use each golang test as a task (see the sketch below).
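A minimal sketch of one golang test that keeps setup, chaos and verification together and is treated as a task. The partition and progress helpers are placeholders; the point is that the verification knows which nodes ended up on which side of the partition because the split is decided in the same test.

```go
package longevity

import (
	"context"
	"testing"
	"time"

	"golang.org/x/sync/errgroup"
)

// placeholder helpers; real implementations would talk to the deployed cluster.
func startPartition(ctx context.Context, d time.Duration) error        { return nil }
func expectLayerProgress(ctx context.Context, nodes []string) error    { return nil }

func TestMajorityMinorityPartition(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	// setup: decide the split here so that verification knows which nodes
	// are on the majority side and which are on the minority side.
	majority := []string{"smesher-0", "smesher-1", "smesher-2"}
	minority := []string{"smesher-3", "smesher-4"}

	eg, ctx := errgroup.WithContext(ctx)
	// chaos: keep the partition up for 10 minutes
	eg.Go(func() error { return startPartition(ctx, 10*time.Minute) })
	// verification: the majority side must keep making progress during the chaos
	eg.Go(func() error { return expectLayerProgress(ctx, majority) })
	if err := eg.Wait(); err != nil {
		t.Fatal(err)
	}
	// after the partition heals, the minority side must catch up as well
	if err := expectLayerProgress(ctx, minority); err != nil {
		t.Fatal(err)
	}
}
```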
verification
sanity
Test that blocks are produced and transactions are executed. Additionally, we may test that different parts of the protocol work as expected by exposing more information over grpc streams (one example is the tortoise beacon).
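A minimal sketch of a sanity check that expects layers to advance relative to the layer observed at the start of the test, not relative to genesis. The LayerClient interface is a hypothetical stand-in for whatever grpc client we end up using; the real spacemesh API types are not shown.

```go
package longevity

import (
	"context"
	"fmt"
	"time"
)

// LayerClient is a hypothetical wrapper around the node's grpc API.
type LayerClient interface {
	CurrentLayer(ctx context.Context) (uint32, error)
}

// ExpectLayersAdvance waits until the node has produced `delta` more layers
// than it had when the check started, or fails when the context expires.
func ExpectLayersAdvance(ctx context.Context, c LayerClient, delta uint32) error {
	start, err := c.CurrentLayer(ctx)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("layers did not advance by %d: %w", delta, ctx.Err())
		case <-ticker.C:
			current, err := c.CurrentLayer(ctx)
			if err != nil {
				return err
			}
			if current >= start+delta {
				return nil
			}
		}
	}
}
```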
correctness
We should make use of porcupine and adapt it to test linearizability of the consensus protocol under chaos; basically we will have our own setup for jepsen-style tests (see the sketch below).
TODO: explore elle (go-elle)
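A minimal sketch of wiring porcupine in, assuming the Model/Operation API from github.com/anishathalye/porcupine. The single-register model is only a placeholder; a real model of the consensus state (e.g. account balances under transactions) would replace it.

```go
package longevity

import (
	"github.com/anishathalye/porcupine"
)

// registerInput is the recorded input of one operation: a write of Value,
// or a read when Write is false.
type registerInput struct {
	Write bool
	Value int
}

var registerModel = porcupine.Model{
	Init: func() interface{} { return 0 },
	Step: func(state, input, output interface{}) (bool, interface{}) {
		in := input.(registerInput)
		if in.Write {
			// writes always succeed and set the new state
			return true, in.Value
		}
		// a read is linearizable only if it observed the current state
		return output.(int) == state.(int), state
	},
	Equal: func(a, b interface{}) bool { return a == b },
}

// CheckHistory returns true if the operations recorded during chaos
// (porcupine.Operation values with call/return timestamps) are linearizable.
func CheckHistory(ops []porcupine.Operation) bool {
	return porcupine.CheckOperations(registerModel, ops)
}
```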
performance and stability
Verify that metrics stay within the expected bounds. TBD, as this depends on observability, which is mostly lacking.
TODO
NOTE(dshulyak) maybe create separate issues