Improve systests - Githubissues

jellonek commented 6 days ago

Description

At the moment systests contain a lot of spagetti code which seems to be time to time extended with another piece glued into it. In result:

instead of specifying a number of smeshers pods (named in comments as "containers") it has to be calculated from cluster size reduced by bootnodes, old nodes, possibly other pods. when attempted to use in place of "smesher" two pods (networking part, stateless worker part, as in node split) usage of cluster size (and also keys table in cluster context) went wild
pod definitions which are basically a configuration are constructed from code, while it should more probably be constructed from parametrised yaml templates rendered in runtime (that should improve readability of test cases and e.g. provide an option to have different configurations per smesher pod, instead of using the same config for all smesher pods)

fasmat commented 6 days ago

Here are my 2 cents to the current problems with system tests:

There should be a single test that asserts all the functional requirements we have for go-spacemesh:

nodes that successfully publish ATXs are eligible to participate in Hare and receive rewards, nodes that are not do not receive rewards - currently covered by TestSmeshing
- nodes that equivocate (e.g. by double publishing or publishing an invalid PoST) lose their eligibility for rewards - currently covered by TestEquivocation and TestPostMalfeasanceProof
- nodes agree on the state of eligibilities - currently covered by multiple tests (e.g. TestSmeshing)
Spawning accounts and sending funds - currently covered by TestSmeshing
vesting accounts gain access to their tokens based on the schedule they are supposed to get access to - currently covered by TestSmeshing
check if beacon and active set generation works as expected - at the moment that would be via bootstrapping Test should behave as we do on mainnet and should be updated when we don't need bootstrapping any more
adding and removing nodes from the network works, i.e. an added node is able to sync from scratch and becomes eligible for rewards after publishing their first (valid) ATX. And removing multiple (up to 60% of) eligible nodes doesn't cause the network to halt or have other problems - currently covered by TestAddNodes and TestFailedNodes
if one (or multiple) poets fail the network still operates as expected, as long as at least one PoET is available - currently covered by TestsPoetsFailures

Non-functional tests probably require their own test each:

Test if the network heals from a state of partition (TestPartition_50_50 and TestPartition_30_70)
Test if beacon generation works even in a fallback state (TestFallback)
Test that a timeskew between nodes doesn't cause the network to disagree on timings (TestShortTimeskew)

I am unsure if this needs its own test or can be part of TestSmeshing:

Checkpoint functionality (currently tested by TestCheckpoint)

The basic idea is that the big test spawns a network at the start and then runs multiple tests (partially at the same time) on the same network with the same nodes instead of having to tear down and setup a new network multiple times.

Additionally we should consider to cover all supported modes of smeshing during the one big test: some nodes should use supervised smeshing, some should use remote smeshing, some multi-smeshing (ideally with at least one of those marrying their identities).

ivan4th commented 6 days ago

This probably should be taken into account: https://metoro.io/blog/go-production-performance-gotcha-gomaxprocs TL;DR: GOMAXPROCS may default to an unreasonable value in case of a containerized workload running on a machine together with a lot of other workloads. Not exactly sure if it is relevant, but probably worth checking

poszu commented 5 days ago

@fasmat It makes sense and it would speed up the tests significantly :+1:, but we need to be careful and avoid creating "subtests" running in parallel on a single network that depend on each other as it would be a nightmare to debug if things go south. I think it's possible to have a test structure like this:

func TestNetwork(t *testing.T) {
  // setup a cluster
  t.Run("sending transactions", ....) // send transactions and observe accounts changes
  t.Run("poet failures", ....) // kills a poet and oberves that ATXs are still produced
  .... more

The problem is, that once things stop working because the poet was killed, the other test(s) would most likely start to fail too and telling why (what's the root of the problem) might be difficult.

jellonek commented 4 days ago

@ivan4th we have that hardcoded to 4 in https://github.com/spacemeshos/go-spacemesh/blob/cc65f271f3e21b519532f2924ab227ccf352c2de/systest/cluster/nodes.go#L914

fasmat commented 4 days ago

Also regarding GOMAXPROCS: I changed it to the suggestion in the blog @ivan4th linked, but didn't have a feeling it changed anything in terms of stability - multiple runs still were throttled, so I changed it back to the hard coded 4.

spacemeshos / go-spacemesh

Improve systests #6494

Description