Open jellonek opened 6 days ago
Here are my 2 cents to the current problems with system tests:
There should be a single test that asserts all the functional requirements we have for go-spacemesh:
TestSmeshing
TestEquivocation
and TestPostMalfeasanceProof
TestSmeshing
)TestSmeshing
TestSmeshing
TestAddNodes
and TestFailedNodes
TestsPoetsFailures
Non-functional tests probably require their own test each:
TestPartition_50_50
and TestPartition_30_70
)TestFallback
)TestShortTimeskew
)I am unsure if this needs its own test or can be part of TestSmeshing
:
TestCheckpoint
)The basic idea is that the big test spawns a network at the start and then runs multiple tests (partially at the same time) on the same network with the same nodes instead of having to tear down and setup a new network multiple times.
Additionally we should consider to cover all supported modes of smeshing during the one big test: some nodes should use supervised smeshing, some should use remote smeshing, some multi-smeshing (ideally with at least one of those marrying their identities).
This probably should be taken into account:
https://metoro.io/blog/go-production-performance-gotcha-gomaxprocs
TL;DR: GOMAXPROCS
may default to an unreasonable value in case of a containerized workload running on a machine together with a lot of other workloads. Not exactly sure if it is relevant, but probably worth checking
@fasmat It makes sense and it would speed up the tests significantly :+1:, but we need to be careful and avoid creating "subtests" running in parallel on a single network that depend on each other as it would be a nightmare to debug if things go south. I think it's possible to have a test structure like this:
func TestNetwork(t *testing.T) {
// setup a cluster
t.Run("sending transactions", ....) // send transactions and observe accounts changes
t.Run("poet failures", ....) // kills a poet and oberves that ATXs are still produced
.... more
The problem is, that once things stop working because the poet was killed, the other test(s) would most likely start to fail too and telling why (what's the root of the problem) might be difficult.
@ivan4th we have that hardcoded to 4
in https://github.com/spacemeshos/go-spacemesh/blob/cc65f271f3e21b519532f2924ab227ccf352c2de/systest/cluster/nodes.go#L914
Also regarding GOMAXPROCS
: I changed it to the suggestion in the blog @ivan4th linked, but didn't have a feeling it changed anything in terms of stability - multiple runs still were throttled, so I changed it back to the hard coded 4.
Description
At the moment systests contain a lot of spagetti code which seems to be time to time extended with another piece glued into it. In result: