Open sandreim opened 2 years ago
I think we need to split this up into a PR pipeline and a release pipeline. All open items should point to issues that add the additional context required for implementation. Percentage-based logs are a nice-to-have; since those tests should be rather deterministic, this is a bit of a longer shot, but it goes hand in hand with scaling up the number of validators and grouping.
Hi @sandreim, thanks for the feedback. I think there are several things to work on in this issue, but the priority is to add support for scaling the network easily, right? The validation groups idea sounds great. Let me start working on the syntax for supporting this and we can use that as a starting point.
Thanks!
Yes, validator groups and being able to spin up many validators (not just the limited set we have now) and parachains. Other than that, launching them in parallel would also help iterate faster in development, and would be a good place to start.
We're looking at writing an integration test suite that focuses on performance testing, more specifically on a list of key indicators that are covered in https://github.com/paritytech/polkadot-sdk/issues/874. The current design of Zombienet's configuration and DSL makes it easy to write tests for single-digit-sized networks and provides very explicit primitives for testing metrics and logs (alice: parachain 100 block height is at least 10 within 200 seconds). I'll focus on what I think we need to implement to make writing tests easy for test scenarios at least an order of magnitude larger.

I'm breaking everything down into two parts: test configuration and the DSL.
Test configuration
In the context of higher scale, the goal is to enable the configuration to be defined in bulk, such that we don't need to talk about individual validators and their configuration (binary and args), which is cumbersome for 100 validators for example.
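To make the goal concrete, a bulk definition could look something like the sketch below in the TOML network config. This is illustrative only: the node_groups table, its fields, and the expansion into suffixed node names are assumptions for this proposal, not existing Zombienet syntax.

```toml
[relaychain]
default_command = "polkadot"
chain = "rococo-local"

# Hypothetical bulk definition: one entry that expands into `count`
# validators, all sharing the same binary and args, instead of 100
# individually declared nodes.
[[relaychain.node_groups]]
name = "group1"
count = 100
args = ["-lparachain=debug"]
```

The group name would then also be the target usable in test assertions.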
Where we are at
Each node is defined individually, with its own entry (binary and args) in the relaychain section.
Where we want to be
Nodes can be defined in bulk as named groups; the syntax for targeting those groups in assertions is covered in the DSL section.
Test scenario (DSL)
The goal is to enable writing test assertions that look at groups of validators rather than only one.
Where we are at
Assertions are written in a natural language which includes a node target, a condition and a timeout. Custom logic can be implemented in Javascript (via connect/run callbacks).
Where we want to be
ValidatorGroup1: parachain 100 block height is at least 10 within 300 seconds
ValidatorGroup1(P90): reports polkadot_parachain_disputes_finality_lag is at most 1 within 300 seconds
This example will ensure at least 90% of all validators report a dispute finality lag of at most 1 block. We should avoid making the natural language more complicated by implementing things that can easily be done in Javascript.
Issues and other improvements
I've stumbled upon some issues or missing functionality:
- I define a collator as collator01, but the actual name that I must reference in the test is collator01-1.
- The following assertion still fails: alice: reports polkadot_pvf_preparation_time_bucket{le="1"} is at least 1
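As noted above, anything more complicated than the natural-language assertions can be done in Javascript. Below is a sketch of how the P90 group assertion could be expressed as a custom script; the run(nodeName, networkInfo, args) entry point follows the shape Zombienet uses for js-script assertions, while fetchFinalityLag, the group name, and the metric values are stand-ins for illustration.

```javascript
// Stub: a real script would scrape the node's Prometheus endpoint here.
async function fetchFinalityLag(nodeName) {
  return 0; // pretend every node currently reports a lag of 0 blocks
}

// Entry point in the shape Zombienet uses for custom js-script assertions.
// Succeed (return 1) if at least 90% of the group's nodes report a
// dispute finality lag of at most 1 block, mirroring the proposed
// ValidatorGroup1(P90) syntax.
async function run(nodeName, networkInfo, args) {
  const groupNodes = Object.keys(networkInfo.nodesByName).filter((name) =>
    name.startsWith("ValidatorGroup1")
  );
  const lags = await Promise.all(groupNodes.map(fetchFinalityLag));
  const okCount = lags.filter((lag) => lag <= 1).length;
  return okCount >= Math.ceil(0.9 * groupNodes.length) ? 1 : 0;
}

module.exports = { run };
```

The group-aware part lives entirely in the script, so the DSL itself only needs to know how to invoke it with a timeout.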
CI integration
It doesn't seem to be a good idea to have these tests run as part of the per-PR pipeline, because of the long duration and the high cost of scaling the Kubernetes cluster. My proposal is to run a subset of small-scale variants of the tests in the PR pipeline and run the high-scale tests at release checkpoints or on an as-needed basis.
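That split could be expressed in CI roughly as follows. This is only an illustrative GitLab CI sketch: the job names, test file names, and trigger rules are made up for this example.

```yaml
# Illustrative only: small-scale variant runs on every PR,
# high-scale variant runs on a schedule / at release checkpoints.
zombienet-small-scale:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - zombienet test --provider kubernetes small-scale.zndsl

zombienet-high-scale:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - zombienet test --provider kubernetes high-scale.zndsl
```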
That being said, it looks like a lot of work, and at the same time we want to run these high-scale tests sooner rather than later. My proposal is to build this incrementally, starting with what I consider to be the MVP:
Link to a branch with a sample test and some comments to add more context: TBD.