paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Zombienet: chaos testing [main issue] #792

Open sandreim opened 1 year ago

sandreim commented 1 year ago

The Vision

Testing Polkadot changes and releases before they go live is a multi-faceted challenge that we address with a range of approaches, both automated and manual. The vision is to be able to test in an environment that is as close as possible to the Kusama/Polkadot mainnets. We've built and are continuously improving core building blocks like ZombieNet as a means to get there. This is the first step that shifts our integration testing in this direction, but with reduced scope and scale. This milestone will make it possible to run long-duration integration tests based on existing test cases, but also include closer to real-world parametrisation of the environment:

The Plan

Current status of integration testing

We currently cover functionality and small-scale testing (up to 10 nodes) with a Zombienet test suite. While we continuously add more tests to it, it is all limited to lab networking conditions, meaning close to zero latency, packet reordering and packet loss. While this is good from a basic functionality testing perspective, it doesn't cover the edge cases or race conditions that are usually hit in real-world scenarios on the Kusama/Polkadot networks.

Moving towards real production environment testing

Ideally we should be able to make Versi behave similarly to Kusama in terms of latency and behaviour but this would interfere with other types of testing we do as part of development on a regular basis. As an incremental improvement and reasonable compromise we should experiment with Zombienet based chaos testing. This touches a bit on negative testing and needs to tackle the following scenarios:

CI pipelines

Running this on every PR is not a good idea, as we want the PR pipeline to stay short and smooth, allowing for fast turnaround times during development. Instead, we need to create a separate pipeline that takes a predefined mix of node versions and network topology, runs more or less the same test suite as the PR pipeline, and measures the KPI metrics we currently follow as part of our monitoring and alerting.

Some key indicators of the network health:

Addressing the Zombienet scalability limits

We know that the current architecture of Zombienet supports, out of the box, around 100 nodes in total (validators + collators) for a single network spawn and test run in k8s. With the end goal of 1,000 validators (1kV) and 40+ parachains in mind, we can implement some cheap changes that allow us to use multiple Zombienet test instances that spawn nodes to join the same network (via shared bootnodes). For assertions, we need to change the way Zombienet checks metrics/logs so that it no longer scrapes the nodes individually, but instead calls the Prometheus/Loki APIs.

With these changes we should be able to scale to hundreds of nodes and tens of parachains.
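To illustrate the assertion change described above, here is a minimal sketch in Python of checking a metric for all nodes through a single Prometheus instant query instead of scraping each node. It is not Zombienet code. The base URL, the `instance` label, the sample response and the helper names are assumptions made for illustration, and the metric `substrate_block_height{status="finalized"}` is assumed to be exported by the nodes. Only the `/api/v1/query` endpoint and its JSON response shape come from the Prometheus HTTP API.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_instant_query_url(prometheus_base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    base = prometheus_base_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def nodes_below_threshold(response_body: str, threshold: float,
                          label: str = "instance") -> List[str]:
    """Return the label values of every series whose sample is below threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    failing = []
    for series in data["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = series["value"]
        if float(raw_value) < threshold:
            failing.append(series["metric"].get(label, "<unknown>"))
    return sorted(failing)


# Hypothetical response for three nodes; a real one would be fetched over HTTP.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "validator-0"}, "value": [1700000000, "120"]},
            {"metric": {"instance": "validator-1"}, "value": [1700000000, "119"]},
            {"metric": {"instance": "validator-2"}, "value": [1700000000, "42"]},
        ],
    },
})

# Hypothetical in-cluster Prometheus address.
QUERY_URL = build_instant_query_url(
    "http://prometheus.example:9090/",
    'substrate_block_height{status="finalized"}',
)
LAGGING_NODES = nodes_below_threshold(SAMPLE_RESPONSE, threshold=100)
```

With this shape, one query covers every node that Prometheus already scrapes, so the cost of an assertion no longer grows with the number of nodes each Zombienet instance has to contact directly.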

TODO: cut and link granular issues for Zombienet changes, infrastructure and the actual tests.

Open Questions

Currently the discussion is in the initial phase on the tracking issue, but some open questions that stand out are:

Project tracking board

https://github.com/orgs/paritytech/projects/73/views/1

sandreim commented 1 year ago

CC @pepoviola