paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Benchmark network stack CPU usage #5220

Open sandreim opened 2 months ago

sandreim commented 2 months ago

Currently, the only way to catch network-stack performance regressions or to compare litep2p with libp2p is to actually run the nodes in a network and look at the CPU usage metrics of the networking tasks.

Implementing such a network stack performance benchmark (perhaps as part of the subsystem benchmarks) would provide the following benefits:

CC @AndreiEres @dmitry-markin

dmitry-markin commented 6 days ago

We had a discussion with @AndreiEres & @lexnv, and to follow up, I'd like to summarize the networking team's understanding of the issue:

  1. What we would like to have is benchmarking of individual network protocols. The protocols in question are high-level network protocols like Notifications, RequestResponses, and Kademlia. With the libp2p network backend some of them are partially implemented in substrate (like Notifications & RequestResponses); in litep2p they are moved into the networking library itself.

    Understanding the performance of individual protocols will provide more useful information than a generalized performance measurement of the abstract "networking stack" as part of subsystem benchmarks, which is based on a combination of these protocols.

  2. Implementing benchmarks of individual protocols is more straightforward and simpler than implementing a network performance bench based on subsystem benchmarks. For example, for the Notifications & RequestResponses protocols it can be as simple as spawning just two nodes consisting solely of Litep2pNetworkBackend or NetworkWorker (aka "Libp2pNetworkBackend") and sending a stream of notifications or a series of requests with configurable payload sizes (see the sketch after this list). For Kademlia a more complex setup is needed, spawning multiple nodes to form a DHT and executing different queries on it, but that is still simpler than implementing a subsystem bench and tricking it into using the Kademlia protocol heavily.

  3. Measuring the performance of individual protocols will allow a direct comparison of libp2p and litep2p, and an understanding of which part of the networking stack to focus optimization efforts on. By measuring specific protocols we will know which part of the library a regression comes from if we receive a CI alert.
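To make point 2 concrete, here is a minimal criterion-style sketch of the two-node setup; `Backend`, `spawn_node_pair`, and `send_notification` are stubs standing in for the real backend plumbing, not existing sc-network or litep2p APIs:

```rust
// Hypothetical two-node notifications benchmark, criterion-style.
// The types below are stubs for illustration only.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

#[derive(Clone, Copy, Debug)]
enum Backend {
    Libp2p,  // NetworkWorker, aka "Libp2pNetworkBackend"
    Litep2p, // Litep2pNetworkBackend
}

// Stub for "two connected nodes running only the notifications protocol";
// a real benchmark would spawn the selected backend here.
struct Sender {
    backend: Backend,
}

impl Sender {
    fn send_notification(&mut self, payload: Vec<u8>) {
        // A real implementation would push `payload` over the wire and
        // wait until it is delivered; stubbed out in this sketch.
        let _ = (self.backend, payload);
    }
}

fn spawn_node_pair(backend: Backend) -> Sender {
    Sender { backend }
}

fn bench_notifications(c: &mut Criterion) {
    let mut group = c.benchmark_group("notifications");
    for backend in [Backend::Libp2p, Backend::Litep2p] {
        // Configurable payload sizes, per point 2 above.
        for payload_size in [1_024usize, 16 * 1_024, 1_024 * 1_024] {
            group.bench_with_input(
                BenchmarkId::new(format!("{backend:?}"), payload_size),
                &payload_size,
                |b, &size| {
                    let mut sender = spawn_node_pair(backend);
                    b.iter(|| sender.send_notification(vec![0u8; size]));
                },
            );
        }
    }
    group.finish();
}

criterion_group!(benches, bench_notifications);
criterion_main!(benches);
```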

It would be really helpful if you could implement such benchmarks, as we are going to need them anyway.

alexggh commented 6 days ago

Implementing benchmarks of individual protocols is more straightforward and simpler than implementing a network performance bench based on subsystem benchmarks

There is a disadvantage here: currently subsystem-bench is a good tool for estimating the load of a subsystem on real networks. It allows you to configure high-level properties of the network (num_validators, num_cores, num_parachains, num_candidates to validate) and generates the equivalent messages a node would have to process.
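Roughly, those knobs have a shape like the following (field names and defaults here are descriptive stand-ins, not the exact subsystem-bench configuration):

```rust
// Illustrative shape of the high-level network properties mentioned
// above; names and numbers are made up for this sketch.
#[derive(Debug, Clone)]
struct NetworkLoadConfig {
    num_validators: usize, // size of the emulated validator set
    num_cores: usize,      // occupied availability cores
    num_parachains: usize, // parachains producing candidates
    num_candidates: usize, // candidates to validate per round
}

impl Default for NetworkLoadConfig {
    fn default() -> Self {
        // Arbitrary "small network" numbers for a local run.
        Self {
            num_validators: 300,
            num_cores: 60,
            num_parachains: 60,
            num_candidates: 60,
        }
    }
}
```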

Currently, the oversimplified setup is `Messages Generator -> Real overseer and Real orchestra subsystems -> Mocked Network stack`. With this setup you can realistically estimate the usage of the real overseer and real orchestra subsystems under realistic traffic conditions. Extending this setup to:

`Messages Generator -> Real overseer and Real orchestra subsystems -> Real p2p network stack -> Mocked OS networking primitives` allows you to realistically simulate and estimate even more parts of our system running together under stress conditions, so I think that's valuable.

What I understand you are suggesting is implementing a pipeline like `Mocked Traffic -> Real p2p network stack -> Mocked OS networking primitives`.

I think that's a good first step, and it allows for a comparison between the Litep2p and Libp2p backends, but it doesn't cover benchmarking as much of the node as possible running all together.
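One way to read "Mocked OS networking primitives" is an in-memory transport offering the same AsyncRead/AsyncWrite surface as a TCP socket. A minimal sketch with tokio's `duplex` (how it would be injected into the real network backends is assumed, not shown):

```rust
// In-memory stand-in for a TCP socket: tokio::io::duplex yields two
// connected halves implementing AsyncRead + AsyncWrite, so a backend
// wired to accept generic streams could run over it without touching
// the OS networking stack.
use tokio::io::{duplex, AsyncReadExt, AsyncWriteExt};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // 64 KiB of in-memory buffer per direction.
    let (mut node_a, mut node_b) = duplex(64 * 1024);

    // "Node A" writes a payload as it would to a socket.
    node_a.write_all(b"notification payload").await?;

    // "Node B" reads it back; no syscalls or kernel buffers involved.
    let mut buf = vec![0u8; 20];
    node_b.read_exact(&mut buf).await?;
    assert_eq!(&buf, b"notification payload");
    Ok(())
}
```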

My suggestion would be, when we implement this, to also keep in mind the longer pipeline I suggested above, so that we are still able to glue everything together at some point in the future.

sandreim commented 6 days ago

+1 on what @alexggh said above.

3. Measuring the performance of individual protocols will allow a direct comparison of libp2p and litep2p, and an understanding of which part of the networking stack to focus optimization efforts on. By measuring specific protocols we will know which part of the library a regression comes from if we receive a CI alert.

The subsystem benchmarks will report the networking stack usage as part of the tests we already have, so you can compare libp2p and litep2p performance in a more realistic scenario, which is what we aim to do.

lexnv commented 6 days ago

Sounds like a plan! Thanks for the clarifications! 🙏

Indeed, the subsystem benchmark brings a good improvement over the current state.

We can handle the protocol-specific benchmarks as part of the networking team, since we'd need to know whether new changes impact the performance of the lower-level components and to make informed decisions about optimizations.

AndreiEres commented 5 days ago

Sorry for being late to the party, and sorry for the delay in providing the collected information. Here is my understanding of the current situation.

In Polkadot, we are replacing libp2p with litep2p, a more lightweight and efficient alternative. This change should enhance the network stack, but the only way to evaluate its performance is to run the nodes and examine the metrics. We need a tool to measure the performance of the network stack in the parachains protocol. This tool will allow us to compare the two libraries and estimate overall CPU usage. At the same time, we must remember that the network stack consists of various protocols, such as Notifications, RequestResponses, and Kademlia. While the stack based on litep2p may function better, specific protocols may underperform. Therefore, in addition to overall CPU usage, we also need detailed measurements focused on individual protocols.
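As a sketch of the "overall CPU usage" side, one could sample process CPU time around a benchmark run, e.g. with the cpu-time crate; this is a generic illustration, not the node's actual metrics pipeline:

```rust
// Rough CPU-usage measurement around a workload using the `cpu-time`
// crate; the node's real numbers come from substrate's task metrics,
// so this is only an illustration of the idea.
use cpu_time::ProcessTime;
use std::time::Instant;

fn main() {
    let wall_start = Instant::now();
    let cpu_start = ProcessTime::now();

    // Placeholder for the workload under test (e.g. driving the
    // network backend with a stream of notifications).
    let mut acc = 0u64;
    for i in 0..50_000_000u64 {
        acc = acc.wrapping_add(i);
    }
    std::hint::black_box(acc);

    let cpu = cpu_start.elapsed();
    let wall = wall_start.elapsed();
    // CPU seconds per wall second approximates average core usage.
    println!(
        "cpu: {cpu:?}, wall: {wall:?}, utilization: {:.2}",
        cpu.as_secs_f64() / wall.as_secs_f64()
    );
}
```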

Thus, the actual work can be divided into two different tasks that are not connected to each other and can be implemented independently:

  1. Add network stack to subsystem benchmarks to estimate CPU usage at arbitrary scale in the tested part of parachain consensus, track optimization gains, and catch regressions.
  2. Implement benchmarks for individual network protocols to compare libp2p and litep2p, and focus optimization efforts.

Using the testing analogy, the first benchmark is an integration test, while the second consists of several unit tests.

How do you guys find this approach? What should I change or add?