sandreim opened 1 year ago
@sandreim Deployment is being coordinated here: https://github.com/paritytech/devops/issues/2567
Testing scenario 1

Observations

[chart: `reconstructed_data_matches_root`]

Note that we don't really time `reconstructed_data_matches_root` separately; it is part of a heavy call that is used on all paths - https://github.com/paritytech/polkadot/pull/7409 (`reconstructed_data_matches_root` is called as part of it).
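Since that means the reconstruction cost and the root check show up as one number, here is a minimal instrumentation sketch (a hypothetical helper, not code from the PR) that would split them into separate measurements:

```rust
use std::time::Instant;

// Hypothetical helper: run reconstruction and the root check under separate
// timers so the two costs can be reported as distinct measurements.
fn timed_recovery<T>(
    reconstruct: impl FnOnce() -> T,
    matches_root: impl FnOnce(&T) -> bool,
) -> (T, bool) {
    let start = Instant::now();
    let data = reconstruct();
    let reconstruct_time = start.elapsed();

    let start = Instant::now();
    let ok = matches_root(&data);
    let check_time = start.elapsed();

    // In a real deployment these would feed Prometheus histograms
    // rather than stderr.
    eprintln!("reconstruct: {reconstruct_time:?}, root check: {check_time:?}");
    (data, ok)
}
```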
Initial conclusions
What hardware are these validators running on?
That also plays into how much time the glutton consumes, since it calibrates itself against something like an i7-7700K.
Looking at https://github.com/paritytech/polkadot/pull/7409:
There is a reconstruct benchmark (`cargo bench -p polkadot-erasure-coding`) that reports 74 ms for 200 validators on an i9-13900K when the proof is changed to 2.5 MiB.
Maybe you can try that on the validator hardware to compare?
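For a back-of-the-envelope view of what that benchmark exercises, a small sketch of the chunk math at 200 validators and a 2.5 MiB proof (my own helper, assuming the usual f = (n - 1) / 3 fault tolerance; not the crate's code):

```rust
/// Sketch of the availability chunk math (assumption: the erasure code
/// tolerates f = (n - 1) / 3 faulty validators, so f + 1 chunks suffice
/// to reconstruct the data).
fn recovery_threshold(n_validators: usize) -> usize {
    (n_validators - 1) / 3 + 1
}

fn main() {
    let n_validators = 200;
    let needed = recovery_threshold(n_validators); // 67 chunks
    let proof_size = 5 * 1024 * 1024 / 2; // 2.5 MiB

    // Each chunk is roughly proof_size / needed (plus Reed-Solomon
    // padding), i.e. ~38 KiB per chunk here.
    println!(
        "{needed} of {n_validators} chunks needed, ~{} bytes each",
        proof_size / needed
    );
}
```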
PS: Just saw https://github.com/paritytech/reed-solomon-novelpoly/pull/2, no idea if that has any chance of landing.
We are using https://cloud.google.com/compute/docs/general-purpose-machines#n2d_machines AFAIK. CC @PierreBesson to confirm. We try to use the reference hardware recommended at https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware .
What hardware are we running the weights benchmarks on these days?
It was still done on the old reference hardware (i7-7700K), but we recently updated to the recommendation from the Wiki: https://github.com/paritytech/polkadot/pull/7342. For Cumulus the same change is in the pipeline: https://github.com/paritytech/cumulus/pull/2712.
The CPU name is always at the top of the weight files to give a rough indication (old vs new):
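For illustration, a made-up header of the kind the benchmark CLI generates (hostname, date, and values here are hypothetical; only the `CPU:` field matters for the old-vs-new comparison):

```rust
//! Autogenerated weights for `pallet_balances`
//!
//! DATE: 2023-06-19, STEPS: `50`, REPEAT: `20`
//! HOSTNAME: `bm2`, CPU: `Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz`
```

After the reference-hardware switch, the `CPU:` line shows the new machine instead.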
So the consumption should be closer once the Cumulus PR is merged. You could also set the CPU burn to 0% to just measure the overhead (don't know if that applies in this case); a sketch of what that could look like is below.
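A hedged sketch of dialing Glutton down to pure overhead. The call names follow `pallet-glutton`'s admin extrinsics as I remember them, and the parameter type has changed between versions (Perbill early on, FixedU64 later), so treat all of this as an assumption and check the deployed runtime:

```rust
use sp_runtime::Perbill;

// Hypothetical sketch: zero out Glutton's load so only the framework
// overhead remains. In practice these are root-origin extrinsics
// (e.g. submitted via polkadot-js); the exact parameter type depends
// on the pallet version.
fn glutton_idle_params() -> (Perbill, Perbill) {
    let compute = Perbill::from_percent(0); // burn 0% CPU per block
    let storage = Perbill::from_percent(0); // inflate 0% storage/PoV
    // Glutton::set_compute(RuntimeOrigin::root(), compute)
    // Glutton::set_storage(RuntimeOrigin::root(), storage)
    (compute, storage)
}
```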
30 vrf modulo samples
Is this meant to come from the computation we discussed in
We should keep https://github.com/paritytech/polkadot-sdk/issues/640 somewhat in mind here:

relayVrfModuloSamples = E * num_cores / num_validators = 100 (needed approvals) * 40 (parachains) / 200 (validators) = 20 < 30 vrf modulo samples
That's a fine deviation. Also, I guess needed approvals is set so high because we want to account for some misbehavior. Yet, we also care about the case of maybe 120 parachains with 30 needed approvals; the sketch below works out both.
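Spelling the sizing rule out for both setups (a toy helper for the arithmetic above, not runtime logic):

```rust
// Toy helper for the sizing rule quoted above: each validator takes about
// E * num_cores / num_validators tranche-0 VRF modulo samples so that
// every core collects roughly E approvals in tranche 0.
fn relay_vrf_modulo_samples(needed_approvals: u32, num_cores: u32, num_validators: u32) -> u32 {
    needed_approvals * num_cores / num_validators
}

fn main() {
    // Current test setup: 100 needed approvals, 40 parachains, 200 validators.
    assert_eq!(relay_vrf_modulo_samples(100, 40, 200), 20); // < 30 configured
    // Alternative sizing: 30 needed approvals, 120 parachains, 200 validators.
    assert_eq!(relay_vrf_modulo_samples(30, 120, 200), 18);
}
```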
Or maybe this type of configuration is a nice way to check the load of just the approvals system without running as many collators?
Yes, the purpose is to generate more load with fewer validators and parachains.
After https://github.com/paritytech/parachains-utils/pull/1 is merged we are ready to deploy Glutton on all cores on Versi. It will be the first time we collect data from Versi load testing with big parachain blocks that burn CPU and cause high network I/O.
As a testing strategy, we will be doing 3 types of tests:
Testing environment: 300 parachain validators and 50 parachains. If it doesn't break and we would benefit from collecting data at a higher scale, we might dial the numbers up to 500 validators and 70 parachains.
For each strategy we should follow an incremental approach, which will allow us to observe metrics and logs.