paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

Cache request protocol version in availability-recovery #3127

Closed alindima closed 5 months ago

alindima commented 9 months ago

Prerequisite: https://github.com/paritytech/polkadot-sdk/pull/1644

See https://github.com/paritytech/polkadot-sdk/pull/1644#issuecomment-1916468621 for details of the improvements

alindima commented 5 months ago

There are a couple of ways to implement this:

  1. Record the protocol used for receiving the response from our peer and cache it in the subsystem. This gets quite messy to implement, since we run recoveries in parallel for multiple candidates, so we'd need to maintain and mutate a cache shared between recovery tasks.
  2. Expose the responses of the Identify protocol and record them in the subsystem. These contain the list of protocols our peer supports and are fetched on each new connection. This needs some extra support in substrate.
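To illustrate why option 1 needs shared mutable state, here is a minimal sketch of a per-peer protocol cache that parallel recovery tasks could share. Everything in it is hypothetical and not taken from the polkadot-sdk code: the `ChunkProtocol` enum, the `PeerId` alias (a plain `u64` here) and the `ProtocolCache` type are placeholders, and a real subsystem would likely use async-friendly primitives rather than a blocking `Mutex`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical protocol versions a peer may answer chunk requests with.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ChunkProtocol {
    V1,
    V2,
}

/// Hypothetical peer identifier; the real subsystem has its own type.
pub type PeerId = u64;

/// Cache of the protocol each peer last answered with.
/// Cloning the cache is cheap: all clones point at the same map.
#[derive(Clone, Default)]
pub struct ProtocolCache {
    inner: Arc<Mutex<HashMap<PeerId, ChunkProtocol>>>,
}

impl ProtocolCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Called by a recovery task after it receives a response from `peer`.
    pub fn record(&self, peer: PeerId, protocol: ChunkProtocol) {
        self.inner.lock().unwrap().insert(peer, protocol);
    }

    /// Called before sending a request, to pick the protocol to try first.
    pub fn get(&self, peer: PeerId) -> Option<ChunkProtocol> {
        self.inner.lock().unwrap().get(&peer).copied()
    }
}
```

Each recovery task would hold its own clone of the cache, which is exactly the shared, mutated state that makes this option awkward.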

Now, the caveat of both approaches is that they are an optimisation that's only effective while not all validators are upgraded. Once they're all upgraded, the code will be redundant and would potentially send/record unnecessary events. Moreover, production networks rarely do chunk recovery for now. Most of the time they simply fetch the full data from backers (since most PoVs are less than 128 KiB in compressed size).

In the worst case, with a mixed validator set (half upgraded, half unupgraded), the upgraded nodes will make an extra round-trip when fetching chunks from unupgraded nodes.
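The extra round-trip can be modelled with a small sketch. It assumes the upgraded node tries the newer protocol first and falls back to the older one when the peer does not support it; the `Version` and `Peer` types and the `fetch_chunk` function are made up for illustration and do not correspond to real polkadot-sdk APIs.

```rust
/// Hypothetical request protocol versions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Version {
    V1,
    V2,
}

/// Simulated remote validator.
pub struct Peer {
    pub upgraded: bool,
}

impl Peer {
    /// Simulated request: an unupgraded peer rejects the newer protocol.
    pub fn request(&self, version: Version) -> Result<&'static str, &'static str> {
        match (version, self.upgraded) {
            (Version::V2, false) => Err("unsupported protocol"),
            _ => Ok("chunk"),
        }
    }
}

/// Try the newer protocol first and fall back to the older one.
/// Returns the version that succeeded and the number of round trips used.
pub fn fetch_chunk(peer: &Peer) -> (Version, u32) {
    match peer.request(Version::V2) {
        Ok(_) => (Version::V2, 1),
        Err(_) => {
            // Second round trip: retry with the protocol every peer speaks.
            peer.request(Version::V1)
                .expect("every peer supports V1 in this model");
            (Version::V1, 2)
        }
    }
}
```

In this model an upgraded peer costs one round trip and an unupgraded peer costs two, which is the overhead a cache would have removed.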

I measured this in practice and the cost is negligible considering total POV recovery time.

Measuring this with subsystem-bench (with an extra latency of 100ms for the second request):

[Image: Screenshot 2024-05-22 at 17 12 42]

The first half simulates all nodes making 2 round trips for all chunk requests.

I also measured this on Versi, with 50 validators, 9 glutton parachains and PoVs of 2.5 MiB.

The average PoV recovery time with all unupgraded nodes is 528 ms. The average PoV recovery time with half upgraded and half unupgraded nodes is 674 ms.

As you can see, the largest consumer of recovery time is Reed-Solomon.

Considering all of the above, I'll close this issue and conclude that this small optimisation is not worth implementing.