paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

bitfield-distribution: subsystem queue seems to get full #5657

Closed sandreim closed 6 days ago

sandreim commented 2 weeks ago

On Kusama we can observe that the channel gets full every once in a while, leading to a brief stall of the network bridge. This has been happening since we increased the number of validators, which increased the number of bitfield messages in the network.

The bitfield gossip is bursty, as all nodes set up a 1.5s timer when they import a block. When the timer expires, they all send out their bitfields to the other validators.
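The synchronized timer described above can be sketched roughly like this (a minimal illustration, not the actual subsystem code; the constant name is hypothetical):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: every validator arms the same fixed delay on block
// import. With no jitter, all ~500 signed bitfields are sent at essentially
// the same instant, producing a synchronized burst at the receivers.
const BITFIELD_SIGNING_DELAY: Duration = Duration::from_millis(1500);

// Same formula on every node => every validator fires at import + 1.5s.
fn send_time(block_import: Instant) -> Instant {
    block_import + BITFIELD_SIGNING_DELAY
}

fn main() {
    let import = Instant::now();
    let t = send_time(import);
    assert_eq!(t - import, BITFIELD_SIGNING_DELAY);
}
```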

We need to investigate this further and see if it is a potential problem when we scale up to 1k validators. We might want to optimize this a bit, or maybe a larger subsystem channel size is enough to absorb these bursts.

(Screenshot attached, taken 2024-09-10.)
alexggh commented 1 week ago

I did some investigations into bitfield-distribution being clogged sometimes, and all data points to the throughput of the system being, on average, more than sufficient; this is backed by multiple data sources.

This leads me to think that these rare occasions of the subsystem being clogged are just bursts of messages that happen because all validators decide to send their bitfields at the same time.

Doing some math on the total number of messages, it looks like this:

  1. We have 500 validators, so there are 500 unique bitfield messages.
  2. Each unique message can be received by a node up to 6 times (2 times from the X and Y grid neighbours, and 4 times because any message is also gossiped randomly to 4 peers).
  3. Hence a node can receive 3000 (500 * 6) bitfield messages per relay chain block.
  4. Now, the clogging seems to be correlated with relay-chain forks. I see cases on Kusama where we have 3- or 4-way forks; in that case a node can receive up to 3000 * 4 = 12_000 bitfield messages, all arriving around the same time. The messages get processed really fast: we don't see them gathering, and the time of flight for all messages is almost always below 100ms, with most of them below 100 microseconds.
  5. Bitfield distribution uses the default message_capacity=2048, so I think that's why the queue gets full when we have these bursts of messages caused by relay-chain forks. Important to note here is that this happens very rarely, around ~4 times a day.
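The arithmetic above can be checked with a quick back-of-the-envelope calculation (the constants are the figures from this discussion, not values read from the code base):

```rust
// Back-of-the-envelope check of the burst math from the list above.
fn main() {
    let validators = 500; // 500 unique bitfield messages per relay-chain block

    // Each unique bitfield can reach a node up to 6 times:
    // 2 via the X/Y grid neighbours + 4 via random gossip to 4 peers.
    let copies_per_bitfield = 2 + 4;
    let per_block = validators * copies_per_bitfield;
    assert_eq!(per_block, 3_000);

    // With a 4-way relay-chain fork, the bursts stack up.
    let forks = 4;
    let burst = per_block * forks;
    assert_eq!(burst, 12_000);

    // The default subsystem queue holds far fewer messages than the burst.
    let message_capacity = 2_048;
    assert!(burst > message_capacity);
    println!("burst of {burst} messages vs queue capacity of {message_capacity}");
}
```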

Clogging on the subsystem queue, even briefly, is really bad because it blocks the sender, and in this case the sender is network-bridge-rx, which dispatches communication for all the other subsystems. We want to avoid it entirely, or at least minimize it, and for that we have two low-hanging fruits:

  1. Increase the message_capacity; I propose setting it to 8192. The only downside is that we slightly increase the memory footprint when the queue is at max capacity: our messages are around 1 KiB each, so we would go from a theoretical max of 2 MiB for this subsystem queue to 8 MiB. I think that's a perfectly acceptable trade-off because production nodes are supposed to run with at least 32 GiB of RAM, so this is really negligible.

  2. Make the subsystem run on a blocking task. This would have two benefits. First, it should make the subsystem quicker to react because it gets its own thread rather than sharing the task pool with everyone else. Secondly, the subsystem does some signature checking here: https://github.com/paritytech/polkadot-sdk/blob/86bb5cb5068463f006fda3a4ac4236686c989b86/polkadot/node/network/bitfield-distribution/src/lib.rs#L677, which is a CPU-intensive task, and running it on the blocking pool is the recommended way to reduce the impact on the other tasks in the tokio pool.
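The memory-footprint estimate from point 1 can be verified with a quick calculation (the ~1 KiB message size is the approximate figure quoted above, not a measured value):

```rust
// Rough memory estimate for raising message_capacity from 2048 to 8192,
// assuming ~1 KiB per queued bitfield message (figure from the discussion).
fn main() {
    let msg_size_bytes: usize = 1_024; // ~1 KiB per message (assumption)
    let mib: usize = 1_024 * 1_024;

    let old_cap: usize = 2_048;
    let new_cap: usize = 8_192;

    // Theoretical worst case: queue completely full.
    assert_eq!(old_cap * msg_size_bytes / mib, 2); // ~2 MiB today
    assert_eq!(new_cap * msg_size_bytes / mib, 8); // ~8 MiB after the change

    // An extra ~6 MiB is negligible against a 32 GiB production node.
    println!("worst-case queue memory: 2 MiB -> 8 MiB");
}
```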

Proposed fix: https://github.com/paritytech/polkadot-sdk/pull/5787