paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

bitfield-distribution: subsystem queue seems to get full #5657

Closed sandreim closed 6 days ago

sandreim commented 2 weeks ago

On Kusama we can observe that the channel gets full every once in a while, leading to a brief stall of the network bridge. This has been happening since we increased the number of validators, which increased the number of bitfield messages in the network.

The bitfield gossip is bursty, as all nodes set up a 1.5s timer when they import a block. When the timer expires, they all send out their bitfields to the other validators.
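The synchronized timer described above can be sketched roughly like this (a minimal illustration, not the actual subsystem code; the constant name is hypothetical):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: every validator arms the same fixed delay on block
// import. With no jitter, all ~500 signed bitfields are sent at essentially
// the same instant, producing a synchronized burst at the receivers.
const BITFIELD_SIGNING_DELAY: Duration = Duration::from_millis(1500);

// Same formula on every node => every validator fires at import + 1.5s.
fn send_time(block_import: Instant) -> Instant {
    block_import + BITFIELD_SIGNING_DELAY
}

fn main() {
    let import = Instant::now();
    let t = send_time(import);
    assert_eq!(t - import, BITFIELD_SIGNING_DELAY);
}
```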

We need to investigate this further and see if it is a potential problem when we scale up to 1k validators. We might want to optimize this a bit, or maybe a larger subsystem channel size is enough to absorb these bursts.

(Screenshot attached, taken 2024-09-10.)
alexggh commented 1 week ago

I did some investigations into bitfield-distribution being clogged sometimes, and all data points to the throughput of the system being, on average, more than sufficient; this is backed by multiple data sources.

This leads me to think that these rare occasions of the subsystem being clogged are just bursts of messages that happen because all validators decide to send their bitfields at the same time.

Doing some math on the total number of messages, it looks like this:

  1. We have 500 validators, so there are 500 unique bitfield messages.
  2. Each unique message can be received by a node up to 6 times (2 times from the X and Y grid neighbours, and 4 times because any message is also gossiped randomly to 4 peers).
  3. Hence a node can receive 3000 (500 * 6) bitfield messages per relay chain block.
  4. Now, the clogging seems to be correlated with relay-chain forks. I see cases on Kusama where we have 3- or 4-way forks; in that case a node can receive up to 3000 * 4 = 12_000 bitfield messages, all arriving around the same time. The messages get processed really fast: we don't see them gathering, and the time of flight for all messages is almost always below 100ms, with most of them below 100 microseconds.
  5. Bitfield distribution uses the default message_capacity=2048, so I think that's why the queue gets full when we have these bursts of messages caused by relay-chain forks. Important to note here is that this happens very rarely, around ~4 times a day.
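The arithmetic above can be checked with a quick back-of-the-envelope calculation (the constants are the figures from this discussion, not values read from the code base):

```rust
// Back-of-the-envelope check of the burst math from the list above.
fn main() {
    let validators = 500; // 500 unique bitfield messages per relay-chain block

    // Each unique bitfield can reach a node up to 6 times:
    // 2 via the X/Y grid neighbours + 4 via random gossip to 4 peers.
    let copies_per_bitfield = 2 + 4;
    let per_block = validators * copies_per_bitfield;
    assert_eq!(per_block, 3_000);

    // With a 4-way relay-chain fork, the bursts stack up.
    let forks = 4;
    let burst = per_block * forks;
    assert_eq!(burst, 12_000);

    // The default subsystem queue holds far fewer messages than the burst.
    let message_capacity = 2_048;
    assert!(burst > message_capacity);
    println!("burst of {burst} messages vs queue capacity of {message_capacity}");
}
```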

Clogging on the subsystem queue, even briefly, is really bad because it blocks the sender, and in this case the sender is network-bridge-rx, which dispatches communication for all the other subsystems. We want to avoid it entirely, or at least minimize it, and for that we have two low-hanging fruits:

  1. Increase the message_capacity; I propose setting it to 8192. The only downside is that we slightly increase the memory footprint when the queue is at max capacity: our messages are around 1 KiB each, so we would go from a theoretical max of 2 MiB for this subsystem queue to 8 MiB. I think that's a perfectly acceptable trade-off because production nodes are supposed to run with at least 32 GiB of RAM, so this is really negligible.

  2. Make the subsystem run on a blocking task. This would have two benefits. First, it should make the subsystem quicker to react because it gets its own thread rather than sharing the task pool with everyone else. Secondly, the subsystem does some signature checking here: https://github.com/paritytech/polkadot-sdk/blob/86bb5cb5068463f006fda3a4ac4236686c989b86/polkadot/node/network/bitfield-distribution/src/lib.rs#L677, which is a CPU-intensive task, and running it on the blocking pool is the recommended way to reduce the impact on the other tasks in the tokio pool.
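The memory-footprint estimate from point 1 can be verified with a quick calculation (the ~1 KiB message size is the approximate figure quoted above, not a measured value):

```rust
// Rough memory estimate for raising message_capacity from 2048 to 8192,
// assuming ~1 KiB per queued bitfield message (figure from the discussion).
fn main() {
    let msg_size_bytes: usize = 1_024; // ~1 KiB per message (assumption)
    let mib: usize = 1_024 * 1_024;

    let old_cap: usize = 2_048;
    let new_cap: usize = 8_192;

    // Theoretical worst case: queue completely full.
    assert_eq!(old_cap * msg_size_bytes / mib, 2); // ~2 MiB today
    assert_eq!(new_cap * msg_size_bytes / mib, 8); // ~8 MiB after the change

    // An extra ~6 MiB is negligible against a 32 GiB production node.
    println!("worst-case queue memory: 2 MiB -> 8 MiB");
}
```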

Proposed fix: https://github.com/paritytech/polkadot-sdk/pull/5787