paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Block production/forking issues caused by malicious/not synced peers #5624

Open nazar-pc opened 3 weeks ago

nazar-pc commented 3 weeks ago

A long time ago we had an issue with forking on our permissionless network, which resulted in us carrying this Substrate patch for a few years now: https://github.com/autonomys/polkadot-sdk/commit/9b09c04911c1df177da675c74ee840edf38ece43

I'd like to finally upstream a version of that change in some form. We have been using that patch since August 2022 on various test networks with thousands of consensus nodes. It works and does what it is supposed to do.

I tried to convince Santiago and other folks from the builders program that this is an issue, but it didn't go anywhere back then.


What we observed is that when many thousands of new consensus nodes joined a relatively small network within a very short period of time (hours), many of them started forking despite not being fully synced. It turned out that their peers were not synced either, so from each node's point of view it was at the tip of the chain it could observe, even though that was not the best chain in absolute terms.

Because our network is fully permissionless, in contrast to Proof-of-Stake Polkadot, this affects us to a larger degree. It still affects Proof-of-Stake chains in the sense that it can disrupt syncing, but at least it will not result in forking most of the time due to the permissioned nature of the consensus.

The solution I came up with (in the above-mentioned patch) was for nodes to voluntarily announce to each other whether they think they are synced, so that their peers can use that information to make better decisions about the sync target, ignoring peers that are not synced yet. This results in a somewhat more deliberate bootstrapping of a new network: the first node needs to be force-synced (hence the added CLI option), and then at least 2 nodes on the network need to maintain connectivity at all times in order for them to remain "synced" and for block production to happen on the network.
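
For illustration only, here is a minimal Rust sketch of the idea described above. It is not the contents of the linked patch and does not use real Substrate types: `PeerInfo`, `sync_target`, `may_produce_blocks` and the `force_synced` parameter are all hypothetical names chosen for this example. The assumption is that each peer reports its best block number together with a self-declared "synced" flag, and that only peers claiming to be synced count towards the sync target.

```rust
/// Hypothetical view of a connected peer (not an actual Substrate type).
#[derive(Debug, Clone, Copy)]
struct PeerInfo {
    /// Best block number the peer has announced.
    best_number: u64,
    /// Whether the peer voluntarily claims to be synced.
    is_synced: bool,
}

/// Pick a sync target, ignoring peers that do not claim to be synced.
/// Returns `None` when no connected peer claims to be synced.
fn sync_target(peers: &[PeerInfo]) -> Option<u64> {
    peers
        .iter()
        .filter(|p| p.is_synced)
        .map(|p| p.best_number)
        .max()
}

/// Decide whether the local node should consider itself synced and
/// therefore be allowed to produce blocks.
///
/// `force_synced` models the CLI override needed to bootstrap the very
/// first node of a new network (assumed semantics).
fn may_produce_blocks(local_best: u64, force_synced: bool, peers: &[PeerInfo]) -> bool {
    if force_synced {
        return true;
    }
    match sync_target(peers) {
        // At least one synced peer exists: we must have caught up with it.
        Some(target) => local_best >= target,
        // No synced peers at all: we cannot tell, so stay "not synced".
        None => false,
    }
}
```

In this sketch a node surrounded only by unsynced peers never treats itself as being at the tip, which is the behavior the forking scenario calls for, and a node with no synced peers stays "not synced" unless it was force-synced, which matches the requirement that at least 2 nodes stay connected.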


BTW, I never got around to writing a PoC, but I think the opposite extreme might be problematic even for Polkadot/Proof-of-Stake chains: a large number of "fake" peers pretending to be at a higher block height could force a node into major syncing and disrupt block production for a period of time, until nodes discover that those peers are not actually presenting the correct tip. The effect will depend on request/response timeouts and such, but it is likely a real attack vector to which I do not have an easy solution.
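
To make the concern concrete, here is a toy Rust model, not a PoC against a real node. It assumes a node decides it is "major syncing" when the median of the best block numbers announced by its peers is more than some threshold ahead of its own best block, and that announcements are taken at face value until proven wrong. The function names and the threshold value are hypothetical; how a real implementation behaves depends on its actual sync strategy and timeouts.

```rust
/// Median of the best block numbers announced by peers (unverified claims).
fn median_announced(best_numbers: &[u64]) -> Option<u64> {
    if best_numbers.is_empty() {
        return None;
    }
    let mut sorted = best_numbers.to_vec();
    sorted.sort_unstable();
    Some(sorted[sorted.len() / 2])
}

/// Hypothetical "major syncing" check: true when peers appear to be
/// more than `threshold` blocks ahead of the local best block.
fn is_major_syncing(local_best: u64, announced: &[u64], threshold: u64) -> bool {
    match median_announced(announced) {
        Some(median) => median > local_best.saturating_add(threshold),
        None => false,
    }
}
```

With three honest peers announcing `[1000, 1001, 1000]` and a local best of `1000`, this check reports no major sync. Adding four peers that each claim block `1_000_000` moves the median to the fake value, so the same node would switch to major syncing and pause block production until the claims are disproven.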