Block production/forking issues caused by malicious/not synced peers

nazar-pc commented 2 months ago

Long time ago we had an issue with forking on our permissionless network that resulted in us carrying this Substrate patch for a few years now: https://github.com/autonomys/polkadot-sdk/commit/9b09c04911c1df177da675c74ee840edf38ece43

I'd like to finally upstream a version of that change in some form. We were using that patch since August 2022 on various test networks with thousands of consensus nodes. It works and does what it is supposed to.

I tried to convince Santiago and other folks from builders program that this is an issue, but it didn't go anywhere back then.

What we observed is that when many thousands of new consensus nodes joined a relatively small network within very short period of time (hours), many of them started forking despite not being synced fully. Turned out the reason is that their peers were not synced either, so as far as they are concerned, they are at the tip of the chain they can observe, even though it is not the best chain in absolute terms.

Since in contrast to Proof-of-Stake Polkadot our network is fully permissionless, it affects us to a larger degree. It still affects Proof-of-Stake chains in a sense that this can disrupt syncing, but at least it will not result in forking most of the time due to permissioned nature of the consensus.

The solution I came up with (in above mentioned patch) was for nodes to voluntarily announce to each other if they think they are synced, such that their peers can use that information to improve decisions about sync target, ignoring those that are not synced yet. This results in a bit more deliberate bootstrapping of the new network where first node needs to be force-synced (hence added CLI option) and then at least 2 nodes on the network need to maintain connectivity at any time in order for them to remain "synced" and for block production to happen on the network.

BTW, I have never got to writing PoC, but I think the opposite extreme might be problematic even for Polkadot/Proof-of-Stake chains, where large number of "fake" peers pretending to be at a higher block level could force node into major syncing and disrupt block production that way for a period of time until nodes discover that those peers are not actually presenting correct tip. The effect from this will depend on request/response timeouts and such, but likely is a real test vector to which I do not have an easy solution.

bkchr commented 2 weeks ago

Hey sorry for the late reply. At some point I already started to write an answer to this issue, but I forgot about it.

I also think that the issue is real and should be fixed. However, maybe we can make the block production a little bit smarter. Like we should check the timestamp of the best block and if that is too much behind the current time, the node should not try to produce a block. This way we could entirely stop relying on what we think the network state is. For sure we should introduce some kind of fallback cli flag to force block production whatever the best block is. This would help when there was a bug and requires to hardfork the network or whatever.

Probably this "did we imported a recent block" recently could be used everywhere to replace the is_major_syncing check. The time window to determine "recent block" could be derived from the slot time, like 10-20 times of the slot time. If a node starts to produce like a wasted block 2 mins behind the best chain, I think it is fine and should only happen on sync.

nazar-pc commented 2 weeks ago

Like we should check the timestamp of the best block and if that is too much behind the current time, the node should not try to produce a block.

Feels like a hack rather than a proper solution, but it should work.

What I don't particularly like about this is that it assumes there is a way to derive time from a block. While both BABE and Subspace have it and many other consensus engines too, I'm not sure this is a universal enough concept to make it a hard requirement.

That said the fact that it doesn't require any network-level changes is appealing.

We had to launch mainnet with a fork that has is_synced, could have saved ourself some work in the future if this idea surfaced earlier :melting_face:

For sure we should introduce some kind of fallback cli flag to force block production whatever the best block is.

We already have --force-synced in addition to --force-authoring on our end for network bootstrapping purposes.

bkchr commented 2 weeks ago

I'm not sure this is a universal enough concept to make it a hard requirement.

We don't need to make it a hard requirement for the entire of Substrate, but for BABE/AURA etc. Also Substrate has other assumptions already backed in.

Do you have any other ideas that would make it less hacked? Depending on the networking is IMO not a good solution because we can never really trust it and we will see similar issues.

burdges commented 1 week ago

We know protocols like ouroboros chronos that estimate some network time from others recent block times, btw.

What we observed is that when many thousands of new consensus nodes joined a relatively small network within very short period of time (hours), many of them started forking despite not being synced fully.

At least for validators, we'll eventually more this on-chain:

insert randomness into the back certificates we sign during startup,
insert our fresh back certificates on-chain,
delay activating protocols like babe/sassafras and grandpa until out back certificates get finalized,
shut down if some new unfamilar back certificate gets registered for our controller, which avoids equivocation slashes.

nazar-pc commented 1 week ago

Do you have any other ideas that would make it less hacked? Depending on the networking is IMO not a good solution because we can never really trust it and we will see similar issues.

I wouldn't call it networking, this is part of the sync mechanism. Just like nodes announce to each other what their tip is so node can decide whether there is a better tip to sync to, they can also indirectly announce that they see a higher tip, but but didn't verify/import it yet, hence they are still syncing themselves and node can ignore their tip because it isn't yet accurate anyway.

At least for validators, we'll eventually more this on-chain:

I have no idea what back certificates are and we don't have Proof-of-Stake in our protocol either, so I don't think it would be applicable anyway.

bkchr commented 1 week ago

Just like nodes announce to each other what their tip is so node can decide whether there is a better tip to sync to, they can also indirectly announce that they see a higher tip, but but didn't verify/import it yet, hence they are still syncing themselves and node can ignore their tip because it isn't yet accurate anyway.

Okay sounds like a good idea as well.

paritytech / polkadot-sdk

Block production/forking issues caused by malicious/not synced peers #5624