Block verification checks for pre-finalized/conflicting-with-finalized blocks are flaky

michaelsproul commented 1 day ago

Description

Similar to other database crash safety bugs, this one relates to a bad assumption about state after an unclean shutdown. Specifically the pre_finalization_cache assumes that a block present on disk but not in fork choice must be a finalized block that has been pruned from fork choice:

https://github.com/sigp/lighthouse/blob/6329042628ea0afcbcbce3874284c78ba9aa41a7/beacon_node/beacon_chain/src/pre_finalization_cache.rs#L75-L79

This is false in the case where:

A block B arrives and is processed & written to disk.
The node is uncleanly killed (sigkill, OOM, power outage) before the fork choice data structure containing B is written to disk.

The result is that after a restart B is on disk but not in fork choice. This is OK from a DB invariant PoV, because the invariant we maintain is (intentionally) one directional:

block in fork_choice --> block in database

The pre-finalization cache is making an additional bad assumption that the block must only be on disk because it was fully imported and then pruned from fork choice.

A similar assumption is made here:

https://github.com/sigp/lighthouse/blob/6329042628ea0afcbcbce3874284c78ba9aa41a7/beacon_node/beacon_chain/src/block_verification.rs#L1765-L1781

It's not true that the parent must be pre-finalization or conflicting with finalization if it is present on disk. It could be the same as above where the parent was imported fully and written to disk, but forgotten from fork choice.

Version

Lighthouse v5.3.0

Steps to resolve

Rather than making assumptions about blocks based on their presence on disk, we could use a weaker and more defensive check to catch blocks that are definitely finalized or conflicting with finalization.

One such check for the pre-finalization cache would be:

If the block_root at the block's slot on the canonical chain equals the block root of the block, then it is finalized.

This should just involve a single database lookup from the freezer database in most cases, although it could be slow during periods of non-finality when we must iterate backwards across multiple beacon states. This check has the downside of not catching blocks from side chains which have been pruned due to finality. Let's enumerate all the cases so we don't miss any.

Case A: block is canonical and finalized

This is the "easy" case. We know that a block is finalized if its block root matches the block root from the canonical chain at that slot. We can reject these blocks if they are received on gossip or RPC, because we have already imported them.

Case B: parent is canonical and finalized

If a block's parent is a canonical block from the finalized chain (like case A), and is not the finalized block i.e. the most recently finalized block, then it is in conflict with finalization and should be rejected. This case could be checked in a similar way to (A) by using the canonical block roots iterator. However it is potentially inefficient, because we do not know how much older

Case C: multi-block sidechain starting prior to finalization

Case B is a special case of a sidechain starting from a pre-finalization block. In the case of B we have some chance of checking quickly because we can look up the parent. However if the parent is not known because the sidechain consisted of multiple blocks (and the parent has since been pruned), then there's no check we can do unless we download the sidechain again and reject it (from sync -- see alternative 1), or remember all rejected & pruned block roots (see alternative 2).

Alternative 1: use `sync`

It feels like we're duplicating some logic from sync here. Perhaps we could offload this to sync and let it blacklist old side chains which fail to process.

Alternative 2: remember block roots of pruned/rejected blocks on disk and in memory

A different approach would be to store the block roots of all pruned & rejected blocks on disk. This would allow us to quickly reject any block which has previously been rejected, or builds on a rejected block, without having do complicated inference based on the canonical chain.

michaelsproul commented 1 day ago

cc @dapplion this may interest you

michaelsproul commented 1 day ago

thanks to @pawanjay176 for flagging this bug too!

sigp / lighthouse