refactor: epoch change sync to checkpoint

stringhandler commented 4 months ago

Problem

When a registered validator joins the network for the first time, there is a high bandwidth and processing cost which worsens as the network progresses.

Currently, this includes:

A. Syncing the whole state of the shard, including DOWN substates.
B. Syncing all historical transactions for the shard(s).
C. Syncing the entire block history/chain of one or more shards.

(A) is unavoidable and should be optimised to reduce time and bandwidth costs.

(B) Should not be necessary at all. The main reason it is there is to prevent duplicate transactions from being processed. There may also be some code that expects the transaction to be available (e.g. in web UIs). However, duplicate transactions are already prevented by the TransactionReceipt substate. A validator node may want to generate an index of these receipts as it syncs, so that a duplicate transaction can be rejected early instead of failing in the commit phase. Some archival nodes may want to track and store historical transactions, but these are separate concerns and consensus/block producer nodes would generally not do this.
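As a rough illustration of the receipt index mentioned in (B), here is a minimal sketch. `TransactionId` and `ReceiptIndex` are hypothetical names chosen for this example, not types from the codebase, and the index is a plain in-memory set rather than anything persisted.

```rust
use std::collections::HashSet;

/// Hypothetical 32-byte transaction identifier (an assumption for this sketch).
type TransactionId = [u8; 32];

/// In-memory index of transaction ids for which a TransactionReceipt
/// substate was encountered during state sync.
#[derive(Default)]
struct ReceiptIndex {
    seen: HashSet<TransactionId>,
}

impl ReceiptIndex {
    /// Record a receipt found while syncing the shard state.
    fn record(&mut self, id: TransactionId) {
        self.seen.insert(id);
    }

    /// Returns true if a receipt already exists, so the transaction can be
    /// rejected before it reaches the commit phase.
    fn is_duplicate(&self, id: &TransactionId) -> bool {
        self.seen.contains(id)
    }
}
```

The receipt substate stays the source of truth; an index like this would only be a fast pre-check.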

(C) This requires more thought. It is important that a new validator node knows that it has the complete and agreed shard state at or close to the end of the previous epoch. Beyond that, knowing which transactions were historically processed in which block by whom is not necessary to proceed with consensus in a new epoch.

Proposal 1

Start a new chain for each new epoch and verify a checkpoint proof.

For a validator to join a shard for an epoch, they request a succinct proof that asserts that two thirds of the validator set for the shard-epoch have committed a given state.
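The two-thirds condition could be checked along these lines. This is a sketch only: `PublicKey`, `count_valid_signers` and `has_two_thirds` are hypothetical names, signature verification is omitted, and the threshold is read as "at least two thirds", which is an assumption since the exact rule (e.g. 2f+1) is not stated here.

```rust
use std::collections::HashSet;

/// Hypothetical 32-byte validator public key (an assumption for this sketch).
type PublicKey = [u8; 32];

/// Count distinct signers that are members of the shard-epoch committee.
/// Duplicate and unknown signers are ignored. Signatures are assumed to
/// have been verified elsewhere.
fn count_valid_signers(committee: &HashSet<PublicKey>, signers: &[PublicKey]) -> usize {
    let distinct: HashSet<&PublicKey> = signers
        .iter()
        .filter(|key| committee.contains(*key))
        .collect();
    distinct.len()
}

/// Returns true when `signers` is at least two thirds of `committee_size`.
/// Integer arithmetic avoids floating point rounding.
fn has_two_thirds(signers: usize, committee_size: usize) -> bool {
    committee_size > 0 && signers * 3 >= committee_size * 2
}
```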

To achieve this the following is done:

1. During the previous epoch, contact one or more validators in the shard-epoch, then request a checkpoint proof and verify its signatures.
2. Download the state up to the checkpoint.
3. Validate the state hash against the checkpoint proof.
4. The previous epoch's chain has likely progressed in the meantime. Rinse and repeat as necessary until the end-of-epoch commands are observed.
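The steps above can be sketched as a loop. Everything named here (`Checkpoint`, `CheckpointSource`, `sync_until_epoch_end`, the `epoch_ended` flag) is hypothetical and only illustrates the control flow. Proof verification and networking are assumed to happen behind the trait.

```rust
/// Hypothetical summary of a verified checkpoint proof.
#[derive(Clone, Debug, PartialEq)]
struct Checkpoint {
    block_height: u64,
    state_root: [u8; 32],
    /// True once the end-of-epoch commands have been observed.
    epoch_ended: bool,
}

/// Hypothetical peer abstraction; a real node would talk to validators.
trait CheckpointSource {
    /// Request the latest checkpoint proof and verify its signatures.
    fn latest_checkpoint(&mut self) -> Checkpoint;
    /// Download state up to the checkpoint and return the locally computed root.
    fn download_state_root(&mut self, checkpoint: &Checkpoint) -> [u8; 32];
}

/// Repeat request, download and validate until the epoch end is observed.
fn sync_until_epoch_end<S: CheckpointSource>(
    source: &mut S,
    max_rounds: usize,
) -> Result<Checkpoint, String> {
    for _ in 0..max_rounds {
        let checkpoint = source.latest_checkpoint();
        let local_root = source.download_state_root(&checkpoint);
        // Validate the downloaded state against the root in the proof.
        if local_root != checkpoint.state_root {
            return Err(format!(
                "state root mismatch at height {}",
                checkpoint.block_height
            ));
        }
        if checkpoint.epoch_ended {
            return Ok(checkpoint);
        }
        // The chain has progressed in the meantime; go around again.
    }
    Err("epoch did not end within max_rounds".to_string())
}
```

The `max_rounds` bound is a design choice of this sketch so that the loop always terminates; a real implementation would more likely wait on epoch events.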

Checkpoint Proof

The exact construction of the proof needs more thought.

A shard-epoch checkpoint proof for the current commit block can be generated at any point by a participating validator node. The proof contains the jellyfish Merkle root of the state as well as the last 3 linked QCs of the block that is represented in the proof.

For example, if the current tip block is 300, the proof contains the linked QCs of 299, 298, and 297. This proves that block 297 was committed by all non-faulty nodes.
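Since the construction is still open, the following is only one possible shape for such a proof. The struct and field names are hypothetical. The check covers only the linkage between the three QCs (parent ids and consecutive heights, the latter being an assumption about what "linked" means); signature and quorum verification are left out.

```rust
/// Hypothetical 32-byte block identifier.
type BlockId = [u8; 32];

/// Hypothetical, simplified quorum certificate. Signatures are omitted.
struct QuorumCertificate {
    block_id: BlockId,
    block_height: u64,
    parent_id: BlockId,
}

/// Hypothetical checkpoint proof: a state root plus three linked QCs.
struct CheckpointProof {
    /// Jellyfish Merkle root of the committed state.
    state_root: [u8; 32],
    /// QCs ordered newest first, e.g. heights 299, 298, 297.
    qcs: [QuorumCertificate; 3],
}

/// Returns the height of the committed block if the three QCs form a
/// direct parent chain with consecutive heights, otherwise `None`.
fn committed_height(proof: &CheckpointProof) -> Option<u64> {
    let [newest, middle, oldest] = &proof.qcs;
    let linked = newest.parent_id == middle.block_id
        && middle.parent_id == oldest.block_id
        && newest.block_height == middle.block_height + 1
        && middle.block_height == oldest.block_height + 1;
    // The oldest block in the chain is the one considered committed.
    linked.then_some(oldest.block_height)
}
```

With the example above, QCs for 299, 298 and 297 that link correctly would yield 297 as the committed height, and the downloaded state would then be checked against `state_root`.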

stringhandler commented 4 months ago

Sounds good. Except that I would say they should download the checkpoint at the beginning of the current epoch (or potentially at the last mini checkpoint location in the current epoch) and then replay all blocks since then, up to the current tip.
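A toy sketch of that replay step, assuming blocks arrive in order with no gaps. `Block`, `state_delta` and `replay_from_checkpoint` are hypothetical, and the `i64` state is just a stand-in for real substate changes.

```rust
/// Hypothetical block; `state_delta` stands in for real state changes.
struct Block {
    height: u64,
    state_delta: i64,
}

/// Apply every block after the checkpoint, in order, up to the tip.
/// Returns the tip height and the resulting (toy) state.
fn replay_from_checkpoint(
    checkpoint_height: u64,
    mut state: i64,
    blocks: &[Block],
) -> Result<(u64, i64), String> {
    let mut height = checkpoint_height;
    for block in blocks {
        // Refuse to continue if a block is missing or out of order.
        if block.height != height + 1 {
            return Err(format!(
                "expected block {}, got {}",
                height + 1,
                block.height
            ));
        }
        state += block.state_delta;
        height = block.height;
    }
    Ok((height, state))
}
```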

sdbondi commented 2 months ago

Implemented in #1067

tari-project / tari-dan