spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal
7 stars 1 forks source link

SMIP: mesh state sync #60

Open countvonzero opened 2 years ago

countvonzero commented 2 years ago

overview

a previous SMIP: mesh data sync describes a naive and inefficient approach on how a node syncs with the peers when it discovers that it has different mesh hash with its peers. the previous naive approach syncs all data from its peers and re-run verifying tortoise on the new data. this is extremely inefficient and negates the efficiency optimization that takes verifying tortoise's complexity from quadratic to linear.

this SMIP will concentrate on how to avoid re-running verifying tortoise when we receive new data from peers.

goals and motivation

in order to retain the efficiency optimization that takes verifying tortoise's complexity from quadratic to linear, we want to avoid re-running verifying tortoise for a given layer. this is similar to the idea that we don't want to trigger verifying tortoise every time we receive a new late ATX/block. (https://github.com/spacemeshos/go-spacemesh/issues/2551).

high-level design

the diagram below describes the work flow.

hash-resolution-cfg-v3

Proposed implementation

step 1: sync mesh data

at the beginning of each sync round, this process checks if the node and its peers agree on the mesh hash at the most recent layer. if not, it starts the reconciliation protocol with peers. the outcome of the protocol is that all peers reach consensus on the mesh hash.

find divergence layer (should be done by the previous SMIP SMIP: mesh data sync )

a binary search is conducted between two peers to find the first layer of the mesh hash they disagree on.

exchange ATXs

the goal of this phase is to exchange the ATXs set difference between two peers. the naive approach is to sync ATXs with the peer epoch by epoch. an potential optimization is to adopt Graphene set reconciliation protocol. see the discussion below.

weigh new ATXs

after we gather the union of ATXs of our peers, we consider the new ATXs received from peers and decide whether we should download the voting blocks that reference these ATXs.

the ignore lists are used to keep the ATXs/blocks we received after verifying tortoise has run for the layer so the node knows the total amount of weight units it is ignoring. it is likely that these weight units accumulated over time and become significant enough that can cause verifying tortoise to change its opinion. as such, there should be a periodic process checking whether the ignore list of ATXs/blocks can change verifying tortoise’s opinion.

step 2: sync mesh state

a separate thread in the system (likely piggyback on the implementation of https://github.com/spacemeshos/go-spacemesh/issues/2551) will periodically check the ignore list of ATXs/voting blocks to decide whether they are significant enough to change verifying tortoise's opinion. there are two possible outcomes:

sync from a confident peer

syncing from a confident peer means to wholesale sync the peer's layer from the divergent layer, including its ATXs, content blocks, voting blocks, and the ignore lists of ATXs and voting blocks. this is essentially a reorg, where we reset the mesh state to the divergent layer and sync from the confident peer.

no confident peer in the network

when we cannot find a confident peer in the network, we need to run verifying tortoise on all available data, including the data in the mesh and the data in the ignore list to reconstruct the mesh. all nodes should be doing the same thing on the same set of data and potentially go into self-healing.

discussion on Graphene

Graphene is an interactive protocol that does set reconciliation between two peers' mempool and the newly created block, while minimizing data over the wire. while it seems highly suitable for the exchanging ATXs between two peers.

we could potentially modify the protocol so that for each failed iteration where the set difference exceeds the max number, the protocol presents each peer with some evidence of the difference to allow the node to verify before continuing to engage in the protocol.

Implementation plan

Questions

Dependencies and interactions

Stakeholders and reviewers

Testing and performance