SMIP: mesh state sync - Githubissues

overview

a previous SMIP: mesh data sync describes a naive and inefficient approach on how a node syncs with the peers when it discovers that it has different mesh hash with its peers. the previous naive approach syncs all data from its peers and re-run verifying tortoise on the new data. this is extremely inefficient and negates the efficiency optimization that takes verifying tortoise's complexity from quadratic to linear.

this SMIP will concentrate on how to avoid re-running verifying tortoise when we receive new data from peers.

goals and motivation

all peers end up with the same mesh hash
avoid triggering verifying tortoise unless it's necessary

in order to retain the efficiency optimization that takes verifying tortoise's complexity from quadratic to linear, we want to avoid re-running verifying tortoise for a given layer. this is similar to the idea that we don't want to trigger verifying tortoise every time we receive a new late ATX/block. (https://github.com/spacemeshos/go-spacemesh/issues/2551).

high-level design

the diagram below describes the work flow.

hash-resolution-cfg-v3

Proposed implementation

step 1: sync mesh data

at the beginning of each sync round, this process checks if the node and its peers agree on the mesh hash at the most recent layer. if not, it starts the reconciliation protocol with peers. the outcome of the protocol is that all peers reach consensus on the mesh hash.

find divergence layer (should be done by the previous SMIP SMIP: mesh data sync )

a binary search is conducted between two peers to find the first layer of the mesh hash they disagree on.

exchange ATXs

the goal of this phase is to exchange the ATXs set difference between two peers. the naive approach is to sync ATXs with the peer epoch by epoch. an potential optimization is to adopt Graphene set reconciliation protocol. see the discussion below.

weigh new ATXs

after we gather the union of ATXs of our peers, we consider the new ATXs received from peers and decide whether we should download the voting blocks that reference these ATXs.

the total weight units of the new ATXs is NOT significant enough: we simply drop these new ATXs in the ignore list and move on. in this case, we are still confident about our mesh and the disagreeing peers should converge to our mesh hash.
the total weight units of the new ATXs is significant enough that we are no longer confident: in this case we download the voting blocks referencing these ATXs and drop them in the ignore list. a separate thread will pick up these new data and determine how we should converge to the same mesh hash with our peers, as shown in the right-hand-side of the diagram.

the ignore lists are used to keep the ATXs/blocks we received after verifying tortoise has run for the layer so the node knows the total amount of weight units it is ignoring. it is likely that these weight units accumulated over time and become significant enough that can cause verifying tortoise to change its opinion. as such, there should be a periodic process checking whether the ignore list of ATXs/blocks can change verifying tortoise’s opinion.

step 2: sync mesh state

a separate thread in the system (likely piggyback on the implementation of https://github.com/spacemeshos/go-spacemesh/issues/2551) will periodically check the ignore list of ATXs/voting blocks to decide whether they are significant enough to change verifying tortoise's opinion. there are two possible outcomes:

the votes in the ignore list are not significant enough and we continue to ignore them: in this case we do nothing.
the votes in the ignore list are significant and we need to change our opinions of the mesh contents: since the protocol guarantees that, given the same data, when any two nodes are both confident about their mesh content, they must have the same mesh hash. we will be faced with two scenarios:
- at least one peer remains confident: reset the mesh and sync from that peer
- no peers remain confident: all peers run verifying tortoise on all data and possibly self-heal.

sync from a confident peer

syncing from a confident peer means to wholesale sync the peer's layer from the divergent layer, including its ATXs, content blocks, voting blocks, and the ignore lists of ATXs and voting blocks. this is essentially a reorg, where we reset the mesh state to the divergent layer and sync from the confident peer.

no confident peer in the network

when we cannot find a confident peer in the network, we need to run verifying tortoise on all available data, including the data in the mesh and the data in the ignore list to reconstruct the mesh. all nodes should be doing the same thing on the same set of data and potentially go into self-healing.

discussion on Graphene

Graphene is an interactive protocol that does set reconciliation between two peers' mempool and the newly created block, while minimizing data over the wire. while it seems highly suitable for the exchanging ATXs between two peers.

Graphene assumes there is a max number of set difference. if the set difference exceed the number, Graphene simply returns an error. an adversary can force its peer to run this protocol indefinitely while upping the max number of difference for each iteration indefinitely.
Graphene is currently implemented only in C++/python. we will need to write a go version of Graphene for go-spacemesh.

we could potentially modify the protocol so that for each failed iteration where the set difference exceeds the max number, the protocol presents each peer with some evidence of the difference to allow the node to verify before continuing to engage in the protocol.

spacemeshos / SMIPS

SMIP: mesh state sync #60