spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io

SMIP: mesh data sync #51

Closed countvonzero closed 1 year ago

countvonzero commented 3 years ago

overview

mesh sync is the functionality that guarantees that a node in one of the following conditions can join the Spacemesh network and start earning rewards within a “reasonable” amount of time.

once a node is “synced”, it has the same mesh hash and state root as the rest of the network; only then can it start participating in the consensus protocol and earning rewards.

goals and motivation

correctness

speed

high-level design

mesh sync speed

as the mesh size increases over time, it is reasonable to assume that syncing from genesis will become a time-consuming process. as of march 2021, it takes 8-10 days for an Ethereum node to do a full sync (from genesis). since Spacemesh is set to have at least an order of magnitude greater maximum transaction throughput, one can reasonably expect that we will have to provide a faster way to sync a node not long after the mainnet launch.

based on Geth and Parity Ethereum client development, we may consider the following approaches for mesh sync.

the speed aspect of sync improvement is beyond the scope of this SMIP, which will focus on the correct behavior of the sync component first.

mesh correctness

the content in this section expands on SMIP https://github.com/spacemeshos/SMIPS/issues/43.

current code - state transition

the sync component has 3 states: notSynced, gossipSync, and synced.

below is the state transition diagram.

[figure: state-transition]

today go-spacemesh implements this state transition diagram.

current code - control flow

to ensure all nodes reach the same mesh hash and state root once synced, we want to implement the following control flow diagram.

[figure: sync-cfg-v2]

the current implementation does everything except for the salmon-pink colored boxes, i.e. resolving hash differences with peers. the bold green arrows describe the steady-state flow.

current code - components behavior

this diagram describes the interaction between spacemesh components.

[figure: go-spacemesh-chart]

note that in the diagram above, there are two categories of gossip:

the rationale for not gossiping while syncing many layers is: before accepting a message, we need to download all the messages it references. because data arriving via gossip does not observe any particular, enforced order, listening to gossip while syncing is inefficient and will most likely cause a memory blowup.

the table below summarizes whether a component should be active in each sync state (“ideally” columns) and the current state of the code (“now” columns).

| component | notSynced (ideally) | notSynced (now) | gossipSync (ideally) | gossipSync (now) | synced (ideally) | synced (now) |
| --- | --- | --- | --- | --- | --- | --- |
| publish ATXs | N | N | N | N | Y | Y |
| build blocks | N | N | N | N | Y | Y |
| gossip blocks/TXs/ATXs | N | N | Y | Y | Y | Y |
| gossip PoET proofs | Y | Y | Y | Y | Y | Y |
| gossip hare protocol | maybe | maybe | maybe | maybe | Y | Y |
| tortoise beacon | N | N | N | N | Y | Y |
| verifying tortoise | Y | Y | Y | Y | Y | Y |
| self healing | Y | Y | Y | Y | Y | Y |
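to make the table concrete, here is a minimal sketch of how a node could gate each activity on its sync state. the SyncState type and the activity names are illustrative, not the actual go-spacemesh API.

```go
package sync

// SyncState enumerates the three sync states from the table above.
type SyncState int

const (
	NotSynced SyncState = iota
	GossipSync
	Synced
)

// activities captures the "ideally" columns of the table; verifying tortoise
// and self-healing are active in every state and are therefore omitted here.
type activities struct {
	publishATXs    bool
	buildBlocks    bool
	gossipData     bool // blocks/TXs/ATXs
	gossipPoET     bool
	tortoiseBeacon bool
}

// activitiesFor returns what a node should ideally do in a given sync state.
func activitiesFor(s SyncState) activities {
	switch s {
	case Synced:
		return activities{publishATXs: true, buildBlocks: true, gossipData: true, gossipPoET: true, tortoiseBeacon: true}
	case GossipSync:
		return activities{gossipData: true, gossipPoET: true}
	default: // NotSynced: only context-free PoET proofs are gossiped
		return activities{gossipPoET: true}
	}
}
```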

proposed implementation

mesh hash reconciliation

at the beginning of each sync round, this process checks if the node and its peers agree on the mesh hash at the most recent layer. if not, it starts the reconciliation protocol with peers. the outcome of the protocol is that all peers reach consensus on the mesh hash. the resolution flow is described below:

[figure: hash-resolution-cfg-v2]

find divergence layer

a binary search is conducted between two peers to find the first layer at which their mesh hashes disagree.
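a minimal sketch of this search, assuming a hypothetical hashFn helper that returns the aggregated mesh hash at a layer (locally or from the peer); the real fetcher API may differ.

```go
package sync

import "bytes"

// hashFn returns the aggregated mesh hash at a given layer; one instance reads
// the local mesh, the other queries the peer. both are hypothetical helpers.
type hashFn func(layer uint32) ([]byte, error)

// findDivergenceLayer returns the first layer in (lastAgreed, latest] where
// the local and peer aggregated hashes differ. it assumes the hash is
// cumulative, so agreement at layer L implies agreement at every layer <= L.
func findDivergenceLayer(lastAgreed, latest uint32, local, peer hashFn) (uint32, error) {
	lo, hi := lastAgreed+1, latest
	for lo < hi {
		mid := lo + (hi-lo)/2
		ours, err := local(mid)
		if err != nil {
			return 0, err
		}
		theirs, err := peer(mid)
		if err != nil {
			return 0, err
		}
		if bytes.Equal(ours, theirs) {
			lo = mid + 1 // still in agreement at mid; divergence is later
		} else {
			hi = mid // hashes differ at mid; divergence is here or earlier
		}
	}
	return lo, nil
}
```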

exchange ATXs

the goal of this phase is to exchange the ATX set difference between two peers. the naive approach is to sync ATXs with the peer epoch by epoch. a potential optimization is to adopt the Graphene set reconciliation protocol.
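a minimal sketch of the naive epoch-by-epoch exchange; peerATXIDs, haveATX and fetchATX are hypothetical helpers, not the actual fetcher API.

```go
package sync

// syncATXsByEpoch walks epochs in order and downloads any ATXs the peer has
// that we do not.
func syncATXsByEpoch(firstEpoch, lastEpoch uint32,
	peerATXIDs func(epoch uint32) ([][32]byte, error),
	haveATX func(id [32]byte) bool,
	fetchATX func(id [32]byte) error,
) error {
	for epoch := firstEpoch; epoch <= lastEpoch; epoch++ {
		ids, err := peerATXIDs(epoch)
		if err != nil {
			return err
		}
		for _, id := range ids {
			if !haveATX(id) {
				if err := fetchATX(id); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```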

what next?

for this SMIP, we take the naive and inefficient approach for the network to converge to the same mesh hash: as soon as we get the new data (ATXs and voting blocks) from the peers, we immediately run the verifying tortoise over the new data, layer by layer, until the node agrees with its peers on the mesh hash.
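a rough sketch of that loop, with hypothetical stand-ins for the tortoise and hash lookups.

```go
package sync

import "bytes"

// reconcileFrom applies newly synced data layer by layer, re-running the
// verifying tortoise at each layer, and reports whether the node ends up
// agreeing with the peer on the aggregated hash of the latest layer.
func reconcileFrom(
	divergence, latest uint32,
	runVerifyingTortoise func(layer uint32) error,
	localHash, peerHash func(layer uint32) ([]byte, error),
) (bool, error) {
	for layer := divergence; layer <= latest; layer++ {
		if err := runVerifyingTortoise(layer); err != nil {
			return false, err
		}
	}
	ours, err := localHash(latest)
	if err != nil {
		return false, err
	}
	theirs, err := peerHash(latest)
	if err != nil {
		return false, err
	}
	return bytes.Equal(ours, theirs), nil
}
```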

further performance improvements are described in a followup SMIP.

questions

Q: during mesh hash resolution, if all peers do not agree, then how is this resolved? drop the minority peer(s)? A: say a node is malicious and does not engage in hash resolution; an honest node needs to drop the peer and blacklist it.

Q: why would we process some gossip messages, but not others, depending on sync state? bad design?

Q: is it safe to gossip about PoET proofs in notSynced? gossipSync? A: PoET proof messages can be validated without context, so it’s safe to process and gossip about them when the node is not synced.

Q: should a node participate in the tortoise beacon in notSynced? gossipSync? A: a node participates in the tortoise beacon in the gossipSync and synced states.

Q: can verifying tortoise work in notSynced? gossipSync?

Q: can the node do self-healing in notSynced? gossipSync?

dependencies and interactions

hare

fetcher

stakeholders and reviewers

@antonlerner @countvonzero

testing and performance

unit tests, CI tests

countvonzero commented 3 years ago

notes from 07/07/21 R&D meeting

the hash resolution protocol presents a DoS risk. the adversary can share an arbitrarily old block and cause all honest nodes to binary search with peers through history and go out-of-sync. therefore, the node needs to be presented with evidence that the peer has enough accumulated weight voting for the old block in question. only then can the node accept the old block and attempt to verify it.
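a hedged sketch of that guard; the exact evidence format and threshold are left open in this note, so peerEpochWeight, localEpochWeight and threshold are placeholders.

```go
package sync

// acceptOldBlock decides whether to engage with a peer that presented an old
// block: the peer must first show accumulated ATX weight for the block's epoch
// that is significant relative to our own view of that epoch.
func acceptOldBlock(epoch uint32,
	peerEpochWeight, localEpochWeight func(epoch uint32) uint64,
	threshold func(localWeight uint64) uint64,
) bool {
	return peerEpochWeight(epoch) >= threshold(localEpochWeight(epoch))
}
```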

pre unified content block

at the current (last-ticked) layer 1000, node Q found that its mesh hash differs from P's. it initiated a binary search with P until the divergence layer was found at layer 100 (epoch 3). P presented Q with a block B at layer 100.

at this time Q asks P for ATXs@epoch 2. if, compared with Q’s own ATXs@epoch2, the total weight of P’s ATXs@epoch 2 is

note that the total weight of the epoch ATXs is an approximation of the accumulated weight of blocks that voted for block B in P’s local view.

Q: under what conditions can the ignore list of blocks/ATXs change the node’s local opinion? A: consider the network being split into 3 zones X, Y, Z. node Q is in zone X and node P is in zone Y. when zones X and Y rejoin, P’s ATXs may not weigh enough to change Q’s local opinion and go into the ignore list. when zone Z rejoins with X and Y, node Q connects with node R from zone Z; R also does not have enough weight to change Q’s local opinion and its ATXs also go into the ignore list. however, P’s and R’s combined total weight may turn out to be enough to either change Q’s local opinion, or send Q into self-healing mode if the verifying tortoise fails. we need to be able to detect this case.

unified content block

terminology

for each layer, the hare takes all proposal blocks S as input and reaches consensus on the set of final proposal blocks K. it then runs an algorithm to select a subset of TXs from all TXs in K, generating rewards for the voting blocks along the way, and creates a unified content block as the hare output.
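an illustrative sketch of the step after hare consensus, under simplified assumptions: the types, the TX-selection rule and the reward rule below are placeholders, not the real spacemesh data model.

```go
package hare

// TX and Proposal are illustrative stand-ins for the real types.
type TX struct{ ID [32]byte }

type Proposal struct {
	SmesherID [32]byte
	TXs       []TX
}

type UnifiedContentBlock struct {
	Layer   uint32
	TXs     []TX
	Rewards map[[32]byte]uint64 // smesher ID -> reward
}

// buildUnifiedContentBlock takes the final proposal set K, deduplicates the
// union of their TXs, and credits each proposer with a reward.
func buildUnifiedContentBlock(layer uint32, k []Proposal, rewardPer uint64) UnifiedContentBlock {
	ucb := UnifiedContentBlock{Layer: layer, Rewards: map[[32]byte]uint64{}}
	seen := map[[32]byte]bool{}
	for _, p := range k {
		ucb.Rewards[p.SmesherID] += rewardPer
		for _, tx := range p.TXs {
			if !seen[tx.ID] {
				seen[tx.ID] = true
				ucb.TXs = append(ucb.TXs, tx)
			}
		}
	}
	return ucb
}
```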

Q: how does this change the syncer? A: not much. instead of syncing old-style blocks (with TXs and votes), syncer now syncs unified content blocks and voting blocks (for hare).

Q: how does this change the hash resolution process? A: the only part that changes is how to determine whether a unified content block is valid or not. but this function IsBlockValid() should be a black box for syncer.

Q: how to determine validity for the unified content block? A: TBD

conclusion

for now

for the ignore list of blocks/ATXs that we discounted, there should be a periodic check to see whether they have accumulated enough weight to cause the tortoise to change its local opinion, or fail, in which case self-healing is triggered.

also sync the ignore list with peers, with advice not to apply them. implement a dummy IsBlockValid() with the correct parameters that always returns true (see the sketch below).
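a minimal version of such a stub might look like this; the parameter is an assumption, since the real signature of IsBlockValid() is still open.

```go
package mesh

// Block is a placeholder for the unified content block type.
type Block struct{}

// IsBlockValid is the dummy implementation proposed for now: it always returns
// true; the real validity check is deferred to mainnet.
func IsBlockValid(b *Block) bool {
	return true
}
```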

for mainnet

implement IsBlockValid for real.

how components behave in different sync states

ATX/PoET proofs

for messages that can be locally verified without historical data, the node should always gossip about them. such messages are ATXs and PoET proofs. PoET proofs are shared among ATXs and hence have a separate gossip channel from ATXs.

optimization: only gossip PoET proofs that are referenced in ATXs.

hare

best effort approach: the node only participates in the hare protocol when in synced state. once it starts, it will continue until the protocol terminates, regardless of the sync state of the node.

optimization: if the node is not synced, it can just skip the pre-round, where a valid active set is required, but participate in other rounds of the hare protocol.

verifying tortoise

should work in all states, as it’s just tallying votes.

tortoise beacon / self-healing

as long as the node has the initial proposal of the tortoise beacon, it should be able to continue to participate in the tortoise beacon protocol even if the node is not synced.

one of the reasons a node can go out-of-sync is that it disagrees with peers about the mesh hash. but that disagreement is a given in the self-healing state, which means self-healing needs to operate normally even when the node is out-of-sync. for self-healing to work we need the tortoise beacon to operate and the node to continue gossiping blocks.

considering the cases above, this means we cannot simply turn off gossip when the node goes out-of-sync. more generally, we need more granularity for the sync state.

state sync vs message sync

terminology

the gossip layer can tell its subscribers that the node is message-synced or not. the syncer can tell its subscribers that the node is state-synced or not. a component then combines the information from the gossip layer and syncer to make its local decisions on what it can and cannot do.

for example, when the node is not state-synced, the tortoise beacon can continue to operate if the node is message-synced, provided it has the initial proposal. blocks will also still be gossiped in self-healing mode.
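a small sketch of how a component could combine the two signals; the names and the gating rule are illustrative, not the go-spacemesh API.

```go
package node

// SyncStatus combines the two signals described above: the gossip layer
// reports whether the node is message-synced and the syncer reports whether
// it is state-synced.
type SyncStatus struct {
	MessageSynced bool // caught up with current gossip traffic
	StateSynced   bool // mesh hash / state root agrees with peers
}

// CanRunTortoiseBeacon mirrors the example above: the beacon may keep running
// even when the node is not state-synced, as long as it is message-synced and
// holds the initial beacon proposal.
func (s SyncStatus) CanRunTortoiseBeacon(hasInitialProposal bool) bool {
	return s.MessageSynced && hasInitialProposal
}
```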

optimization: keep a cache of recent gossip messages to help a node that went offline temporarily get back into the message-synced state quickly. for example, a node that missed the initial tortoise beacon proposal can grab this proposal from a peer’s cache and participate in the beacon protocol.

countvonzero commented 3 years ago

notes from 7/12/21 research meeting

assuming nodes prune contextually invalid blocks, or non-rewarded voting blocks, and keep all ATXs: if two nodes agree on the accumulated hash, and no one proposes any new rewarded voting blocks, then we are still in agreement and the mesh hash is still valid.

proposed sync protocol

prune contextually invalid blocks and keep all the ATXs. when presented with an old ATX, count the total weight and see if the verifying tortoise succeeds. if not, then run the vote-counting tortoise with the data that you have, assuming neighbors didn’t produce new data.

if a neighbor does present new data, e.g. non-rewarded voting blocks, and the vote-counting tortoise changes its opinion, then we need to change mesh history.

conclusion

not sure whether pruning contextually invalid blocks is secure. TBD.

countvonzero commented 3 years ago

notes from 07/19/21 research meeting

sync ATXs before anything else

block/hare eligibility relies on ATXs. ATXs, on the other hand, only rely on previous ATXs. syncing all ATXs (in epoch order) before syncing other data may be beneficial to the node. for example, it can start gossiping about ATXs before the state is synced.

Q: can we start hare protocol once all ATXs are synced? A: hare active set is extracted from ATXs in the blocks and therefore still relies on blocks being synced.

pruning contextual invalid blocks

it may not be safe to prune contextually invalid blocks. assume a 2-way network split of 60/40. the 60-side then splits further into 35/25. later, the 35-side rejoins with the 40-side first; the 35-side adopts the 40-side's view and prunes its now contextually invalid blocks. the 25-side later rejoins and does the same. now the network is in consensus on the original 40-side's view.

we may want to have a higher threshold for pruning than the tortoise confidence threshold. we may lose some data in extreme cases, but at least the network is still in consensus.

countvonzero commented 3 years ago

proposed sync protocol w.r.t accumulated hash resolution

scenario A: i just got back online (catch-up sync). my last synced layer is N

in this case, i'll first check that my accumulated layer hash at layer N is the same as my peers'.

A.1: layer N hashes differ

we engage in a binary search to find the first layer at which we differ, say M.

A.1.1: if the peer has more total ATX weight between layers M and N

i adopt the peer's mesh and put the diff blocks/ATXs between the two meshes in the ignore list. i then do a wholesale sync from the peer starting from M.

A.1.2: if i have more total ATX weight than my peer between layers M and N

my peer should adopt my view and put the diff blocks/ATXs in its ignore list. the peer then adopts my view from M to N. from there i do a wholesale sync from the peer starting from layer N.

note: when exchanging ATXs since layer N, we exchange ATX hashes for each epoch and diff the peer’s ATXs with mine. here i imagine we can sort by hashes and diff two sorted lists (see the sketch after these notes). the timestamp we keep for an ATX locally is the timestamp when we received the ATX, either via sync or via gossip, so i'm not sure if the timestamp helps in this case.

note: wholesale sync means syncing valid ATXs/blocks and the ignore lists of ATXs/blocks.
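a minimal sketch of the sort-and-diff idea from the first note; the types are illustrative and the helper is hypothetical.

```go
package sync

import (
	"bytes"
	"sort"
)

// diffSortedATXHashes returns, for one epoch, the ATX hashes the peer has that
// we lack and the ones we have that the peer lacks. both inputs are sorted
// first, then merged in a single pass.
func diffSortedATXHashes(mine, theirs [][32]byte) (missingHere, missingThere [][32]byte) {
	less := func(s [][32]byte) func(i, j int) bool {
		return func(i, j int) bool { return bytes.Compare(s[i][:], s[j][:]) < 0 }
	}
	sort.Slice(mine, less(mine))
	sort.Slice(theirs, less(theirs))
	i, j := 0, 0
	for i < len(mine) && j < len(theirs) {
		switch bytes.Compare(mine[i][:], theirs[j][:]) {
		case 0: // both sides have it
			i++
			j++
		case -1: // only we have it
			missingThere = append(missingThere, mine[i])
			i++
		default: // only the peer has it
			missingHere = append(missingHere, theirs[j])
			j++
		}
	}
	missingThere = append(missingThere, mine[i:]...)
	missingHere = append(missingHere, theirs[j:]...)
	return missingHere, missingThere
}
```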

A.2: layer N hashes agree

i simply do a wholesale sync from the peer starting from layer N

scenario B: i am synced and got an old ATX/block via gossip

i check if this ATX hash is in the local DB. if not, put it in the ATX ignore list. same for the old block.

scenario C: i am synced and received many old ATXs directly from a peer

presumably during mesh hash resolution. i diff the ATXs with mine locally and decide whether to adopt my peer’s view or not.

note: there should be a periodic process checking whether the ignore list of ATXs/blocks can change verifying tortoise’s opinion.
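a sketch of such a periodic process, with hypothetical callbacks standing in for the ignore-list weight check and the tortoise rerun.

```go
package sync

import "time"

// recheckIgnoreList periodically asks whether the weight accumulated in the
// ignore lists is now enough to change the verifying tortoise's opinion, and
// if so reruns the tortoise, which may in turn trigger self-healing.
func recheckIgnoreList(interval time.Duration,
	ignoredWeightSignificant func() bool,
	rerunVerifyingTortoise func() error,
	stop <-chan struct{},
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if ignoredWeightSignificant() {
				_ = rerunVerifyingTortoise() // a failure here is what triggers self-healing
			}
		}
	}
}
```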

lrettig commented 3 years ago

note: there should be a periodic process checking whether the ignore list of ATXs/blocks can change verifying tortoise’s opinion.

Note: this has already been implemented in self-healing (though it could be improved, see https://github.com/spacemeshos/go-spacemesh/issues/2551)

countvonzero commented 3 years ago

Note: this has already been implemented in self-healing (though it could be improved, see spacemeshos/go-spacemesh#2551)

thanks for the update @lrettig. so what's left to do for the periodic recheck of tortoise opinion from genesis is to consider the (not-yet-created) ignore lists for ATXs/blocks. is that correct?

lrettig commented 3 years ago

I don't fully understand the "ignore list" concept but what remains to be done is proper accounting of weights. As @tal-m has explained, if the verifying tortoise finalizes a layer, there must be overwhelming support in favor of the valid blocks in that layer, so it's safe to ignore a few late blocks here and there, but at some point, esp. in a scenario such as a deep reorg, the accumulated weight of older blocks/votes could in theory override the previous tortoise opinion for a verified layer.

countvonzero commented 3 years ago

it's safe to ignore a few late blocks here and there

right. when we receive late blocks that are not significant enough to change the verifying tortoise's opinion, we need to save those late blocks in an ignore list. the rationale is that, over time, those late blocks may accumulate enough weight to change the verifying tortoise's opinion. same for late ATXs.

if we discover, through the syncer's hash reconciliation process, that we missed some old ATXs/blocks, we don't immediately trigger the verifying tortoise for the old data. we put them in the ignore list until they become significant enough to change the verifying tortoise's opinion. at that point, we rerun the verifying tortoise, which may trigger self-healing. this is exactly what the thread needs to do: check whether the data in the ignore list has become significant enough. i think this is one of the reorg scenarios.