spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal

Self-healing #46

lrettig closed 1 year ago

lrettig commented 3 years ago

Overview

This SMIP is a proposal to fully implement self-healing on top of the existing consensus code (Hare, verifying Tortoise). Self-healing is a critical part of the core Spacemesh consensus protocol.

Goals and motivation

Self-healing allows the network to recover from any state that results from a temporary violation of our assumptions, once those assumptions hold again. In general, we want self-healing to be triggered only when absolutely necessary, to run as quickly and efficiently as possible (i.e., consume the minimum required resources and return to "non-healing mode" as soon as possible), and to be as transparent as possible to users and node operators.

High-level design

In "ordinary" (non-healing) mode, Hare successfully produces results for each layer, and the block builder uses the Hare output to calculate the votes for each newly created block (by selecting a base block and adding a list of exceptions). Verifying Tortoise validates a complete layer when the "global opinion" (the votes of blocks in the hdist layers that follow it) a. has a total weight that exceeds a confidence threshold, b. does not disagree with the Hare output ("input vector"), and c. does not contain an "abstain" vote for any block in that layer (in practice, "abstain" votes are only ever for an entire layer, and only while waiting for Hare to finish for that layer). Ordinarily, the output of Hare agrees with the global opinion on every block and Hare finishes on time, so verifying Tortoise succeeds in verifying new layers quickly.
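The three verification conditions can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: `BlockTally`, `globalOpinion`, and `canVerifyLayer` are hypothetical names, not go-spacemesh types, and the sketch leaves open where exactly an "abstain" vote is recorded by reducing it to a boolean per block.

```go
package main

import "fmt"

// Vote is an opinion on a single block.
type Vote int

const (
	Against Vote = -1
	Neutral Vote = 0
	Support Vote = 1
)

// BlockTally is a hypothetical per-block summary used only in this sketch.
type BlockTally struct {
	NetWeight float64 // signed vote weight from the hdist following layers
	Input     Vote    // Hare output ("input vector") for this block
	Abstained bool    // an "abstain" vote is recorded for this block
}

// globalOpinion turns a net weight into a decided vote, or Neutral when the
// weight does not cross the confidence threshold in either direction.
func globalOpinion(net, threshold float64) Vote {
	switch {
	case net > threshold:
		return Support
	case net < -threshold:
		return Against
	default:
		return Neutral
	}
}

// canVerifyLayer reports whether every block in the layer passes conditions a-c.
func canVerifyLayer(blocks []BlockTally, threshold float64) bool {
	for _, b := range blocks {
		opinion := globalOpinion(b.NetWeight, threshold)
		if opinion == Neutral { // (a) confidence threshold not reached
			return false
		}
		if opinion != b.Input { // (b) disagrees with the Hare output
			return false
		}
		if b.Abstained { // (c) an abstain vote is present
			return false
		}
	}
	return true
}

func main() {
	layer := []BlockTally{
		{NetWeight: 10, Input: Support},
		{NetWeight: -9, Input: Against},
	}
	fmt.Println(canVerifyLayer(layer, 5))
}
```

A layer with no blocks passes trivially in this sketch, which matches the case where a layer is verified as empty.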

While waiting for Hare to finish running for a given layer, all nodes explicitly abstain from voting on that layer (i.e., on all of its blocks), and verifying Tortoise won't verify the layer. If Hare fails for a layer, all nodes explicitly vote against all blocks in that layer, and the layer is verified without any blocks (i.e., any blocks it contains should be marked contextually invalid, and the layer is deemed empty).

If the global opinion on a block differs from the input vector, the verifying Tortoise effectively fails and gets stuck, since the assumptions it relies on no longer hold. This is the scenario in which self-healing needs to be triggered.

Proposed implementation

We introduce a new parameter, zdist, which is the distance (in layers) that a node will wait for Hare results before giving up. A second parameter, n, specifies how many layers after a node has established its opinion on a layer (based either on the Hare output or on the fact that Hare failed/timed out) all nodes should have full confidence in that layer and be able to verify it. Verifying Tortoise is considered to have failed when it is unable to verify a layer older than zdist + n layers.
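A minimal sketch of the failure check, assuming layers are numbered with increasing integers and that "older than zdist + n" means the oldest unverified layer trails the current layer by more than zdist + n. The `Config` struct and `verifyingTortoiseFailed` are hypothetical names for illustration only.

```go
package main

import "fmt"

// Config holds the two distances from the proposal (hypothetical struct).
type Config struct {
	Zdist uint32 // layers to wait for Hare results before giving up
	N     uint32 // extra layers after which a layer should be verifiable
}

// verifyingTortoiseFailed reports whether the oldest unverified layer is
// older than zdist + n layers relative to the current layer.
func verifyingTortoiseFailed(cfg Config, lastVerified, current uint32) bool {
	oldestUnverified := lastVerified + 1
	if oldestUnverified > current {
		return false // everything up to the current layer is verified
	}
	age := current - oldestUnverified
	return age > cfg.Zdist+cfg.N
}

func main() {
	cfg := Config{Zdist: 5, N: 3}
	fmt.Println(verifyingTortoiseFailed(cfg, 10, 19)) // oldest unverified layer is 8 layers old
	fmt.Println(verifyingTortoiseFailed(cfg, 10, 20)) // oldest unverified layer is 9 layers old
}
```

When the check returns true, the range from the last verified layer to the failed layer is what would be handed to the slow Tortoise.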

When the verifying Tortoise fails, it triggers the "slow" (full, vote counting) Tortoise for the interval from the last verified layer to the failed layer. Unlike the verifying Tortoise, which only counts the votes of blocks it considers "good" (i.e., those whose votes agree with the local opinion), the slow Tortoise counts all votes in all blocks and does not rely upon the input vector at all. The opinion of the slow Tortoise is definitive: if it contradicts the verifying Tortoise (on the contextual validity of a given block), it effectively rewrites history, which will require a state reversion (akin to a reorg).
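The difference between the two counting modes can be illustrated with a small sketch. The `Block` type, its `Good` flag, and the `tally` helper are hypothetical simplifications: real votes are encoded as a base block plus exceptions, which is omitted here.

```go
package main

import "fmt"

// Block is a hypothetical, simplified voting block.
type Block struct {
	Weight float64
	Good   bool           // its votes agree with the local opinion
	Votes  map[string]int // target block ID -> +1 (support), -1 (against), 0 (abstain)
}

// tally sums the weighted votes on target. With goodOnly set it mimics the
// verifying Tortoise; otherwise it counts every block like the slow Tortoise.
func tally(blocks []Block, target string, goodOnly bool) float64 {
	var net float64
	for _, b := range blocks {
		if goodOnly && !b.Good {
			continue
		}
		net += b.Weight * float64(b.Votes[target])
	}
	return net
}

func main() {
	blocks := []Block{
		{Weight: 1, Good: true, Votes: map[string]int{"x": 1}},
		{Weight: 3, Good: false, Votes: map[string]int{"x": -1}},
	}
	fmt.Println("verifying:", tally(blocks, "x", true))
	fmt.Println("slow:     ", tally(blocks, "x", false))
}
```

In this example the two tallies have opposite signs for block "x", which is the situation in which the slow Tortoise's result would override the earlier one and force a state reversion.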

When the block builder constructs new blocks, it votes according to the Hare results for recent layers for which those results are available. For layers up to zdist back for which Hare results are unavailable but may still become available later (i.e., Hare is still running for those layers, and has not failed), it abstains. For layers older than zdist, it explicitly votes against all blocks in the layer (i.e., it votes for an empty layer).
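The per-layer decision above could be sketched as below. `HareStatus`, `LayerVote`, and `voteForLayer` are hypothetical names, and the sketch assumes that an available Hare result is always used, and that a layer within zdist whose Hare run has already failed is voted empty immediately.

```go
package main

import "fmt"

// HareStatus is the state of Hare for one layer (hypothetical enum).
type HareStatus int

const (
	HarePending HareStatus = iota
	HareSucceeded
	HareFailed
)

// LayerVote is the kind of vote the block builder casts on a layer.
type LayerVote int

const (
	VoteHareOutput LayerVote = iota // vote per block according to the Hare output
	VoteAbstain                     // abstain on the whole layer
	VoteEmptyLayer                  // vote against every block in the layer
)

// voteForLayer picks the vote for a layer given the layer under construction.
func voteForLayer(current, layer, zdist uint32, status HareStatus) LayerVote {
	if status == HareSucceeded {
		return VoteHareOutput
	}
	var age uint32
	if current > layer {
		age = current - layer
	}
	if status == HarePending && age <= zdist {
		return VoteAbstain // Hare may still finish for this layer
	}
	return VoteEmptyLayer // Hare failed, or the layer is older than zdist
}

func main() {
	fmt.Println(voteForLayer(100, 98, 5, HarePending) == VoteAbstain)
	fmt.Println(voteForLayer(100, 90, 5, HarePending) == VoteEmptyLayer)
}
```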

We also introduce a new vector, tortoiseOpinion, which stores the most up-to-date opinion on every block. These opinions are stored as net values, where positive values indicate support, negative values indicate opposition, and zero means neutral/abstain. This vector is updated every time verifying Tortoise or slow Tortoise runs. For blocks up to hdist layers back, the block builder votes according to the logic above: based on Hare results, or for an empty layer if Hare has failed/timed out. For blocks older than hdist, it always votes according to tortoiseOpinion. If the net opinion on a block does not exceed the threshold (either for or against), it votes according to the weak coin for that layer (i.e., the layer of the block being constructed) instead.
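For blocks older than hdist, the vote with weak-coin fallback might look like the following sketch. The map-based `tortoiseOpinion`, the string block IDs, and `voteOnOldBlock` are assumptions made for illustration; how the weak coin is obtained is outside the scope of the example.

```go
package main

import "fmt"

// voteOnOldBlock decides the vote on a block older than hdist.
// netOpinion is the block's entry in tortoiseOpinion; weakCoin is the coin
// for the layer of the block being constructed. true means "support".
func voteOnOldBlock(netOpinion, threshold float64, weakCoin bool) bool {
	switch {
	case netOpinion > threshold:
		return true // confident support
	case netOpinion < -threshold:
		return false // confident opposition
	default:
		return weakCoin // undecided: follow the weak coin
	}
}

func main() {
	// Hypothetical opinions keyed by block ID.
	tortoiseOpinion := map[string]float64{"a": 12, "b": -7, "c": 1.5}
	const threshold = 5.0
	weakCoin := false

	for _, id := range []string{"a", "b", "c"} {
		fmt.Println(id, voteOnOldBlock(tortoiseOpinion[id], threshold, weakCoin))
	}
}
```

Because every honest node sees the same weak coin for a layer with some probability, undecided blocks tend to be pushed in one direction rather than staying split.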

Implementation plan

Questions

Dependencies and interactions

Stakeholders and reviewers

TBD

Testing and performance

TBD

Note that the slow Tortoise is, well, slow and inefficient. There are several proposals for ways to make it more efficient. One option is to add encoding checkpoints.

countvonzero commented 1 year ago

implemented