Overview

This SMIP is a proposal to fully implement self-healing on top of the existing consensus code (Hare, verifying Tortoise). Self-healing is a critical part of the core Spacemesh consensus protocol.

Goals and motivation

Self-healing allows the network to recover from any state that results from a temporary violation of our assumptions, once those assumptions hold again. In general, we want self-healing to be triggered only when absolutely necessary, to run as quickly and efficiently as possible (i.e., consume the minimum required resources and return to "non-healing mode" as soon as possible), and to be as transparent as possible to users and node operators.

High-level design

In "ordinary" (non-healing) mode, Hare successfully produces results for each layer, the block builder uses the Hare output to calculate the votes for each newly created block (by selecting a base block and adding a list of exceptions), and verifying Tortoise validates complete layers on the basis that the "global opinion" (the total weight of the votes of blocks in the following hdist layers) a. exceeds a confidence threshold, b. does not disagree with the Hare output ("input vector"), and c. does not contain an "abstain" vote for any block in that layer (in practice, "abstain" votes are only ever for an entire layer, and only while waiting for Hare to finish for that layer). Ordinarily, the output of Hare always agrees with the global opinion on each block, and Hare always finishes on time, so verifying Tortoise always succeeds in verifying new layers quickly.
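The three verification conditions above can be sketched in Go as follows. This is a minimal illustration, not the go-spacemesh implementation: the names `BlockTally`, `globalOpinion`, and `canVerifyLayer` are hypothetical, weights are plain integers, and condition (c) is interpreted here as "the local opinion on the block is not abstain".

```go
package main

import "fmt"

// Vote is an opinion on a single block.
type Vote int

const (
	Against Vote = -1
	Abstain Vote = 0
	Support Vote = 1
)

// BlockTally is a hypothetical per-block summary: the net weighted vote cast by
// blocks in the following hdist layers, plus the local opinion (input vector).
type BlockTally struct {
	NetWeight int64 // positive = net support, negative = net opposition
	Local     Vote
}

// globalOpinion turns a net weight into a vote. A net weight that does not
// exceed the threshold in either direction is treated as undecided.
func globalOpinion(net, threshold int64) Vote {
	switch {
	case net > threshold:
		return Support
	case net < -threshold:
		return Against
	default:
		return Abstain
	}
}

// canVerifyLayer reports whether every block in a layer satisfies conditions
// (a), (b), and (c). A layer with no blocks is trivially verifiable.
func canVerifyLayer(blocks []BlockTally, threshold int64) bool {
	for _, b := range blocks {
		if b.Local == Abstain { // (c) assumed: still waiting for Hare
			return false
		}
		g := globalOpinion(b.NetWeight, threshold)
		if g == Abstain { // (a) confidence threshold not exceeded
			return false
		}
		if g != b.Local { // (b) global opinion disagrees with input vector
			return false
		}
	}
	return true
}

func main() {
	layer := []BlockTally{
		{NetWeight: 10, Local: Support},
		{NetWeight: -8, Local: Against},
	}
	fmt.Println(canVerifyLayer(layer, 5))
}
```

A disagreement under condition (b) is the case that later leads to self-healing; failures of (a) or (c) normally resolve on their own as more votes arrive.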

While waiting for Hare to finish running for a given layer, all nodes explicitly abstain from voting on that layer (i.e., on all of its blocks), and verifying Tortoise won't verify the layer. If Hare fails for a layer, all nodes explicitly vote against all blocks in that layer, and the layer is verified without any blocks (i.e., any blocks it contains should be marked contextually invalid, and the layer is deemed empty).

If the global opinion on a block differs from the input vector, then verifying Tortoise effectively fails and gets stuck, since the assumptions it relies on no longer hold. This is the scenario where self-healing needs to be triggered.

Proposed implementation

We introduce a new parameter, zdist, which is the distance (in layers) that a node will wait for Hare results before giving up. A second parameter, n, specifies the amount of time (in layers) after a node has established its opinion on a layer (based either on the Hare output or on the fact that Hare failed/timed out) by which all nodes should have full confidence about that layer and be able to verify it. Verifying Tortoise is considered to have failed when it is unable to verify a layer older than zdist + n layers.
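The failure condition can be sketched as a simple age check. The `Config` type and the function name below are hypothetical, and layers are assumed to be numbered by consecutive unsigned integers.

```go
package main

import "fmt"

// Config holds the two parameters from the text. Field names are hypothetical.
type Config struct {
	Zdist uint32 // layers a node waits for Hare results before giving up
	N     uint32 // extra layers allowed for global opinion to become confident
}

// verifyingTortoiseFailed reports whether the oldest unverified layer is older
// than zdist + n layers, which is the trigger for self-healing.
func verifyingTortoiseFailed(currentLayer, lastVerified uint32, cfg Config) bool {
	oldestUnverified := lastVerified + 1
	if oldestUnverified > currentLayer {
		return false // everything up to the current layer is verified
	}
	age := currentLayer - oldestUnverified
	return age > cfg.Zdist+cfg.N
}

func main() {
	cfg := Config{Zdist: 5, N: 10}
	fmt.Println(verifyingTortoiseFailed(100, 99, cfg))
	fmt.Println(verifyingTortoiseFailed(100, 80, cfg))
}
```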

When the verifying Tortoise fails, it triggers the "slow" (full, vote counting) Tortoise for the interval from the last verified layer to the failed layer. Unlike the verifying Tortoise, which only counts the votes of blocks it considers "good" (i.e., those whose votes agree with the local opinion), the slow Tortoise counts all votes in all blocks and does not rely upon the input vector at all. The opinion of the slow Tortoise is definitive: if it contradicts the verifying Tortoise (on the contextual validity of a given block), it effectively rewrites history, which will require a state reversion (akin to a reorg).
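The difference between the two counting rules can be illustrated with a small sketch. The `BlockVote` type and `tally` function are hypothetical; the point is only that verifying Tortoise filters by "good" blocks while the slow Tortoise sums everything, so the two can reach opposite conclusions on the same data.

```go
package main

import "fmt"

// BlockVote is one voting block's weighted vote on a single target block.
type BlockVote struct {
	Weight int64
	Vote   int64 // +1 support, -1 against, 0 abstain
	Good   bool  // true if the voting block's votes agree with the local opinion
}

// tally sums weighted votes. With goodOnly set it mimics verifying Tortoise,
// which ignores blocks it does not consider good; otherwise it mimics the slow
// Tortoise, which counts every vote.
func tally(votes []BlockVote, goodOnly bool) int64 {
	var net int64
	for _, v := range votes {
		if goodOnly && !v.Good {
			continue
		}
		net += v.Weight * v.Vote
	}
	return net
}

func main() {
	votes := []BlockVote{
		{Weight: 10, Vote: +1, Good: true},
		{Weight: 30, Vote: -1, Good: false},
		{Weight: 5, Vote: +1, Good: true},
	}
	fmt.Println("verifying:", tally(votes, true))
	fmt.Println("slow:     ", tally(votes, false))
}
```

When the signs of the two tallies differ, as in this example, the slow Tortoise result wins and the block's contextual validity flips, which is what forces a state reversion.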

When block builder constructs new blocks, for recent layers for which Hare results are available, it votes according to the Hare results. For layers up to zdist for which Hare results are unavailable but may still become available later (i.e., Hare is still running for those layers, and has not failed), it abstains. For layers older than zdist, it explicitly votes against all blocks in the layer (i.e., it votes for an empty layer).

We also introduce a new vector, tortoiseOpinion, which stores the most up-to-date opinion on every block. These opinions are stored as net values, where positive values indicate support, negative values indicate opposition, and zero means neutral/abstain. This vector is updated every time verifying Tortoise or slow Tortoise runs. For blocks up to hdist layers back, the block builder votes according to the logic above: based on Hare results, or for an empty layer if Hare has failed/timed out. For blocks older than hdist, it always votes according to tortoiseOpinion. If the net opinion on a block does not exceed the threshold (either for or against), it votes according to the weak coin for that layer (i.e., the layer of the block being constructed) instead.
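The block builder's voting rules from the last two paragraphs can be sketched as a single decision function. All type and field names are hypothetical, the sketch assumes zdist is no larger than hdist, and mapping a weak coin value of true to "support" is an assumption made only for illustration.

```go
package main

import "fmt"

// Vote is the block builder's vote on one block.
type Vote int

const (
	Against Vote = -1
	Abstain Vote = 0
	Support Vote = 1
)

// HareStatus is the state of Hare for the layer containing the target block.
type HareStatus int

const (
	HareSucceeded HareStatus = iota
	HareRunning
	HareFailed
)

// Params groups the protocol parameters used below.
type Params struct {
	Zdist, Hdist uint32
	Threshold    int64 // confidence threshold on the net opinion
}

// VoteInput describes one target block from the builder's point of view.
type VoteInput struct {
	LayerAge     uint32 // layers between the target block and the new block
	Hare         HareStatus
	HareAccepted bool  // Hare output contains this block (if Hare succeeded)
	NetOpinion   int64 // entry of tortoiseOpinion for this block
	WeakCoin     bool  // weak coin for the layer of the block being built
}

// decideVote applies the rules: Hare results within hdist, abstain while Hare
// may still finish (up to zdist), against once Hare failed or timed out, and
// tortoiseOpinion (falling back to the weak coin) beyond hdist.
func decideVote(in VoteInput, p Params) Vote {
	if in.LayerAge > p.Hdist {
		switch {
		case in.NetOpinion > p.Threshold:
			return Support
		case in.NetOpinion < -p.Threshold:
			return Against
		case in.WeakCoin: // assumed mapping: true means support
			return Support
		default:
			return Against
		}
	}
	switch in.Hare {
	case HareSucceeded:
		if in.HareAccepted {
			return Support
		}
		return Against
	case HareRunning:
		if in.LayerAge <= p.Zdist {
			return Abstain
		}
	}
	return Against // Hare failed, or is still running past zdist
}

func main() {
	p := Params{Zdist: 5, Hdist: 20, Threshold: 100}
	fmt.Println(decideVote(VoteInput{LayerAge: 3, Hare: HareRunning}, p))
	fmt.Println(decideVote(VoteInput{LayerAge: 30, NetOpinion: 10, WeakCoin: true}, p))
}
```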

Implementation plan

[x] Implement weak coin based on Hare VRF (https://github.com/spacemeshos/go-spacemesh/pull/2393)
[x] Remove circular dependencies from Hare eligibility
- [x] Use Tortoise active set (https://github.com/spacemeshos/go-spacemesh/pull/2357)
- [x] Use Tortoise beacon (https://github.com/spacemeshos/go-spacemesh/pull/2394)
[x] Ensure block builder works as expected (votes abstain while waiting for Hare results for a layer, then switches to explicit vote against all blocks in the layer)
[x] Add tests that will cause verifying Tortoise to fail, and that should trigger slow Tortoise.
[x] Restore/add full (slow, vote counting) Tortoise code. Modify it to run from the last verified layer.
[x] Modify slow Tortoise to read weak coin when global opinion vote threshold isn't passed.
[x] Modify verifying tortoise logic to detect when it has failed, and hand off computation to the slow Tortoise in this case. Finalize interface between verifying Tortoise and slow Tortoise: do not run verifying Tortoise while slow Tortoise is running, until slow Tortoise knows for sure that verifying Tortoise will start working again from a given layer. Then hand control back to the verifying Tortoise. Make sure it re-runs for any layers for which its data might have changed as a result of self-healing.
[x] Add support for state reversion (already partially in place). Detect differences between slow Tortoise output and previous state, and intelligently apply the difference (as a reorg).
[x] Modify block builder to check Hare results for layers up to hdist, then check the Tortoise-maintained opinion vector after that.
[x] Add support to rerun verifying Tortoise from genesis once in a while, either time-based (e.g., every ten minutes, or just make it run continually in the background, depending on how long it takes) or else accounting-based (when we may have accumulated enough possible changes to old data to make a difference).
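The accounting-based option in the last item could look roughly like the sketch below: accumulate the weight of blocks whose votes reach further back than the verifying Tortoise's window, and signal a rerun once that weight passes a threshold. The `RerunTracker` type and its methods are hypothetical, and how the threshold should be chosen is left open here.

```go
package main

import "fmt"

// RerunTracker accumulates the weight of blocks that vote on layers older than
// the verifying Tortoise's sliding window. All names are hypothetical.
type RerunTracker struct {
	Threshold   uint64
	accumulated uint64
}

// Observe records one block. windowStart is the oldest layer inside the
// sliding window and oldestVotedLayer is the oldest layer the block votes on.
// It reports whether enough weight has accumulated to warrant a rerun.
func (r *RerunTracker) Observe(windowStart, oldestVotedLayer uint32, weight uint64) bool {
	if oldestVotedLayer < windowStart {
		r.accumulated += weight
	}
	return r.accumulated > r.Threshold
}

// Reset clears the accumulated weight, e.g. after a rerun from genesis.
func (r *RerunTracker) Reset() { r.accumulated = 0 }

func main() {
	tracker := &RerunTracker{Threshold: 100}
	fmt.Println(tracker.Observe(50, 60, 80)) // vote stays inside the window
	fmt.Println(tracker.Observe(50, 40, 80)) // outside, but below threshold
	fmt.Println(tracker.Observe(50, 30, 80)) // outside, threshold now exceeded
}
```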

Questions

Why does self-healing require a full (slow, vote counting) Tortoise (as opposed to the verifying Tortoise)? Answer: Verifying Tortoise is designed to work only as long as the global opinion matches the local opinion (Hare output, a.k.a., the input vector).
Does a Hare failure ever trigger self-healing? If Hare fails for a short while (< zdist layers) due to a temporary violation of assumptions, self-healing should not be necessary, since Hare will start working again once the assumptions hold again. For instance, if the synchrony assumption is temporarily violated and many Hare messages arrive late, Hare may fail for several layers, but it will begin working again when synchrony is restored. In the interim, newly created blocks will vote neutral (abstain) while they wait for Hare results for the affected layers. If Hare fails for a long while (> zdist layers), the block builder will give up waiting for Hare results for older layers and will begin to vote for empty layers. This will cause any blocks in those layers to be marked contextually invalid and their transactions to be dropped, but verifying Tortoise should still be able to verify these layers. Once assumptions are restored, Hare should begin working again, and blocks and transactions will be validated once more. (Note: There is presently a circular dependency, since the Hare beacon relies upon the Hare output for a previous "safe layer", but this is temporary and will go away once the Tortoise beacon is working. Then Hare will rely only upon Tortoise.) Answer: Even with Hare termination certification, there could still be disagreement due to violation of assumptions (e.g., short-term dishonest majority). In most cases, self-healing should not be required for Hare to begin working again, but it theoretically could be.
In what scenario would "global opinion" differ from our vote on a block? This means that the counted votes disagree with the input vector, i.e., Hare output. How could this happen? Only in an adversarial scenario, e.g., a balancing attack? Or in a scenario where we disagree whether Hare finished? (Wouldn't Hare termination certification address the latter?) Answer: Input vector doesn't only come from Hare output. While syncing, it's received from peers. So in addition to disagreement about Hare results (see previous bullet point), you could get bad data from your peers, or there could be a network partition leading to different opinions about history, and no (dominant) global opinion.
What, exactly, are the criteria for triggering self-healing? Do we trigger self-healing on a single block in a single layer with a different opinion? Or is there a threshold in terms of number of blocks, number of layers, layer age, or something else? Shouldn't we wait at least hdist layers, i.e., until we know for sure that Hare has failed for this layer for good? Answer: Wait zdist layers, plus an additional buffer (a period of time that a node is willing to wait for global opinion to be established before verifying a layer). If, after this point, verification fails, then self-healing must run.
Should an abstain vote for a layer ever trigger self-healing? Answer: No, we only expect to see an abstain while waiting for Hare results for a layer, and then after zdist the vote must "resolve" into either support or opposition.
Can we ever verify layer n+1 if layer n is currently marked abstain? Or should we give up and keep waiting? Answer: Verification of each layer is totally independent. While we can verify layer n+1 while waiting to verify layer n, we cannot apply its state. (The answer is slightly more nuanced than this: a conservative portion of the state could be optimistically applied, e.g., if Alice has 10 coins in her wallet as of layer n, and doesn't spend any of them in layer n, then sends 5 to Bob in layer n+1, and layer n+1 is verified, then Alice and Bob can both be confident that the transaction is validated.)
Do we want to periodically trigger self-healing even without a failure of the verifying Tortoise? As a "safety check" mechanism of sorts. Answer: We want to periodically rerun the verifying Tortoise from genesis. There are two ways this might be done. The first is to account for accumulated weight, and trigger this only when accumulated weight exceeds a threshold, over which it has the potential to actually change history. A node could account for this weight by counting the weight of blocks that vote for things outside (further back than) verifying Tortoise's sliding lookback window. A simpler approach is just to rerun periodically, or to always have a verifying Tortoise process rerunning from genesis in the background. Important note: Rerunning from genesis still uses a sliding window, and moves it forward. It's the same thing that a node does when syncing from scratch (and might even use the same code path).
Do we ever run self-healing all the way from genesis? Or do we only run it from the previously verified layer? If the latter, how do we "interface" the two? E.g., does the slow Tortoise use blocks from the previously verified layer as base blocks? How exactly would this work? Answer: No. Slow Tortoise should start running at the first layer where verifying Tortoise failed. This is because the validity of a block in layer n only depends upon blocks in layers > n, and we don't care at all about what came before.
What's the algorithm for the slow Tortoise? It should be very similar to the verifying Tortoise, with two exceptions: 1. no input vector (i.e., relies only upon global opinion), and 2. looks back further - all the way to genesis? Or to some intermediate checkpoint? Answer: It constructs a triangular matrix, where each row holds all of the votes for a block, and each column holds all of the votes by a block. It maintains a sliding window, large enough that whp if you pass through this window then everyone is in consensus. When a new block arrives, copy the column from the base block and then apply the exception list. Also, maintain the tortoiseOpinion vector, which is effectively just the sum of each row in the matrix (i.e., the net vote for each block).
What is the interface between block builder and self-healing? Answer: Up to hdist layers back, it uses the Hare result for voting, as described above. Further back than this, it votes on the basis of tortoiseOpinion, using the weak coin when the margin is too close.
What is the API with weak coin? Answer: Very straightforward: when constructing a block for layer n, just use the weak coin result for layer n.
Can slow Tortoise/self-healing ever change what verifying Tortoise told us? I thought verifying Tortoise could only ever report what slow Tortoise would report, but on the last call we discussed how "verifying Tortoise ignores 'really old blocks' and at some point we need to say, we ignored so much stuff that it could actually change the results." Answer: No. But rerunning verifying Tortoise from genesis, with new information, could cause it to change its opinion (see question above, on this).
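The slow Tortoise bookkeeping described in the answers above (one column of votes per block, built by copying the base block's column and applying exceptions, plus a running tortoiseOpinion) can be sketched as follows. This is a simplified illustration under several assumptions: all names are hypothetical, the matrix is stored sparsely as nested maps, each vote is weighted by the voting block's weight, and eviction of blocks that fall out of the sliding window is omitted.

```go
package main

import "fmt"

// BlockID identifies a block. All types here are hypothetical.
type BlockID string

// Block carries its votes as a base block plus a list of exceptions.
type Block struct {
	ID         BlockID
	Weight     int64
	Base       BlockID           // empty if the block has no base block
	Exceptions map[BlockID]int64 // overrides: +1 support, -1 against, 0 abstain
}

// SlowTortoise keeps one column per voting block and the net opinion per block.
type SlowTortoise struct {
	columns map[BlockID]map[BlockID]int64 // voter -> target -> vote
	Opinion map[BlockID]int64             // tortoiseOpinion: weighted row sums
}

func NewSlowTortoise() *SlowTortoise {
	return &SlowTortoise{
		columns: make(map[BlockID]map[BlockID]int64),
		Opinion: make(map[BlockID]int64),
	}
}

// AddBlock builds the new block's column from its base block's column and its
// exception list, then adds the weighted votes to the running opinion.
func (s *SlowTortoise) AddBlock(b Block) {
	col := make(map[BlockID]int64)
	for target, v := range s.columns[b.Base] { // unknown base yields no votes
		col[target] = v
	}
	for target, v := range b.Exceptions {
		col[target] = v
	}
	s.columns[b.ID] = col
	for target, v := range col {
		s.Opinion[target] += v * b.Weight
	}
}

func main() {
	st := NewSlowTortoise()
	st.AddBlock(Block{ID: "A", Weight: 10})
	st.AddBlock(Block{ID: "B", Weight: 10, Exceptions: map[BlockID]int64{"A": 1}})
	st.AddBlock(Block{ID: "C", Weight: 5, Base: "B", Exceptions: map[BlockID]int64{"B": 1}})
	st.AddBlock(Block{ID: "D", Weight: 20, Base: "C", Exceptions: map[BlockID]int64{"A": -1}})
	fmt.Println(st.Opinion["A"], st.Opinion["B"])
}
```

Keeping the opinion as a running sum means each new block costs time proportional to the size of its column rather than to the whole matrix.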

Dependencies and interactions

Verifying Tortoise (triggers the slow Tortoise and self-healing)
Tortoise beacon (is there a dependency on this?)
Hare (Weak coin depends on Hare VRF. Note that there is no direct dependency between Hare and self-healing, e.g., Hare failure does not directly trigger self-healing.)
Block builder (needs to account for slow Tortoise output when adding votes to blocks)
State (state trie needs to support state reversions: rewind to a previous, verified layer, then reapply state forward)
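The "rewind, then reapply forward" requirement on state can be illustrated with a toy ledger that keeps one balance snapshot per layer. This is only a sketch of the idea: the `Ledger` type is hypothetical, a real state trie would not copy whole maps, and transactions that cannot be funded are simply skipped here.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is a simple transfer. All types in this sketch are hypothetical.
type Tx struct {
	From, To string
	Amount   uint64
}

// Ledger stores one balance snapshot per layer; index 0 is the genesis state.
type Ledger struct {
	snapshots []map[string]uint64
}

func cloneBalances(m map[string]uint64) map[string]uint64 {
	out := make(map[string]uint64, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func NewLedger(genesis map[string]uint64) *Ledger {
	return &Ledger{snapshots: []map[string]uint64{cloneBalances(genesis)}}
}

// ApplyLayer applies the transactions of the next layer on top of the latest
// snapshot. Transfers the sender cannot afford are skipped.
func (l *Ledger) ApplyLayer(txs []Tx) {
	next := cloneBalances(l.snapshots[len(l.snapshots)-1])
	for _, tx := range txs {
		if next[tx.From] >= tx.Amount {
			next[tx.From] -= tx.Amount
			next[tx.To] += tx.Amount
		}
	}
	l.snapshots = append(l.snapshots, next)
}

// Revert rewinds to the snapshot of lastGood (a previously verified layer) and
// reapplies the corrected contents of the layers after it.
func (l *Ledger) Revert(lastGood int, layers [][]Tx) error {
	if lastGood < 0 || lastGood >= len(l.snapshots) {
		return errors.New("layer to rewind to is out of range")
	}
	l.snapshots = l.snapshots[:lastGood+1]
	for _, txs := range layers {
		l.ApplyLayer(txs)
	}
	return nil
}

// Balance returns an account's balance in the latest snapshot.
func (l *Ledger) Balance(account string) uint64 {
	return l.snapshots[len(l.snapshots)-1][account]
}

func main() {
	ledger := NewLedger(map[string]uint64{"alice": 10})
	ledger.ApplyLayer([]Tx{{From: "alice", To: "bob", Amount: 5}})
	ledger.ApplyLayer([]Tx{{From: "bob", To: "carol", Amount: 3}})
	// Suppose self-healing decides layer 1 is empty: rewind and reapply.
	err := ledger.Revert(0, [][]Tx{nil, {{From: "bob", To: "carol", Amount: 3}}})
	fmt.Println(err, ledger.Balance("alice"), ledger.Balance("carol"))
}
```

In the example, invalidating the first layer also invalidates the dependent transfer in the second layer, which is why the reapply step has to re-execute later layers rather than patch individual balances.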

Stakeholders and reviewers

TBD

Testing and performance

TBD

Note that the slow Tortoise is, well, slow and inefficient. There are several proposals for ways to make it more efficient. One option is to add encoding checkpoints.

Self-healing #46