Redesign layer status to match new consensus mechanisms

lrettig commented 3 years ago

Currently a layer can have one of three statuses:

https://github.com/spacemeshos/api/blob/105249951c66561cfc52195433ecae9cd5a121ff/proto/spacemesh/v1/types.proto#L112-L116

These statuses no longer map to how our consensus mechanisms actually work. Here's a better, more accurate design:

unspecified
pending: layer is still syncing, or new, or otherwise waiting to be processed/validated
analyzing: hare is currently running for this layer
invalidated: hare failed to run for this layer, and the node has decided it's empty (all blocks are invalid); or, hare succeeded and confirmed an empty layer
tentative: hare succeeded running for the layer, and a node thus has weak confidence that all other nodes agree on its contents
stuck: tortoise tried to verify this layer, but couldn't (because hare hasn't finished yet, or because the global opinion on the layer is abstain and the layer isn't old enough yet to try healing)
skipped: tortoise tried and failed to verify this layer, but moved on and verified a later layer (see https://github.com/spacemeshos/go-spacemesh/issues/2403)
confirmed: tortoise succeeded in verifying the layer
applied: layer state transitions have been applied, receipts generated, etc.
healing: even after a layer is confirmed and its state has been applied, in rare cases, a node may need to re-apply the layer as part of a self-healing process
final: at some point we may want to go one step further and say that the layer is totally final and its state can no longer be updated, even by self-healing (I'm not sure when or if we can say this)

We may not want to surface all of these possible statuses via the API, and this list is not precisely MECE as there is some overlap, but it's reasonably comprehensive.
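
As a rough illustration, the list above could be mirrored in Go along these lines. This is only a sketch: the type name, constant names, and numeric values are hypothetical and do not come from the existing API.

```go
package main

import "fmt"

// LayerStatus is a hypothetical Go mirror of the statuses listed above.
// The names and numeric values are illustrative only.
type LayerStatus int

const (
	Unspecified LayerStatus = iota // zero value, never set explicitly
	Pending                        // syncing, new, or waiting to be processed
	Analyzing                      // hare is running for this layer
	Invalidated                    // hare failed, or confirmed an empty layer
	Tentative                      // hare succeeded
	Stuck                          // tortoise could not verify the layer yet
	Skipped                        // tortoise moved on to a later layer
	Confirmed                      // tortoise verified the layer
	Applied                        // state transitions applied
	Healing                        // being re-applied by self-healing
	Final                          // state can no longer change
)

var statusNames = [...]string{
	"unspecified", "pending", "analyzing", "invalidated", "tentative",
	"stuck", "skipped", "confirmed", "applied", "healing", "final",
}

// String returns the lowercase name used in the list above.
func (s LayerStatus) String() string {
	if s < 0 || int(s) >= len(statusNames) {
		return "unknown"
	}
	return statusNames[s]
}

func main() {
	for s := Unspecified; s <= Final; s++ {
		fmt.Printf("%d %s\n", int(s), s)
	}
}
```

Keeping the zero value reserved for "unspecified" means an uninitialized status never looks like a real one.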

avive commented 3 years ago

We also need to think about transaction statuses. It is my understanding that during self-healing, no other data is canonical until the self-healing is complete. So a transaction in a block which is in a layer that is healing will also need to have a tentative state: perhaps it is healing, or perhaps it is tentative. We need to carefully consider the minimum new set of states that gives users a clue about the state of the network without introducing too many states, as these are very confusing even for technical people. And the states need to cover all mesh entities, not just layers.

lrettig commented 3 years ago

To be clear, transactions obviously do not have an independent status - they derive their status from the status of their block and layer.

> while in self-healing, no other data is canonical until the self healing is complete

What makes self-healing complex, in this context, is that it can invalidate a previously valid block (or vice-versa). So we could have blocks (and transactions) that are "approved" and applied to state, then reverted later. That's why I suggested introducing a "final" status, but we'll have to discuss with @tal-m the threshold beyond which we could apply this.

avive commented 3 years ago

We need to refine this and find a minimal MECE set. For example, why do we need unspecified if we have pending? Obviously we need to find a balance between being descriptive and informative and not confusing users with too many states. I think 7 is the magic number here: above it, most people will find the states overwhelming and overly complex. For example, if stuck is a temporary state, then it can also be reported as pending. One thing to consider is to treat all of the states proposed above as pending until the layer is verified by tortoise, and maybe provide more detailed hare-related status in the debugging api service.

Here's a minimalistic proposal for 3 high-level states for layer, block and tx (same states for all 3 entities):

Pending - not yet verified by tortoise. Includes unspecified, and the case where the node has determined that it needs to self-heal in order to verify the layer.
Verified - tortoise verified.
Confirmed - verified and state applied (txs executed).
Hare-related statuses: in the debugging api service, for tests.
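
To make the collapsing concrete, here is a minimal sketch of how the detailed statuses from the first comment might fold into these three states. The function and type names are hypothetical, and mapping healing to pending is an assumption based on the description of Pending above. Statuses that the thread does not clearly place (invalidated, skipped, final) are deliberately left unmapped.

```go
package main

import "fmt"

// HighLevel is a hypothetical type for the three proposed states.
type HighLevel string

const (
	HLPending   HighLevel = "pending"
	HLVerified  HighLevel = "verified"
	HLConfirmed HighLevel = "confirmed"
)

// Collapse maps a detailed status name from the earlier list to one of the
// three high-level states. It returns ok=false for statuses whose placement
// is not settled in the discussion.
func Collapse(detailed string) (hl HighLevel, ok bool) {
	switch detailed {
	case "unspecified", "pending", "analyzing", "tentative", "stuck", "healing":
		// Everything before tortoise verification counts as pending.
		// Treating "healing" as pending is an assumption.
		return HLPending, true
	case "confirmed":
		// "confirmed" in the detailed list means tortoise verified.
		return HLVerified, true
	case "applied":
		// "applied" in the detailed list means state was applied.
		return HLConfirmed, true
	default:
		return "", false
	}
}

func main() {
	for _, d := range []string{"analyzing", "confirmed", "applied", "skipped"} {
		hl, ok := Collapse(d)
		fmt.Println(d, hl, ok)
	}
}
```

Note the naming shift: "confirmed" in the detailed list corresponds to Verified here, and "applied" corresponds to Confirmed.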

lrettig commented 3 years ago

> why do we need unspecified if we have pending?

This is a quirk of how gRPC works (and Go): there needs to be a default value other than pending so we know whether or not the value has been initialized correctly. It doesn't need to be exposed to the user (if it is, that's a bug).
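
The zero-value behaviour can be shown with a small Go sketch. The `Layer` struct and `IsInitialized` helper are hypothetical stand-ins for a generated message type, not part of the real API.

```go
package main

import "fmt"

// LayerStatus mimics an enum whose zero value is reserved for "unspecified".
type LayerStatus int32

const (
	LayerStatusUnspecified LayerStatus = 0 // what an unset field looks like
	LayerStatusPending     LayerStatus = 1
)

// Layer is a hypothetical message carrying a status field.
type Layer struct {
	Number uint32
	Status LayerStatus
}

// IsInitialized reports whether the status was explicitly set.
func (l Layer) IsInitialized() bool {
	return l.Status != LayerStatusUnspecified
}

func main() {
	var unset Layer // Status is 0 because nobody assigned it
	set := Layer{Number: 7, Status: LayerStatusPending}
	fmt.Println(unset.IsInitialized(), set.IsInitialized())
}
```

If pending were the zero value instead, a layer whose status was never filled in would be indistinguishable from one that is genuinely pending.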

avive commented 3 years ago

So how about:

enum LayerStatus {
    LAYER_STATUS_UNSPECIFIED = 0; // unknown
    LAYER_STATUS_PENDING = 1;     // not yet approved or confirmed
    LAYER_STATUS_APPROVED = 2;    // approved by hare
    LAYER_STATUS_VERIFIED = 3;    // approved by tortoise
    LAYER_STATUS_CONFIRMED = 4;   // confirmed by tortoise and state applied
}

So each state adds confidence in confirmation compared to the one before it, and the last one is the maximum level of confirmation we have in our system. We still have the open question of whether a verified layer can move back to pending due to self-healing.
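
Because the values increase with confidence, a client could compare statuses numerically. The following Go sketch assumes constants that mirror the enum above; the `AtLeast` helper is hypothetical. It does not address the open question about moving back to pending.

```go
package main

import "fmt"

// LayerStatus mirrors the proposed enum; the Go names are illustrative.
type LayerStatus int32

const (
	LayerStatusUnspecified LayerStatus = 0
	LayerStatusPending     LayerStatus = 1
	LayerStatusApproved    LayerStatus = 2
	LayerStatusVerified    LayerStatus = 3
	LayerStatusConfirmed   LayerStatus = 4
)

// AtLeast reports whether s carries at least as much confidence as want.
// An unspecified status never qualifies, since it means "not set".
func (s LayerStatus) AtLeast(want LayerStatus) bool {
	return s != LayerStatusUnspecified && s >= want
}

func main() {
	fmt.Println(LayerStatusConfirmed.AtLeast(LayerStatusVerified))
	fmt.Println(LayerStatusApproved.AtLeast(LayerStatusVerified))
}
```

A wallet could, for example, treat a transaction as spendable only once its layer is `AtLeast(LayerStatusVerified)`.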

lrettig commented 3 years ago

we definitely need an "invalid" status, for blocks that were marked contextually invalid (by hare OR by tortoise)
don't we want a "final" status as well?

avive commented 3 years ago

Regarding invalid - I thought we were talking about layer statuses. Yes, I guess some blocks known to a node can be invalid if they are not in any valid layer.
Regarding final - this depends on whether self-healing can change any block in the past without limitations. If yes, then there are no final blocks; if no, then I guess there are.

lrettig commented 3 years ago

Discussed this with @tal-m today: regarding "final", we have no explicit finality. Finality will be implicit, subjective, and probabilistic, as in Bitcoin. So I think we can drop this status.

avive commented 3 years ago

So after thinking more about this, maybe we go with these high-level layer (and transaction) statuses:

enum LayerStatus {
    LAYER_STATUS_UNSPECIFIED = 0; // unknown
    LAYER_STATUS_PENDING = 1;     // not yet approved or confirmed
    LAYER_STATUS_APPROVED = 2;    // approved by hare
    LAYER_STATUS_VERIFIED = 3;    // approved by tortoise
    LAYER_STATUS_CONFIRMED = 4;   // confirmed by tortoise and state applied for txs in the layer
}

and have additional sub-statuses regarding hare in lower-level api such as debuggingServices if needed for tests.

lrettig commented 3 years ago

Add invalid to the list and I will agree with you :)

avive commented 3 years ago

> Add invalid to the list and I will agree with you :)

How is it different from LAYER_STATUS_UNSPECIFIED?

lrettig commented 3 years ago

> > Add invalid to the list and I will agree with you :)
>
> How is it different from LAYER_STATUS_UNSPECIFIED?

I explained here. Individual blocks can be invalidated by hare or by tortoise. An entire layer can also be invalidated, e.g., if hare fails completely for that layer, which means that all of the blocks in the layer are marked invalid. Technically we can "verify" or "confirm" an empty layer, so I guess maybe we don't need a separate INVALID status. Do we need an EMPTY status? It can be implied by the nonexistence of any block data in the layer, as long as downstream clients know how to interpret and display empty layers.
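
One way a downstream client might infer emptiness, sketched in Go. The `Block` and `Layer` types, the `Valid` flag, and the `Label` helper are all hypothetical, and treating a layer with no valid blocks as empty is an assumption.

```go
package main

import "fmt"

// LayerStatus mirrors the proposed enum; the Go names are illustrative.
type LayerStatus int32

const (
	LayerStatusUnspecified LayerStatus = 0
	LayerStatusPending     LayerStatus = 1
	LayerStatusApproved    LayerStatus = 2
	LayerStatusVerified    LayerStatus = 3
	LayerStatusConfirmed   LayerStatus = 4
)

var statusNames = map[LayerStatus]string{
	LayerStatusUnspecified: "unspecified",
	LayerStatusPending:     "pending",
	LayerStatusApproved:    "approved",
	LayerStatusVerified:    "verified",
	LayerStatusConfirmed:   "confirmed",
}

// Block is a hypothetical block with a contextual validity flag.
type Block struct {
	ID    string
	Valid bool
}

// Layer is a hypothetical layer as seen by a client.
type Layer struct {
	Status LayerStatus
	Blocks []Block
}

// IsEmpty reports whether the layer has no contextually valid blocks.
func (l Layer) IsEmpty() bool {
	for _, b := range l.Blocks {
		if b.Valid {
			return false
		}
	}
	return true
}

// Label builds a display string, marking verified or confirmed layers
// that contain no valid blocks as empty.
func Label(l Layer) string {
	name, ok := statusNames[l.Status]
	if !ok {
		name = "unknown"
	}
	if l.Status >= LayerStatusVerified && l.IsEmpty() {
		name += " (empty)"
	}
	return name
}

func main() {
	fmt.Println(Label(Layer{Status: LayerStatusVerified}))
	fmt.Println(Label(Layer{
		Status: LayerStatusConfirmed,
		Blocks: []Block{{ID: "b1", Valid: true}},
	}))
}
```

With this approach no EMPTY or INVALID enum value is needed; the client derives it from the block data.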

spacemeshos / api

Redesign layer status to match new consensus mechanisms #144