Open MonsieurNicolas opened 6 years ago
The only recovery would be to reset the instance a mechanism to "rollback" a validator to the previous ledger to resolve this issue.
This might not be a good idea. It might be better to change the validator's public key, and just pretend you deleted the validator and created a new one.
I think it would help to contextualize this discussion with some examples. I'm wondering what an example would be of an optional invariant failing and the invariant-checking mechanism then allowing a non-disastrous outcome.
> This might not be a good idea. It might be better to change the validator's public key, and just pretend you deleted the validator and created a new one.
Yes, this is probably better; I am not proposing any of those things in that section anyway (it was for illustration purposes).
> I think it would help to contextualize this discussion with some examples. I'm wondering what an example would be of an optional invariant failing and the invariant-checking mechanism then allowing a non-disastrous outcome.
All of our operation level invariants are "soft" right now (notify only), but I could see enabling them as optional invariants (default on) before enabling them as "protocol level invariant" (normal evolution of how to "promote" invariants). The other type of optional invariants that I think we'll start to see are custom invariants that people want to enforce for their token: a token issuer can enforce that only certain things can happen to their token (such as "no new tokens issued for my tokens") with their validator, and people that really care about those tokens can add those validators to their quorum set.
@MonsieurNicolas, do you want to work this into a draft? Re-open the discussion on the mailing list? Hand it off to someone else?
Introduction
This thread is for discussing how we can adopt invariants to a much higher degree than today.
Terminology and background
Invariants were added in version 9.0.0 as opt-in. Most invariants are checks enforced at the operation level. When invariants fail, an error is logged.
Hard invariants are invariants that cause the instance to crash when they fail.
Invariants are a form of insurance against bugs: assume that bugs will be encountered, and ensure that when they are, their impact is as small as possible.
Goals
Primary goal
Progressively turn all operation level invariants into "protocol level invariants" to protect the integrity of the ledger when encountering bugs network wide.
Secondary goal
Allow custom invariants to be enforced on some validators (not network wide) while preserving network liveness. This is also how "staging" of new invariants can be implemented before they become "protocol invariants".
Non-goals
Enable invariants that are not transaction or operation level invariants.
The only invariant that we have right now in that category is `BucketListIsConsistentWithDatabase`, which is already a "hard" invariant (that crashes the node) when enabled. This invariant is only enforced when applying buckets during catchup and is designed to detect corruption of the node. As this invariant is enforced outside of consensus it cannot be considered "protocol level".
Approach: make transactions "fail"
Here "protocol level invariants" would "fail" transactions with the error code `txINTERNAL_ERROR`.
This is fairly easy to implement as the functionality was already put in place to deal with unknown runtime errors without crashing the instances.
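As a rough illustration of this approach (not stellar-core's actual code), the sketch below applies a transaction to a scratch copy of the ledger and commits it only if every invariant holds; otherwise the transaction fails with `txINTERNAL_ERROR` and the ledger is left untouched. The dictionary ledger model, `total_is_conserved`, `apply_with_invariants`, `transfer` and `buggy_mint` are all hypothetical names introduced for the example.

```python
import copy
from typing import Callable, Dict, List

Ledger = Dict[str, int]                        # hypothetical model: account -> balance
Invariant = Callable[[Ledger, Ledger], bool]   # (before, after) -> True if it holds

TX_SUCCESS = "txSUCCESS"
TX_INTERNAL_ERROR = "txINTERNAL_ERROR"


def total_is_conserved(before: Ledger, after: Ledger) -> bool:
    """Hypothetical invariant: a transaction must not create or destroy balance."""
    return sum(before.values()) == sum(after.values())


def apply_with_invariants(ledger: Ledger, tx: Callable[[Ledger], None],
                          invariants: List[Invariant]) -> str:
    """Apply tx to a scratch copy and commit only if all invariants hold."""
    scratch = copy.deepcopy(ledger)
    tx(scratch)
    if all(inv(ledger, scratch) for inv in invariants):
        ledger.clear()
        ledger.update(scratch)      # commit
        return TX_SUCCESS
    return TX_INTERNAL_ERROR        # changes are discarded, the node keeps running


def transfer(src: str, dst: str, amount: int) -> Callable[[Ledger], None]:
    def tx(state: Ledger) -> None:
        state[src] -= amount
        state[dst] += amount
    return tx


def buggy_mint(dst: str, amount: int) -> Callable[[Ledger], None]:
    """Stands in for a bug that creates balance out of thin air."""
    def tx(state: Ledger) -> None:
        state[dst] += amount
    return tx


ledger = {"alice": 100, "bob": 50}
ok = apply_with_invariants(ledger, transfer("alice", "bob", 30), [total_is_conserved])
bad = apply_with_invariants(ledger, buggy_mint("bob", 1000), [total_is_conserved])
```

Because every node would run the same deterministic check, all nodes would record the same failure result for the offending transaction, which is what keeps this approach inside consensus.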
Potential issues with this solution are:
Optional invariants
Background
Optional invariants need to be enabled differently:
Approach
Optional invariants would be enabled in such a way that they would be checked after protocol level invariants (so that protocol level invariants would trigger first) and would not cause the results to differ from those of a node running without any optional invariants enabled.
The difference would be in what happens when an optional invariant fails. The validator would:
Recovering from a tripped invariant would require running a command that would reset the fuse to a clean state and would also unblock historical data.
As the node stops voting, there is a chance that the network halts if too many nodes have the same optional invariant enabled. Because the node is still watching the network and is just not voting, resetting the fuse and forcing the node to send its withheld SCP message should unblock the network.
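The fuse behaviour described above could look roughly like the following sketch. It is a toy model under stated assumptions: the `Validator` class, its fields, and the choice to keep only the most recent withheld message are hypothetical, and real SCP messages are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Validator:
    """Toy validator with an optional-invariant fuse (hypothetical model)."""
    optional_invariants: List[Callable[[dict], bool]]
    fuse_tripped: bool = False
    withheld_message: Optional[str] = None
    sent: List[str] = field(default_factory=list)

    def close_ledger(self, ledger_state: dict, scp_message: str) -> None:
        # Optional invariants never change the ledger results; they only
        # decide whether this validator keeps voting.
        if not self.fuse_tripped:
            if not all(inv(ledger_state) for inv in self.optional_invariants):
                self.fuse_tripped = True
        if self.fuse_tripped:
            # Keep watching the network, but withhold the vote.
            # Assumption: only the latest withheld message is kept.
            self.withheld_message = scp_message
        else:
            self.sent.append(scp_message)

    def reset_fuse(self) -> None:
        """Operator command: clear the fuse and release the withheld message."""
        self.fuse_tripped = False
        if self.withheld_message is not None:
            self.sent.append(self.withheld_message)
            self.withheld_message = None


# Hypothetical issuer rule: no more than 1000 units of the token may exist.
v = Validator([lambda state: state["issued"] <= 1000])
v.close_ledger({"issued": 900}, "vote-1")
v.close_ledger({"issued": 1500}, "vote-2")
tripped_before_reset = v.fuse_tripped
sent_before_reset = list(v.sent)
v.reset_fuse()
```

In this model the validator stops contributing votes as soon as the fuse trips, and `reset_fuse` is the manual step that lets the withheld message go out again.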
Other approaches that were considered
Crash the node
Crashing the node seems to be the simplest approach at first glance: it simply makes protocol invariants "hard invariants", which stop stellar-core from closing the ledger when failures are detected.
When this happens, core aborts, crashing the node and leaving it at the previous ledger (rolling back any partial changes to the database if necessary).
In addition to leaving the ledger in a "last known good" state, it also guarantees that downstream systems such as Horizon never see "bad" transactions that failed invariants.
The difference between protocol level invariants and optional invariants would just be in the number of validators that would crash, as protocol level invariants are enforced by all nodes. Optional invariants could be enabled instead as described in
The issue with this approach is that if the invariant failure affects a v-blocking set (which is the case for protocol level invariants, as they affect all nodes), the entire network crashes.
The recovery from this type of failure can be tricky as the transaction set committed by the various validators (SCP) contains a sequence of transactions that caused the invariant failure.
To recover from such a crash, a fix for the root cause of the invariant failure needs to be deployed to all validators.
Optimistically speaking, with this approach we're looking at a downtime on the order of days, which is not acceptable.
Schemes to somehow skip transactions from the transaction set therefore need to be established.
Corruption of internal state of the ledger may cause crashes (failed invariant) even though bad transactions were processed in earlier ledgers.
Skip the entire transaction set
When encountering a protocol level invariant failure, the validator would:
Issues with this approach are:
Corruption of internal state of the ledger may cause skipping of arbitrary ledgers.
Skip transactions
Instead of skipping the entire transaction set, we can imagine marking transactions that fail invariants with a new special result (`txSKIPPED`) that indicates that the transaction was skipped entirely.
When this happens, `txfeehistory` and `txhistory` would not contain information about skipped transactions (but history would) - this would isolate downstream systems from having to deal with "duplicate" transactions (as those transactions can be resubmitted in the future).
Issues with this approach are:
Corruption of internal state or of the ledger may cause skipping of arbitrary transactions.
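The `txSKIPPED` idea can be sketched as follows. This is a simplified model, not the proposed implementation: `apply_tx_set`, the dictionary ledger, and the two returned lists (one standing in for history, one for what `txhistory` would expose downstream) are assumptions made for illustration.

```python
import copy

TX_SUCCESS = "txSUCCESS"
TX_SKIPPED = "txSKIPPED"   # the new result code proposed in this section


def apply_tx_set(ledger, tx_set, invariant):
    """Apply each transaction in order, skipping those that break the invariant.

    Returns (results, tx_history):
      results    - every transaction with its result (stands in for history)
      tx_history - only applied transactions (stands in for txhistory)
    """
    results = []
    tx_history = []
    for tx_id, tx in tx_set:
        scratch = copy.deepcopy(ledger)
        tx(scratch)
        if invariant(ledger, scratch):
            ledger.clear()
            ledger.update(scratch)
            results.append((tx_id, TX_SUCCESS))
            tx_history.append(tx_id)
        else:
            # Skipped entirely: no ledger change, so the transaction could
            # be resubmitted later without looking like a duplicate.
            results.append((tx_id, TX_SKIPPED))
    return results, tx_history


def pay(src, dst, amount):
    def tx(state):
        state[src] -= amount
        state[dst] += amount
    return tx


def balances_non_negative(before, after):
    """Hypothetical invariant: no account may end up with a negative balance."""
    return all(balance >= 0 for balance in after.values())


ledger = {"alice": 10, "bob": 0}
tx_set = [
    ("t1", pay("alice", "bob", 5)),
    ("t2", pay("alice", "bob", 20)),   # would drive alice negative
    ("t3", pay("bob", "alice", 1)),
]
results, tx_history = apply_tx_set(ledger, tx_set, balances_non_negative)
```

Note that `t3` still applies after `t2` is skipped, which is the main difference from skipping the entire transaction set.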
A variation for skipping transactions is to skip those during consensus, which would avoid having to add the new `txSKIPPED` result.
The implementation would require the same type of logic as out of consensus (requiring additional validation during consensus that could be expensive) and would still be subject to the same DoS attack if done only during Ballot Protocol.
If done for all transaction sets during nomination it may work but the performance impact might be too high as extra validation is equivalent to "applying" the transaction set many times over.
Reject transactions upstream (can potentially be done with all solutions)
In order to minimize (but not eliminate) the chance of running into invariant failures during consensus, it might be possible to validate transactions by actually "applying" them (without committing) when:
For this to scale properly:
This may also help with smart contracts that submit single transactions to the network.
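A minimal sketch of validating by "applying" without committing is shown below. It assumes a hypothetical admission step in front of a pending-transaction queue; `dry_run_accepts`, `admit`, and the dictionary ledger are illustrative names and not part of stellar-core.

```python
import copy


def dry_run_accepts(ledger, tx, invariants):
    """Apply tx to a throwaway copy of the ledger and check the invariants.

    The real ledger is never modified, whatever the outcome.
    """
    scratch = copy.deepcopy(ledger)
    try:
        tx(scratch)
    except Exception:
        return False   # a transaction that raises is rejected outright
    return all(inv(scratch) for inv in invariants)


def admit(ledger, pending, tx, invariants):
    """Queue tx for consensus only if the dry run passes."""
    if dry_run_accepts(ledger, tx, invariants):
        pending.append(tx)
        return True
    return False


def spend(account, amount):
    def tx(state):
        state[account] -= amount
    return tx


def non_negative(state):
    """Hypothetical invariant: balances never go below zero."""
    return all(balance >= 0 for balance in state.values())


ledger = {"alice": 10}
pending = []
accepted_small = admit(ledger, pending, spend("alice", 5), [non_negative])
accepted_large = admit(ledger, pending, spend("alice", 50), [non_negative])
```

The dry run only reduces the chance of an invariant failure during consensus; the ledger can change between admission and application, so the check at apply time is still needed.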