MonsieurNicolas commented 6 years ago

Introduction

This thread is for discussing how we can adopt invariants to a much higher degree than today.

Terminology and background

Invariants were added in version 9.0.0 as opt-in. Most invariants are checks enforced at the operation level. When invariants fail, an error is logged.

Hard invariants are invariants that when they fail cause the instance to crash.

Invariants are a form of insurance against bugs: assume that bugs will be encountered, but when it happens ensure that the impact of those bugs is as small as possible.

Goals

Primary goal

Progressively turn all operation level invariants into "protocol level invariants" to protect the integrity of the ledger when encountering bugs network wide.

Secondary goal

Allow custom invariants to be enforced on some validators (not network wide) while preserving network liveness, this is also how "staging" of new invariants can be implemented before invariants become "protocol invariants".

Non-goals

Enable invariants that are not transaction or operation level invariants.

The only invariant that we have right now in that category is BucketListIsConsistentWithDatabase, which is already a "hard" invariant (that crashes the node) when enabled. This invariant is only enforced when applying buckets during catchup and is designed to detect corruption of the node. As this invariant is enforced outside of consensus it cannot be considered "protocol level".

Approach: make transactions "fail"

Here "protocol level invariants" would "fail" transactions with the error code txINTERNAL_ERROR.

This is fairly easy to implement as the functionality was already put in place to deal with unknown runtime errors without crashing the instances.

Potential issues with this solution are:

if an operation corrupts core's internal state of ledger yet causes a different transaction to fail invariants, the "blame" goes to the wrong transaction.
- This may sound like a far fetched example, but bugs in caching subsystems will tend to manifest themselves like this.
- A potential (weak) mitigation for this is to
  - clear all caches at periodic intervals - for example, clear caches when closing a ledger.
  - monitor (alarm) when an invariant trips (monitoring should be enabled regardless)
  - assume that the chance of running into such a situation is low enough that we don't have to address this directly
the transaction failure is recorded in history, this is a problem if the invariant failure didn't blame the right transaction.

Optional invariants

Background

Optional invariants need to be enabled differently:

If enabled to fail transactions, the failure mode would not be desirable as the local node would corrupt their local state (compared to the rest of the network), and
- publish this "1 ledger fork" to downstream systems
- would fall out of sync on the next SCP round as it would not agree with the "previous ledger hash" (causing the node to stop contributing to the network)
the bad ledger would also be published to downstream systems such as Horizon (this is a general problem with how to deal with forks).
The only recovery would be to reset the instance a mechanism to "rollback" a validator to the previous ledger to resolve this issue.
If enabled as "hard invariants", there is a risk of crashing the entire network.

Approach

Optional invariants would be enabled in such a way that optional invariants would be checked after protocol level invariants (so that protocol level invariants would trigger first) and would not cause the results to be different from a node not running without any optional invariants enabled.

The difference would be in what happens when an optional invariant fails, the validator would:

trip a "fuse" persisted in its database (as it closes the ledger)
with the fuse blown, it would
- report in metrics and the info endpoint that an invariant failed
- stop sending SCP messages
- stop publishing to txhistory and related tables and instead only publish into "holding tables"
  - during normal operation, the flow would be to publish to holding tables and publish to txhistory (from holding) delayed by a ledger. Note: during normal operating mode, transactions get therefore confirmed not in 0-5 seconds as in now, but in 5-10 seconds (delayed by 1 ledger).
  - this leverages the same tech than the one used to mitigate the effect of a "1 ledger fork" type of situation
  - alternate solution here could be to mark historical entries (ledger headers, txhistory, etc) with a new property (flag "invariant_failed" that indicates downstream systems to potentially halt ingestion) but
    - this would require the pub/sub semantics to change as to allow publishing data for duplicate ledgers (when clearing the fuse)
    - would require downstream systems to deal with bad data
- continue to perform its duties wrt history archive as normal

Recovering from a tripped invariant would require running a command that would reset the fuse to a clean state and would also unblock historical data.

As the node stops voting there is a chance that the network halts if too many nodes had the same optional invariant enabled - as the node is actually watching the network but just not voting, resetting the fuse and forcing the node to send its SCP message (that was withheld) should allow to unblock the network.

Other approaches that were considered

Crash the node

Crashing the node seems to be the simplest approach at first glance: it's simply makes protocol invariants "hard invariants" which stops stellar-core from closing the ledger when failures are detected.

When this happens, core aborts, crashing the node and leaving it at the previous ledger (rolling back any partial changes to the database if necessary).

In addition to leaving the ledger in a "last known good" state, it also guarantees that downstream systems such as Horizon never see "bad" transactions that failed invariants.

The difference between a protocol level invariants and optional invariants would just be in the scope of the number of validators that would crash as protocol level invariants are enforced by all nodes. Optional invariants could be enabled instead as described in

Issue with this approach is that if the invariant failure affects a v-blocking set (which is the case for protocol level invariants as it effects all nodes), the entire network is crashed.

The recovery from this type of failure can be tricky as the transaction set committed by the various validators (SCP) contains a sequence of transactions that caused the invariant failure.

To recover from such a crash, a fix for the root cause of the invariant failure needs to be deployed to all validators.

Optimistically speaking with this approach we're looking at a downtime in the order of days which is not acceptable.

Schemes to somehow skip transactions from the transaction set therefore need to be established.

Corruption of internal state of the ledger may cause crashes (failed invariant) even though bad transactions were processed in earlier ledgers.

Skip the entire transaction set

When encountering a protocol level invariant failure, the validator would:

Rollback any changes (if necessary) made by applying transactions (including ledger header)
Construct an empty TransactionResultSet to indicate that the ledger didn't apply transactions
Apply upgrades (if any)
Trigger a new ledger right away
"ban" transactions that were included in the transaction set for the next X ledgers
- Banning transactions means not voting for transaction sets that include those transactions

Issues with this approach are:

An attacker can DoS the network by constantly submitting transactions that violate invariants
Complexity in properly implementing "ban"

Corruption of internal state of the ledger may cause skipping of arbitrary ledgers.

Skip transactions

Instead of skipping the entire transaction set, we can imagine marking transactions that fail invariants with a new special result (txSKIPPED) that indicates that the transaction was skipped entirely.

When this happens, txfeehistory and txhistory would not contain information about skipped transactions (but history would) - this would isolate downstream systems from having to deal with "duplicate" transactions (as those transactions can be resubmitted in the future).

Issues with this approach are:

An attacker can DoS the network by submitting bad transactions with a high fee that use up all "slots" in the transaction set, knowing that the fee would not be processed
Complexity in properly implementing "ban"
Complexity from having to redo fee processing and apply all other transactions (basically rollback, redo everything after removing transactions)
Need to reconcile skipped transactions with consensus

Corruption of internal state or of the ledger may cause skipping of arbitrary transactions.

A variation for skipping transactions is to skip those during consensus which would avoid having to add the new txSKIPPED result.

The implementation would require the same type of logic than out of consensus (requiring additional validation during consensus that could be expensive) and would still be subject to the same DoS attack if done only during Ballot Protocol.

If done for all transaction sets during nomination it may work but the performance impact might be too high as extra validation is equivalent to "applying" the transaction set many times over.

Reject transactions upstream (can potentially done with all solutions)

In order to minimize (but not eliminate) the chance of running into invariant failures during consensus, it might be possible to validate transactions by actually "applying" them (without committing) when:

They get submitted by the "tx" end point
Encountering a transaction for the first time (when receiving a transaction set during consensus for example)

For this to scale properly:

Transaction validation would have to be run in parallel
Transaction validity would have to be cached until the next ledger close (both positive and negative)

This may also help with smart contracts that submit single transactions to the network .

stanford-scs commented 6 years ago

The only recovery would be to reset the instance a mechanism to "rollback" a validator to the previous ledger to resolve this issue.

This might not be a good idea. It might be better to change the validator's public key, and just pretend you deleted the validator and created a new one.

I think it would help to contextualize this discussion with some examples. I'm wondering what an example would be of an optional invariant failing and the invariant-checking mechanism then allowing a non-disastrous outcome.

MonsieurNicolas commented 6 years ago

This might not be a good idea. It might be better to change the validator's public key, and just pretend you deleted the validator and created a new one.

Yes this is probably better; I am not proposing any of those things anyways in that section (it was for illustration purpose)

I think it would help to contextualize this discussion with some examples. I'm wondering what an example would be of an optional invariant failing and the invariant-checking mechanism then allowing a non-disastrous outcome.

All of our operation level invariants are "soft" right now (notify only), but I could see enabling them as optional invariants (default on) before enabling them as "protocol level invariant" (normal evolution of how to "promote" invariants). The other type of optional invariants that I think we'll start to see are custom invariants that people want to enforce for their token: a token issuer can enforce that only certain things can happen to their token (such as "no new tokens issued for my tokens") with their validator, and people that really care about those tokens can add those validators to their quorum set.

theaeolianmachine commented 5 years ago

@MonsieurNicolas, do you want to work this into a draft? Re-open the discussion on the mailing list? Hand it off to someone else?

stellar / stellar-protocol

Enabling invariants at the protocol level #125

Introduction

Terminology and background

Goals

Primary goal

Secondary goal

Non-goals

Approach: make transactions "fail"

Optional invariants

Background

Approach

Other approaches that were considered

Crash the node

Skip the entire transaction set

Skip transactions

Reject transactions upstream (can potentially done with all solutions)