veres-one / veres-one-validator

A ledger validator that accepts either signatures or proof of work

State machine errors that may be due to temporary conditions must halt state calculation #5

Open dlongley opened 6 years ago

dlongley commented 6 years ago

See: https://github.com/digitalbazaar/bedrock-ledger-node/issues/25

mattcollier commented 4 years ago

This is what was stated in the ledger-node issue:

This may need to be filed under specific consensus algorithms instead (or as well)... but essentially, if operations are ignored due to validation errors, those errors must have a characteristic such that they will always happen regardless of external factors such as database or network access failures. Otherwise an inconsistent state could be introduced.

I'm not sure what action should be taken here. @dlongley can you make specific recommendations?

mattcollier commented 4 years ago

This is information related to operation validation on the input side of things. The records returned from the ledgerNode.records.get API are built only from operations that have achieved consensus and are correct. Invalid operations are not 'ignored' due to validation errors; they are never allowed to enter the network in the first place. A malicious or improperly implemented node cannot propagate invalid operations via gossip, because validation occurs during the gossip process.

Summary

Research indicates that transient validation errors caused by low-level storage failures may cause a new operation or gossip to be rejected, but these failures do not leave the node in a permanently inconsistent state.

New Operations

Operations arriving via HTTP on a ledgerAgent are immediately sent to the ledgerNode.operations.add API: https://github.com/digitalbazaar/bedrock-ledger-agent/blob/master/lib/http.js#L290

The ledgerNode.operations.add API immediately passes the operation through the validator, and an error is thrown if valid === false. If the error is transient in nature, the user may retry submitting the operation at a later time. https://github.com/digitalbazaar/bedrock-ledger-node/blob/b5858a97d16de9ce082fd70bdb7c111cb7842c1e/lib/LedgerNodeOperations.js#L39-L44
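To illustrate the flow described above, here is a minimal sketch, not the actual bedrock-ledger-node code. The names `makeValidator`, `addOperation`, `storage`, and `queue` are hypothetical, and the sketch is synchronous for brevity even though the real APIs are asynchronous. It assumes the validator converts a low-level storage failure into `{valid: false, error}` and that the add path throws before anything is stored.

```javascript
// Hypothetical validator factory: `storage` stands in for a low-level
// lookup that can fail transiently (for example, a database outage).
function makeValidator(storage) {
  return function validate(operation) {
    try {
      // Deterministic check: the creator must be known to the ledger.
      if (!storage.has(operation.creator)) {
        return {valid: false, error: new Error('Unknown creator.')};
      }
      return {valid: true};
    } catch (e) {
      // Assumption: low-level failures are captured and reported as a
      // validation failure rather than being rethrown.
      return {valid: false, error: e};
    }
  };
}

// Hypothetical add path: validate first, throw on failure, and only
// queue the operation once validation has passed.
function addOperation({operation, validate, queue}) {
  const result = validate(operation);
  if (result.valid === false) {
    throw result.error;
  }
  queue.push(operation);
}
```

Because the throw happens before `queue.push`, a rejected operation leaves no trace on the node in this sketch, and the caller can resubmit the same operation once the underlying failure has cleared.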

Continuity Gossip

During Continuity gossip, operations travel through the addBatch API https://github.com/digitalbazaar/bedrock-ledger-consensus-continuity/blob/d37411b5d6254c7492af575812e8a62e8ab6ac16/lib/agents/gossip-agent.js#L76-L77

In the addBatch API, the ledgerNode validate API may return {valid: false, error} for a variety of reasons. The error could be a genuine validation error or an error captured from another low-level API. Validator code is not designed to throw, but that may occur in some rare instances. Whatever the case, if valid === false the addBatch API throws here: https://github.com/digitalbazaar/bedrock-ledger-consensus-continuity/blob/d37411b5d6254c7492af575812e8a62e8ab6ac16/lib/events.js#L143-L144

An error thrown in the addBatch API is caught here, which results in the termination of the gossip session. https://github.com/digitalbazaar/bedrock-ledger-consensus-continuity/blob/d37411b5d6254c7492af575812e8a62e8ab6ac16/lib/agents/gossip-agent.js#L109-L114
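The throw-and-catch behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the continuity implementation: `addBatch` and `runGossipSession` here are simplified, synchronous, hypothetical stand-ins, and the choice to validate the whole batch before storing anything is an assumption made to keep the sketch simple.

```javascript
// Simplified stand-in for a batch add: every event is validated first,
// and the first failure aborts the whole batch by throwing.
function addBatch({events, validate, store}) {
  for (const event of events) {
    const result = validate(event);
    if (result.valid === false) {
      throw result.error;
    }
  }
  // Assumption: nothing is stored until the entire batch has validated.
  for (const event of events) {
    store.push(event);
  }
}

// Simplified stand-in for a gossip session: any error thrown while
// adding the batch ends the session and is reported to the caller.
function runGossipSession({peerId, events, validate, store}) {
  try {
    addBatch({events, validate, store});
    return {peerId, done: true, error: null};
  } catch (error) {
    return {peerId, done: false, error};
  }
}
```

In this sketch a session result with `done: false` is the signal that a caller could use to back off the peer, and `store` is unchanged whenever that happens.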

When a gossip session terminates due to an error like this, the gossip peer is backed off and the gossip operation is retried later at increasing intervals. If the gossip was initially rejected due to a transient failure on the local node, it should succeed on a later attempt, possibly after intervention by a SysAdmin.
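As a rough illustration of retrying a peer at increasing intervals, here is a small sketch. The thread does not say how the intervals grow, so the doubling schedule, the cap, and the `PeerBackoff` class with its `baseMs` and `maxMs` parameters are all assumptions made for the example rather than the continuity algorithm's actual policy.

```javascript
// Hypothetical per-peer backoff tracker. Each consecutive failure
// doubles the wait (an assumed schedule), up to a maximum.
class PeerBackoff {
  constructor({baseMs = 1000, maxMs = 60000} = {}) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
    this.failures = new Map(); // peerId -> consecutive failure count
  }

  // Record a failed gossip session and return the delay before retrying.
  recordFailure(peerId) {
    const count = (this.failures.get(peerId) || 0) + 1;
    this.failures.set(peerId, count);
    return this.delayFor(peerId);
  }

  // A successful session clears the peer's backoff state.
  recordSuccess(peerId) {
    this.failures.delete(peerId);
  }

  // Delay in milliseconds for the peer's current failure count.
  delayFor(peerId) {
    const count = this.failures.get(peerId) || 0;
    if (count === 0) {
      return 0;
    }
    return Math.min(this.baseMs * 2 ** (count - 1), this.maxMs);
  }
}
```

With the default parameters, three consecutive failures for the same peer would yield delays of 1000, 2000, and 4000 milliseconds, and a later success resets the delay to zero.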