Improve node shutdown/restart robustness

I have spent a few days digging into the low level process of how we handle the "transactional" txhashset extension.

We now have a better understanding of exactly what the issue is when we see "corrupted" data on startup, and a proposed solution.

Quick overview of the steps involved (processing a new full block, assumes header already processed for simplicity) -

begin lmdb transaction (aka batch)
begin an lmdb "child" transaction
begin txhashset extension (in memory)
rewind and apply fork
1. no-op if block is "next" block w.r.t current chain head
2. calculate "fork point" (header where chain diverges on fork)
3. rewind mmrs to fork point (in memory)
4. apply all necessary block to put forked chain in correct state
5. apply new block
if fork is less total work then main chain
1. discard mmr changes (discard in memory changes)
2. discard (rollback) the child transaction
  1. output_pos index is reverted etc.
3. commit the outer transaction
  1. save block to db (needed if we subsequently process this fork again)
if fork increases the total work (main chain extended or in reorg scenario)
1. sync updated mmr files to disk
  1. sync output mmr
    1. truncate hash file
    2. append new bytes to hash file, flush to disk
    3. truncate data file
    4. append new bytes to data file, flush to disk
    5. rewrite the "leaf_set" file, flush to disk
  2. sync rangeproof mmr (see output mmr above)
  3. sync kernel mmr (see output mmr above, excluding leaf_set)
2. commit the child transaction (changes rolled into outer transaction)
  1. output_pos index updated
  2. various block specific data saved to disk
  3. chain head updated
3. commit outer transaction
  1. save block to db
  2. save all child transaction updates to db

The key points from above are that we do the following -

start a db transaction
truncate, extends and flush various files to disk
commit the db transaction

We attempt to be as "atomic" as possible but the various file operations (truncate and flush) are not transactional.

Our issues arise when we experience an unclean shutdown during step (6) above. We attempt to minimize the chance of corruption in various ways but once we start truncating files on disk and/or appending new bytes to them we must complete this entire process successfully (up to and including committing the outer lmdb transaction) or we risk corrupting the data on disk.

[tbd - describe proposed "checkpoint" solution]

The (P)MMR data structures are append-only. We can prune them (for output and rangeproof MMRs) which removed contiguous chunks of bytes. But we never overwrite or update old data. It is immutable, with one exception, we support "rewind" which can rewrite recent data (but only recent data).

Note: During rewind we determine the "fork point", the point where the fork diverges from the current main chain. All data after the fork point is subject to change, it will be rewritten. All data before the fork point is immutable within the context of this single block being processed.

This gives us a known block (header) where earlier data is guaranteed to be immutable. From this header we can determine byte locations (via mmr sizes in the header) for all hash and data files in the various MMRs. We can guarantee data cannot be corrupted prior to this point.

In the event of needing to recover from corrupted data we can revert to this "checkpoint", effectively resetting the chain to the "fork point" and reprocess blocks from this point forward.

In the common case the fork point matches the current chain head. Recovery in this situation is to discard any data partially written to the MMR files. In the less common fork/reorg scenario we will potentially discard one or more full blocks of data that may have been partially rewritten, allowing us to reprocess full blocks safely from this reverted state.

[todo - describe why the leaf_set remains an ugly issue]

mimblewimble / grin

Improve node shutdown/restart robustness #3267