kadena-io / chainweb-node

Chainweb: A Proof-of-Work Parallel-Chain Architecture for Massive Throughput
https://docs.kadena.io/basics/whitepapers/overview
BSD 3-Clause "New" or "Revised" License

Pact DB: checksumming, tx types, compaction #1451

Open sirlensalot opened 2 years ago

sirlensalot commented 2 years ago

EDIT: this is being worked on in #1461. The following description is out of date; see later comments.


Omnibus issue for strategic upgrade to Chainweb Pact database to support:

  1. Verifiable compaction. This mainly means removing journal rows from versioned tables, plus smaller prunes of SYS_pacts and the checkpointer system tables.

  2. Block/table/tx/row SHA3 checksumming.

Checksumming

Checksums are built from row-level checksums that are combined to produce table-level checksums that are further combined to produce a global block-level checksum. All three values are cached to support various optimizations as described below.

The checkpointer versioned table data model stores row data by key and an increasing txid value. This means that, given the set of all keys K, the "current state" is indexed by the set {(k, max-txid_k) | k ∈ K}, where max-txid_k is the highest txid stored for k. This is called the active key set.

To support rewinds, the checkpointer caches the highest txid at each block height, known as the "last txid". Rewinding is accomplished by simply deleting all rows with txids higher than the "last txid" of the target block height, which restores the active key set as of that block height. The "last txid" for a given historical block height can thus also be used to compute the active key set as of that block height, by constraining each key's max-txid_k to be less than or equal to the block's last txid.
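For concreteness, a sketch of that query against a single versioned table, using the rowkey/txid column names that appear in the script below; :last_txid stands in for the cached "last txid" of the target block height:

-- Active key set of one versioned table as of a target block height:
-- for each key, the highest txid at or below the block's "last txid".
SELECT rowkey, MAX(txid) AS max_txid
  FROM [coin_coin-table]
 WHERE txid <= :last_txid
 GROUP BY rowkey;

Keys first written after the target height drop out naturally, since all of their rows have txid greater than :last_txid.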

Row checksums

Checksums for the active key set are computed from row-level checksums that capture the row data, key and txid value. Any change to the active key set therefore changes the resulting checksum, so the row checksums must be cached/stored to avoid recomputing them for every active key in the system.

As there can be more than one (key, txid) pair for a given key within a block, the row checksum incorporates the checksum of the key at the previous txid, so that intra-block changes to that key are captured in the checksum stored for the key's max-txid_k for that block.
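A hypothetical sketch of such a cache, using the sha3() scalar from the SQLite extension mentioned below; the RowChecksums table, its columns, and the :key/:txid parameters are illustrative only, not an actual schema:

-- Hypothetical cache of row-level checksums (illustrative schema).
CREATE TABLE IF NOT EXISTS RowChecksums
  ( tablename TEXT, rowkey TEXT, txid UNSIGNED BIGINT, hash BLOB
  , PRIMARY KEY (tablename, rowkey, txid) );

-- Checksum for a newly written (rowkey, txid): hash over key, txid, row
-- data, and the cached checksum of the same key at its previous txid
-- (empty when this is the key's first row).
INSERT INTO RowChecksums (tablename, rowkey, txid, hash)
SELECT 'coin_coin-table', t.rowkey, t.txid,
       sha3( t.rowkey || '|' || t.txid || '|' || hex(t.rowdata) || '|' ||
             COALESCE((SELECT hex(p.hash) FROM RowChecksums p
                        WHERE p.tablename = 'coin_coin-table'
                          AND p.rowkey = t.rowkey
                        ORDER BY p.txid DESC LIMIT 1), '') )
  FROM [coin_coin-table] t
 WHERE t.rowkey = :key AND t.txid = :txid;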

Block and table checksums

At the block level, a checksum captures the active key set of the entire database. As the active key set changes, it would be attractive to perform an incremental computation that simply "adds" the new row checksum for (k, new-txid) and "subtracts" the checksum for (k, new-txid - 1) when new-txid > 0. However, we are not aware of a suitable algorithm/hashing technique for this.

Instead, we construct the checksum anew at each block height from all the row-level checksums for the global active key set. This can be optimized by constructing and caching intermediate checksums at the table level (that is, over the active key set restricted to the keys of a given table) and building the global value from the table checksums, since cached values can be reused for tables untouched during a given block. Even so, this is not a sufficient optimization for online use: individual table key sets can grow large enough to make computing a checksum for that table at each block prohibitive.

Checksumming/hashing is not associative, so we have to commit to either a global or a per-table construction. In the absence of an online approach it would seem that the per-table optimization is not needed. However, as will be seen, table checksumming is useful during compaction as an "on-the-go" validation, so that corruption can be detected at the table level rather than opaquely failing at the global/block level.

Checksum Implementation Notes

SQLite can be extended with SHA3 support at the scalar and aggregate level (see https://www.sqlite.org/src/file/ext/misc/shathree.c). This allows checksumming to be performed entirely within SQLite.

While there are attractive reasons to compute checksums in the Haskell runtime, given that online checksum computation is not in sight in any case, it seems more attractive to confine the work to SQLite operations only.
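As an illustration of what a table-level checksum could look like with that extension loaded, a sketch using the extension's sha3_query function over the active key set of a single table; the construction is illustrative only (a real implementation would combine cached row checksums as described above):

-- Load the SHA3 extension (built as a loadable extension from shathree.c).
.load ./shathree

-- Table-level checksum: hash the ordered result rows of the active key
-- set for one versioned table. sha3_query runs the given SQL text and
-- hashes its results.
SELECT sha3_query(
  'SELECT t.rowkey, t.txid, t.rowdata
     FROM [coin_coin-table] t
     JOIN (SELECT rowkey, MAX(txid) AS txid
             FROM [coin_coin-table] GROUP BY rowkey) m
       ON t.rowkey = m.rowkey AND t.txid = m.txid
    ORDER BY t.rowkey', 256);

The global block-level checksum would then be built from the per-table values, reusing cached results for tables not touched in the block.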

Compaction

Compaction is achieved by deleting historical rows up to some block height. This is done by computing the active key set for that block height which, as described above, consists of every (k, max-txid_k) where max-txid_k is constrained to be less than or equal to the "last txid" for that block height. Compaction then deletes, for each key, every row whose txid is strictly less than that key's max-txid_k.
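A sketch of the per-table delete, using the same :last_txid stand-in for the target block's "last txid":

-- Compact one versioned table to a target block height: for each key,
-- keep the row at max-txid_k (constrained to txid <= :last_txid) and
-- delete every older row for that key. Keys first written after the
-- target height are untouched (the correlated max is NULL for them).
DELETE FROM [coin_coin-table]
 WHERE txid < (SELECT MAX(t.txid)
                 FROM [coin_coin-table] t
                WHERE t.rowkey = [coin_coin-table].rowkey
                  AND t.txid <= :last_txid);

Note that a correlated delete of this shape runs into the performance problem described in the implementation note below; the script in the following comment sidesteps it by rebuilding each table from a temporary table instead.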

Compaction can of course target the max block height for the given chain. However this is not usable for Chainweb in the face of a reorganization. Thus, when servicing an active chainweb node, compaction should target a suitably "old" block height to allow for sufficient rewinding in case of a reorganization upon node restart.

Compaction can also delete all future rows past the key's max-txid_k. This would require a replay of the missing blocks at startup, which could be seen as a valuable, if incomplete, integrity check. It would also require rewinding all checkpointer system tables to the target block height.
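A minimal sketch of that variant for a single versioned table (the corresponding rewind of the checkpointer system tables is not shown):

-- Drop all rows written after the target block height; the missing
-- blocks must then be replayed at startup.
DELETE FROM [coin_coin-table] WHERE txid > :last_txid;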

Compaction verification

As checksums can be computed for a given block height as described above, compaction must result in the same active key set at the target block height. Thus the full compaction operation is:

  1. Compute the per-table and block-level checksums for the target block height.

  2. Delete the historical rows as described above.

  3. Recompute the checksums for the target block height and verify that they match the pre-compaction values, using the per-table checksums to localize any corruption.
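A sketch of that verification for one table, again illustrative only; 5000000 stands in for the target block's "last txid", and the same active-key-set query is hashed before and after the deletes:

-- Capture the table checksum for the target height before compaction.
CREATE TEMPORARY TABLE precompact AS
SELECT sha3_query(
  'SELECT t.rowkey, t.txid, t.rowdata
     FROM [coin_coin-table] t
     JOIN (SELECT rowkey, MAX(txid) AS txid
             FROM [coin_coin-table]
            WHERE txid <= 5000000
            GROUP BY rowkey) m
       ON t.rowkey = m.rowkey AND t.txid = m.txid
    ORDER BY t.rowkey', 256) AS hash;

-- ... compaction deletes for [coin_coin-table] run here ...

-- Recompute the identical checksum and require it to be unchanged.
SELECT CASE
  WHEN hash = sha3_query(
    'SELECT t.rowkey, t.txid, t.rowdata
       FROM [coin_coin-table] t
       JOIN (SELECT rowkey, MAX(txid) AS txid
               FROM [coin_coin-table]
              WHERE txid <= 5000000
              GROUP BY rowkey) m
         ON t.rowkey = m.rowkey AND t.txid = m.txid
      ORDER BY t.rowkey', 256)
  THEN 'ok' ELSE 'CHECKSUM MISMATCH'
END FROM precompact;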

An implementation note is that with the current schema, deletions on large tables like coin_coin-table are unacceptably slow if performed via a single DELETE FROM with CTEs to find the appropriate txid ranges for a given key.

sirlensalot commented 2 years ago

The following script (SQL driven from bash) achieves a 78% size reduction on mainnet.

#!/bin/bash

set -e

db="$1"
if [ ! -f "$db" ]; then echo "need db file"; exit 1; fi
ls -lh "$db"

for tbl in $(echo ".tables" | sqlite3 "$db"); do
    case $tbl in
        # checkpointer system tables are left untouched
        BlockHistory) ;;
        TransactionIndex) ;;
        VersionedTableCreation) ;;
        VersionedTableMutation) ;;
        *)
            echo "Compacting $tbl"
            echo -n "rows: "
            echo "select count(*) from [$tbl]" | sqlite3 "$db"
            echo -n "keys: "
            echo "select count(*) from (select distinct rowkey from [$tbl])" | sqlite3 "$db"
            echo -n "deleting ... "
            # the following was incredibly slow, DELETE FROM with CTE seems unusable on large (>100k rows) tables
            # echo "begin; with t as (select t1.rowkey,t1.txid from [$tbl] t1 where t1.txid != (select max(txid) from [$tbl] t2 where t2.rowkey = t1.rowkey group by rowkey)) delete from [$tbl] where (rowkey,txid) in (select * from t); select count(*) from [$tbl]; commit;" | sqlite3 $db

            # instead, rebuild the table from a temporary table that holds only
            # the latest (rowkey, txid, rowdata) per key
            echo "create temporary table tmp (rowkey text, txid unsigned bigint, rowdata blob);
insert into tmp select distinct rowkey, null, null from [$tbl];
update tmp set txid = (select max(c.txid) from [$tbl] c where c.rowkey=tmp.rowkey);
update tmp set rowdata = (select c.rowdata from [$tbl] c where c.rowkey=tmp.rowkey and c.txid = tmp.txid);
delete from [$tbl];
insert into [$tbl] select * from tmp;
select count(*) from [$tbl];" | sqlite3 "$db"
            ;;
    esac
done

echo -n "Vacuum ..."
echo "VACUUM;" | sqlite3 "$db"
ls -lh "$db"