zah opened this issue 5 years ago
I've considered two possible solutions for this problem:
1) Detect the insertion of duplicate keys and store them in a second table that keeps reference counts. The rationale here is that duplicate keys are rare, so we don't need to waste memory on a reference count for every DB value. The key deletion logic will also have to be modified to take this additional reference-counting table into account.
When I say "table" here, I mean using the key-value store with a new reserved key prefix. The Nimbus-specific key prefixes will have to start from some higher offset (we'll define something like a userKeyPrefixesStart constant in the Trie lib). A sketch of this refcounting scheme follows after the second option below.
2) Implement a mark-and-sweep GC and use a non-pruned DB.
I expect this to be quite slow, so perhaps it's not a practical solution. It does have certain advantages though, such as the ability to perform roll-backs and to switch to a different fork without the need for more complicated book-keeping.
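Roughly, option 1 could look like the following minimal sketch. Plain in-memory tables stand in for the kvstore, and the reserved prefix value and proc names are purely illustrative, not existing Nimbus APIs:

```nim
import std/tables

const dupRefCountPrefix = "\x01"  # hypothetical reserved prefix, below userKeyPrefixesStart

var db = initTable[string, string]()      # trie nodes: nodeHash -> node body
var refCounts = initTable[string, int]()  # only populated for duplicated keys

proc putNode(nodeHash, nodeBody: string) =
  if db.hasKey(nodeHash):
    # Duplicate insertion: bump a refcount instead of storing the node twice.
    let rcKey = dupRefCountPrefix & nodeHash
    refCounts[rcKey] = refCounts.getOrDefault(rcKey, 1) + 1
  else:
    db[nodeHash] = nodeBody

proc delNode(nodeHash: string) =
  let rcKey = dupRefCountPrefix & nodeHash
  let rc = refCounts.getOrDefault(rcKey, 1)  # no entry means a single owner
  if rc > 2:
    refCounts[rcKey] = rc - 1                # several owners remain
  elif rc == 2:
    refCounts.del(rcKey)                     # back to the implicit single reference
  else:
    db.del(nodeHash)                         # last reference gone, safe to prune
```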
ping @jlokier, this is the problem I mentioned in yesterday's meeting. We need to solve this before solving the world-trie problem.
The majority of accounts are "regular" accounts and don't have a storage trie, and most contract accounts that do have a storage trie hold only a few entries. Storing something in a storage trie is expensive, and that cost translates into real-world money.
Because of the rarity of storage tries compared to "regular" accounts, I have an idea to store the whole storage trie as a single KV pair instead of as a sub-merkle-trie.
This approach is simple and reliable, and it avoids the complexity of refcounting.
We can recompute the storageRoot easily using an in-memory merkle trie and then discard all the unnecessary merkle KV pairs.
Storing the whole storage trie encoded with whatever serializer we choose (e.g. RLP/SSZ) also provides a private container for each account, so it does not interfere with other accounts even when they have an identical storageRoot.
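To make the flat-blob idea concrete, here is a minimal sketch with a naive fixed-width serializer standing in for whatever we actually pick (RLP/SSZ); recomputing the storageRoot from the decoded pairs via the in-memory merkle trie is left out:

```nim
type
  Slot  = array[32, byte]
  Value = array[32, byte]

proc encodeStorage(pairs: openArray[(Slot, Value)]): seq[byte] =
  ## Concatenate fixed-width (slot, value) records into one blob,
  ## stored as a single KV value for the account.
  for p in pairs:
    result.add p[0]
    result.add p[1]

proc decodeStorage(blob: openArray[byte]): seq[(Slot, Value)] =
  ## Split the blob back into (slot, value) records, e.g. to rebuild the
  ## in-memory trie and recompute the storageRoot when it is needed.
  var i = 0
  while i + 64 <= blob.len:
    var slot: Slot
    var val: Value
    for j in 0 .. 31:
      slot[j] = blob[i + j]
      val[j] = blob[i + 32 + j]
    result.add((slot, val))
    i += 64
```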
My idea + Zahary's:
The key is prefix + account address. We can choose whatever prefix is suitable, including blockNumber if we want to preserve history; the prefix can be some sort of versioning, and of course it can also be a suffix rather than a prefix.

```
|---------------------------K---------------------|--------------------------V--------------------|
|-----------prefix----------|----address----------|[(SLOT,VAL), (SLOT,VAL), (SLOT,VAL), .........]|
| `blockNumber` + version---|
|------------------------- can be hashed ---------|
```
During the lifetime of an account it can have multiple versions of this storage, just like a fully archived merkle trie does, but it will be faster to retrieve/write, simpler to implement, and will use less storage space. Of course this is only an idea; we can explore it further, refine it and optimize it.
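A minimal sketch of how such a versioned key could be built, with big-endian fields so that keys sort in history order; the names and the in-memory table are illustrative only:

```nim
import std/tables

proc storageKey(blockNumber: uint64, version: uint32, address: string): string =
  ## prefix = blockNumber + version, big-endian, followed by the account address.
  for i in countdown(7, 0):
    result.add char((blockNumber shr (i * 8)) and 0xff)
  for i in countdown(3, 0):
    result.add char((version shr (i * 8)) and 0xff)
  result.add address

var kv = initTable[string, string]()  # stand-in for the real kvstore

proc putStorageVersion(blockNumber: uint64, version: uint32,
                       address, encodedSlots: string) =
  ## One row per (block, version, account): the value is the encoded slot list.
  kv[storageKey(blockNumber, version, address)] = encodedSlots
```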
One optimization could involve storing all versions of this trie in one big KV:

```
|---------------K-----------------|------------------------V-----------------------------------|
| address + short prefix/suffix   | version + [SLOT-VAL LIST], version + [SLOT-VAL LIST], .... |
|---------- can be hashed --------|
```

Using this approach we can further reduce the storage used by long keys, at the cost of increased decoding complexity.
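A sketch of that variant, again with hypothetical names and a naive length-prefixed framing standing in for the real serializer:

```nim
import std/tables

var kv = initTable[string, string]()  # stand-in for the real kvstore

proc appendVersion(address: string, version: uint32, encodedSlots: string) =
  ## Append one (version, slot-list) segment to the account's single value.
  var segment = ""
  for i in countdown(3, 0):                 # version, big-endian
    segment.add char((version shr (i * 8)) and 0xff)
  let len32 = uint32(encodedSlots.len)
  for i in countdown(3, 0):                 # length prefix for the slot list
    segment.add char((len32 shr (i * 8)) and 0xff)
  segment.add encodedSlots
  kv[address] = kv.getOrDefault(address, "") & segment
```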
@jangko, I think there are some excellent ideas in there.
I like the storage blob. (I think I've read someone else proposing the flat blob idea too.)
I'm inclined to agree that deduplicating (finding common storage trie hashes) and/or refcounting in the main storage database isn't worth it. It will be rare for two accounts to have the same data unless the data is small, and if it's small, deduplicating probably isn't worth it. For large data, the gas costs mean strong pressure against any two accounts duplicating the same storage. (However, I'm not entirely sure about this; who knows how much state is duplicated throughout Ethereum, so it would be good to analyse the state sometime.)
I think the strongest reason to encode the storage trie as a flat blob is database I/O performance: reading a small number of 256-bit words (plus encoding overhead) is just as fast as reading one word, and the blob is probably very compressible. Even better, it's much faster than reading multiple storage words at different hash-keys, because hash-keys in the main database mean I/O to separate database blocks. Walking the Merkle trie to get a storage word is even worse: multiple I/Os to different blocks.
However, I would like to know whether storage is consistently small. The gas cost of writing is high, but surely there are some contracts that maintain large storage written by many transactions over time. Do you have any statistics on the storage trie size distribution, and the maximum storage trie size?
I would rather use a different representation if we can't count on small storage, and I assume by default we can't. The different representation I have in mind depends on the type of kvstore (which is another area I may change). If it's a standard kvstore, then for storage that grows above some size threshold we replace the (SLOT,VAL) pairs with a layer of indirection, (SLOTGROUP_RANGE,SLOTGROUP), just for the large subranges. Just for those slotgroups, the data is stored in the same kvstore; the index columns (in K) become prefix+address+slotgroup.
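A sketch of that indirection layer, with an assumed group size and an in-memory table standing in for the kvstore (names and threshold are illustrative only):

```nim
import std/tables

const slotsPerGroup = 256'u64         # hypothetical slotgroup size

var kv = initTable[string, string]()  # stand-in for the real kvstore

proc slotGroupKey(prefix, address: string, slot: uint64): string =
  ## Key layout for large storage: prefix + address + slotgroup (big-endian).
  let group = slot div slotsPerGroup
  result = prefix & address
  for i in countdown(7, 0):
    result.add char((group shr (i * 8)) and 0xff)

proc putLargeSlot(prefix, address: string, slot: uint64, value: string) =
  ## Append the (slot, value) record to its slotgroup's own blob; accounts
  ## below the size threshold would keep the single flat blob from before.
  let key = slotGroupKey(prefix, address, slot)
  var blob = kv.getOrDefault(key, "")
  for i in countdown(7, 0):
    blob.add char((slot shr (i * 8)) and 0xff)
  blob.add value
  kv[key] = blob
```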
If it's a non-standard kvstore that guarantees efficient prefix-compression of ordered keys, then I would use the facilities of the kvstore itself to optimise the disk block layout, because a sequence of ((SLOT,VAL)...) can be represented by the database itself while clustering the account's storage. That will result in faster lookups (less block I/O) in some cases, because the database engine knows its block boundaries. RocksDB probably stores data prefix-compressed this way, but unfortunately it has other problems for our use. SQLite's B-tree doesn't do this, although B-trees can. So this idea doesn't work with our current backend stores.
I agree with many parts of the reasoning behind the two key-value tables shown.
But there's a good reason to avoid "can be hashed": a major contributor to database I/O cost is that hashing scatters the keys, so every row access becomes random DB block I/O. Part of the solution to reducing I/O overhead is clustering related things according to whatever the access pattern will be. Unfortunately there's no single access pattern, and pruning past data in particular complicates this. However, there is a general principle: we try to avoid hashing keys in our private database when there is no advantage to it.
blockNumber+version+address is a great idea in terms of clustering accesses over time. This order (unhashed) is almost ideal for pruning past data, when we can select or range-delete rows from the kvstore.
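For illustration, a sketch of why that ordering makes pruning cheap, with CritBitTree standing in for an ordered kvstore (the key encoding is an assumption, not an existing schema):

```nim
import std/critbits

var kv: CritBitTree[string]  # ordered stand-in; keys are blockNumber & version & address

proc blockPrefix(blockNumber: uint64): string =
  ## Big-endian blockNumber, so all rows of one block share a key prefix.
  for i in countdown(7, 0):
    result.add char((blockNumber shr (i * 8)) and 0xff)

proc pruneBlock(blockNumber: uint64) =
  ## Everything written at one old block sits under a single prefix,
  ## so pruning it is one prefix scan followed by deletes (a range-delete).
  var doomed: seq[string]
  for key in kv.keysWithPrefix(blockPrefix(blockNumber)):
    doomed.add key               # collect first; don't mutate while iterating
  for key in doomed:
    kv.excl key
```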
I think this representation may also be effective for historical and current account data (balance, nonce, etc.). (In other words, not using an explicit world-state trie for accounts either. This big change to the hexary trie is what I'm working on at the moment.)
However, this key order is a big problem for looking up the data or storage of an account: because we can't store the data of all accounts under a single blockNumber+version, just finding the latest row for a particular account would need a slow scan.
To find it quickly, we can swap the key columns: address+blockNumber+version. Then, given an account and an in-memory representation of the blockchain graph for blocks still within the reorg window, we can quickly find the correct row and the current state of the account. This resembles a column store. It cannot prune past data, but an index on blockNumber+version+address fixes that. Then, for I/O performance, the interesting question is which form of key is better as an index and which is better as the storage primary key, and whether they should both be indexes (as SQLite does for all tables by default). It depends on pruning costs: even if we can find the rows to delete quickly, deleting data from scattered rows can take a lot of I/O, as was found with Eth2.
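A sketch of the address-first lookup under those assumptions (isCanonical is a placeholder for the check against the in-memory blockchain graph):

```nim
import std/critbits

var byAddress: CritBitTree[string]  # keys: address & blockNumber & version

proc isCanonical(key: string): bool =
  ## Placeholder: would consult the in-memory blockchain graph / reorg window.
  true

proc latestState(address: string): string =
  ## One prefix scan over the account's rows; big-endian blockNumber and
  ## version mean plain string comparison picks the newest canonical row.
  var bestKey = ""
  for key in byAddress.keysWithPrefix(address):
    if isCanonical(key) and key > bestKey:
      bestKey = key
  if bestKey.len > 0:
    result = byAddress[bestKey]
```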
(For pruning-deletion performance, the epoch idea may help at the cost of read performance (another column in the keys), or files, or state-diff representations, an SCCS-style weave, RCS-style reverse deltas, or a Git-pack-style combination of them. There are many options on the table for optimising storage that changes slowly over time.)
When the storage is in memory, I don't think it matters what encoding is used, as long as it's fast enough and doesn't use too much memory; it only matters for encoding/decoding. During EVM execution the storage tries are cached in simple Nim tables (overlay storage). We don't want any of these changes to be persisted to the real database during execution of any EVM, until we decide to, often at the end of the block.
Sometimes, even after the end of a block, we don't want to persist storage changes yet, or even account changes. We'd rather wait for multiple blocks, because we can quickly recalculate the state of a small number of blocks if we lose it from memory due to a crash/restart. This will save a little storage space and some write I/O.
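As a sketch of the overlay behaviour described here (names are illustrative, not the actual Nimbus code):

```nim
import std/tables

var overlay = initTable[string, string]()    # (account & slot) -> value, in memory only
var persisted = initTable[string, string]()  # stand-in for the on-disk kvstore

proc sstore(accountSlot, value: string) =
  ## EVM writes land in the overlay; the database is never touched mid-execution.
  overlay[accountSlot] = value

proc commit() =
  ## Called at the chosen persistence point: end of block, or several blocks later.
  for k, v in overlay:
    persisted[k] = v
  overlay.clear()
```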
At some points we want to calculate the storage trie hash for an account, but as you point out, maybe we never need to store an actual trie for the account storage, just calculate the hash when it's needed for P2P.
There is one significant problem with not indexing storage tries fully by hash, though: the Eth/65 and earlier sync protocols depend on looking up storage trie nodes by hash, via getNodeData.
The snap sync protocol is better for this. Supporting database optimisations is an explicit goal of snap sync: The spec repeatedly says "break the generality that causes issues with database optimizations". We could commit to only support snap sync if that's helpful. Older clients won't talk to us, but current Geth will.
To support snap sync, there are some constraints on data layout which we must satisfy for the database to perform well. This may affect the key-value schema we use.
> Do you have any statistics on storage trie size distribution, and the maximum storage trie size?
Probably we need to ask the turbo-geth team about statistics; they diligently measure the database inside and out.
Transferring the issue description from https://github.com/status-im/nimbus/pull/229#event-2141830790:
Digging deeper, I finally understand the problem. And we can leave the trie as the last suspect in this case.
The problem involves two or more accounts with the same `storageRoot`. To simplify the illustration, I will use only two accounts.

During transaction `U`, account `A` was modified and ends up with `storageRoot=0xABCDEF`. During transaction `V`, account `B` was modified and also arrives at `storageRoot=0xABCDEF`.

Later on, during transaction `X`, either account `A` or `B` is modified again and ends up with `storageRoot=0xDEADBEEF`. Here the problem arises: during transaction `Y`, the account still pointing at `storageRoot=0xABCDEF` will have a corrupted/incomplete trie. The other account that previously had the same `storageRoot` has already done damage to that trie if pruning is turned on.

Perhaps this is more of a design issue.
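For illustration only, a tiny sketch of the failure mode with a hash-keyed node store and naive pruning (the node contents and names are made up):

```nim
import std/tables

var nodeStore = initTable[string, string]()  # nodeHash -> node body

proc pruneOldRoot(oldNodeHashes: openArray[string]) =
  ## Naive pruning: delete every node belonging to the replaced root.
  for h in oldNodeHashes:
    nodeStore.del(h)                 # no refcount, so shared nodes are lost too

# Accounts A and B both ended up at root R1; A then moved on to R2.
nodeStore["R1"] = "subtrie shared by A and B"
nodeStore["R2"] = "A's new storage subtrie"
pruneOldRoot(["R1"])                 # pruning A's old trie...
doAssert not nodeStore.hasKey("R1")  # ...leaves B's storageRoot pointing at nothing
```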
@zah, you know the hexary trie better than me, do you have any idea how to approach this situation? Although my little fix works, conceptually it doesn't remove the problem.