Binary tries for substrate/Towards NOMT

eskimor commented 2 months ago

It is known that our radix-16 trie does not scale too well with regards to proof size. TPS numbers are dropping (on a high level) already when reaching millions of accounts. By migrating to a binary trie, we can push this immediately by three orders of magnitude, reaching global scale. This becomes hugely relevant now as we are gaining traction and projects start pushing numbers. We should aim to have a binary trie production ready, including a migration path by 2025.

NOMT implements a binary trie and also makes sure to resolve other bottlenecks on the collator in block building, but it is unclear whether we can port it to substrate in that time frame. What we should be doing though, is move towards NOMT: It should be evaluated whether we can implement a binary trie, that is easy enough to include into substrate, that is already binary compatible with NOMT.

Considerations for this from @rphmeier :

The main thing is just to have the node format of the trie be consistent, insofar as what actually gets hashed. The database backend doesn't matter too much. But stuff like SCALE-encoding nodes is going to break compatibility with NOMT.

We have two kinds of nodes currently. Leaves and internal. Leaves always start with MSB 1 and internal nodes always start with MSB 0.

The internal hash is computed by H(child1, child2) and the leaf hash is computed by H(key, value_hash).

Since you're dealing with long keys you'd also want to add extension nodes to that format.

An extension node might be represented as H(extension_size, extension_bits). I think that we should probably limit each extension to something like 256-8 bits so you can pack the length and extension bits into a single hash-sized slot on disk and also steal another MSB from the node format to support extensions.

256-8 bits per extension should be fine; you'd only need maximum 9 extensions to cover 2KB of shared key material. I expect the actual amount of shared key material to be more like 512 bits with the current substrate format (256 bits for pallet, 256 bits for map)

Having an extremely compact format is also going to make it simpler to do stuff like ZK prove the merkle proofs in the future.

oh, last thing: the choice of hash function will be impactful for ZK proving in the long term. In October we'll start to try proving large merkle trie reads/writes with Poseidon2

Timeline

If we plan for success, we have to assume that next year there will exist teams that would benefit a lot from binary tries. Sustaining tps numbers into the billions, instead of millions can make a lot of a difference. Achieving NOMT compatibility allows an easier future upgrade, lowering hardware requirements on collators a lot and pushing possible throughput on a single parachain to new levels.

Therefore I would aim for the following:

2025 H1: Have a binary trie implementation in Substrate that is compatible with Frame. Any new parachain can already use this. 2025 H1/H2: Provide some migration plan for existing parachains to the new trie. 2026: Optimize the database/storage -> Go full NOMT.

Further Material

https://www.rob.tech/blog/nearly-optimal-polkadot-merklization/

rphmeier commented 2 months ago

An extension node might be represented as H(extension_size, extension_bits).

I wrote this offhand on Signal but there's an error since it doesn't commit to anything below it. It should probably be

H(extension, left, right).

left/right are because extensions need to have branch semantics.

extension is a 256-bit encoding with the first 8 bits being the length and the last 248 bits being the extension.

One of the downsides of extension nodes is that they do require hashing 96 bytes instead of 64, which means that you pay an additional cost vs a single hash-compression. But it's probably fine.

Alternatively you can do

H(extension, H(left, right)) to always do 64 byte hash compressions. That kind of format will probably play more nicely with ZK.

rphmeier commented 2 months ago

Sustaining tps numbers into the billions, instead of millions can make a lot of a difference. Achieving NOMT compatibility allows an easier future upgrade, lowering hardware requirements on collators a lot and pushing possible throughput on a single parachain to new levels.

I'll relay a little bit about performance that we discussed on the call as well.

The binary trie format is a trade-off. It's worse for disk accesses but better for proof size. Every time you have to fetch a trie node, you need to do a disk read. On a modern SSD at QD1 that's optimistically 40us and realistically 60us.

So if you have logic like Substrate's that traverses the trie to load values, you might expect to do log2(state_size) fetches per lookup. With 2^25 state items (~35M) that'd be 1.5ms per state read. That means you'd be able to do maximum ~650 sequential state fetches per second. Definitely not enough! For 1000TPS you need to be able to do at least 2000 state fetches per second.

The escape hatch is to rely on the OS page cache to avoid these reads from going to disk. Both database backends we use in Substrate are fine for this because ParityDB uses mmap and RocksDB uses buffered I/O, so the OS will just load larger amounts of the DB files into memory and transform an on-disk workload into an in-memory workload.

That does mean that you will need nodes to run with enough RAM to essentially store the first N levels of the trie reliably. To get into the billions you would need at least 64GB of RAM.

The downside of buffered I/O and mmap is that the OS page cache carries some overhead in kernel space and it's difficult to achieve maximum IOPS with it activated. See https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/#io-configuration---direct-io . If you want to make use of maximum disk throughput, you can't really use it.

The proper architectural way to solve this (and what NOMT does) is to have a "fast" database based around a btree that can get you any state item in a single SSD round-trip and a "slow" database that stores merkle pages. Then parallelize the fetches of merkle pages to take advantage of the SSD throughput, but without blocking block execution.

burdges commented 2 months ago

The binary trie format is a trade-off. It's worse for disk accesses but better for proof size.

There is no reason every layer should be stored on disk, just store every 3rd or 4th layer, and recompute the missing copath elements. It's only a few extra calls of the intenal node compression function when building the copath proofs.

rphmeier commented 2 months ago

Sure, that's a viable approach and we do stuff like that in NOMT, it just comes with other engineering trade-offs and a higher lift to get started afaik @eskimor is looking for a simple modification to Substrate as it is today.

eskimor commented 1 month ago

@bkchr just reminded me, that we should also check how fast we can get our hexary trie if we assume collators can keep everything in memory. If this turns out fast enough, we would already have something to offer as soon as elastic scaling is done (to accommodate for the larger PoV size requirements).

rphmeier commented 1 month ago

Probably quite a lot faster than fetching from disk. But note that with RocksDB the OS page cache is doing quite a lot of heavy lifting, and that for small databases (<64M items) everything probably already is in memory. i.e. it's mistaken to assume that the baselines you currently have are "from disk" unless you're working with large databases.

If you have a machine with 8GB of RAM and a couple million items, it's all already probably in RAM.

alexggh commented 4 weeks ago

Some updates, I did some investigations and discussed this a bit with @cheme and @arkpar.

Trie format

We have a few paths that we can follow here:

1. Extend the currently used trie crate to support a binary trie.

This would be basically a reviving of this ancient PR: https://github.com/paritytech/trie/pull/84, and integrate it in substrate, this has a few downsides like the fact that the current trie format is not optimised for the fact that we are dealing with a binary trie.

2. Implement a binary trie, where the format is optimized for the fact it is a binary trie.

Some optimizations ideas I got from @cheme:

- don't allow value in branch child: this way all branch got two children, everything go simpler
- storage of partialkey can be put to value: this is mostly an optimization and one that can be 
   transparent to the trie format (in proof we don't want that to be compact).
- no radix (don't store partialkey in node), then the state shall target balance tree 
  (already the case in many use case), and we store the partial key only in terminal node.

Note! We can't use the Jam binary trie implementation because that has some baked-in assumptions, like entire trie fits into memory, keys are of fixed sized and uniformly distributed.

3. Glue NOMT into substrate.

This would allow us to experiment with parachains running with it, what do you thinkg @rphmeier is this too early to think about it ?

Migration

Changing the trie format is something we already did once, more details about it can be found here: https://forum.polkadot.network/t/state-trie-migration/852 & https://hackmd.io/JagpUd8tTjuKf9HQtpvHIQ, it seems to be a painful process, but something doable, so I think we can follow the same approach, for whatever we chose.

Given that migrating between different trie formats is in the realm of possibility, I think that it make sense to go for this route, rather than try to have a backwards compatible format between different implementation, I think it would be better to pay the price once at migration time, rather than forcing all the implementations to pay the price in perpetuity of keeping a format that others are compatible with.

Other considerations.

There is already some working happening for parity-db trie accesses here: https://github.com/paritytech/polkadot-sdk/pull/3386, but that wouldn't improve our PoV size story it is more about improving the CPU time.

Not sure how this would work (efficiently). We currently have very long prefixes. Are we planning to build the entire path, instead of having one node containing the common prefix and then branch from there? I guess this could work if nodes are very compact and we can fetch multiple in bulk. :thinking:

burdges commented 4 weeks ago

Changing the trie format is something we already did once .. it seems to be a painful process, but something doable

We could make storage orthogonal from the rest of frame, meaning a small trait hierarchy of storage features implemented by rust types that represent different storage types.

In on_initialize you pull the NOMT state root out of the current trie, create a NOMT type in a static mut, proove accesses in the transactions, and then put the updated NOMT root back into the current trie in on_finalize.

If we detect that we're In block building then we invoke hostcalls. Alternatively we could hopefully detect that on_initialize never ran by these static muts being None, so then transaction code would fetch and put the NOMT root from the current trie themselves.

rphmeier commented 4 weeks ago

Glue NOMT into Substrate This would allow us to experiment with parachains running with it, what do you thinkg @rphmeier is this too early to think about it

I don't think it's too early to give it a try. However, NOMT currently doesn't support arbitrary-length keys, only 32-byte ones. Substrate storage needs arbitrary (or at least bounded-length) keys in order to support iteration and migrations. This assumption isn't baked too deep into NOMT, but it does require some changes to fully support Substrate.

Our underlying BTree and merkle tree won't have any issue with variable-length keys, though.

s0me0ne-unkn0wn commented 3 weeks ago

NOMT currently doesn't support arbitrary-length keys, only 32-byte ones.

How dumb is the idea of hashing Substrate keys with a 256-bit hasher to get NOMT-compatible keys? The hasher's throughput is insane, so it shouldn't significantly affect performance.

kayabaNerve commented 3 weeks ago

Then you don't get iteration/common prefixing of related storage. At least, not without reducing the bits of security (such as via allocating 2 bytes for a prefix, 30 bytes for the hash of the key).

rphmeier commented 3 weeks ago

Yeah, that's the main downside. You can try, but it'd break many pallets and it'd also remove the benefit of many parts of Substrate as an application development product. Hashing is definitely not the bottleneck!

Substrate's storage is highly dependent on arbitrary-length key formats in order to support iteration by prefix (and ordered iteration by prefix).

Though I do think it encourages doing key prefixing in a very inefficient way: you have 16 bytes for the pallet, then 16 bytes for the storage item, and then finally the key material. It should probably be a 16-byte prefix for either the pallet or the storage map, unless you really need to iterate all the storage in the pallet for some reason. Keeping keys as short as possible is always better.

My preferred 'future tech' is to superimpose a linked-list on an unordered merkle tree, though (the linked list connects leaves by key order). Blockchain state trees (trie, IAVL, verkle, etc.) have a largely unnecessary assumption that users need to be able to perform look-ups in the state tree. Removing that assumption lets you use hardware and data much more efficiently while still supporting proofs - the main purpose of a cryptographic state tree.

burdges commented 3 weeks ago

I'd expect common prefixing of related storage would typically occur in one or maybe two layers.

You could treat NOMT as a storage of storage roots, by placing blake3 encoded blobs at the leaves, and doing edit proofs into those blobs. If you only have one layer of common prefixing, like in accounts, then this costs you little.

paritytech / polkadot-sdk