paritytech / parity-db

Experimental blockchain database

Tree column #199

Open arkpar opened 1 year ago

arkpar commented 1 year ago

Introduce a new column type, MultiTree, intended for more efficient storage of blockchain state tries.

Design goals:

  - Allow storing arbitrary trie structures in the database.
  - Each tree is identified by a root key. There may be an arbitrary number of roots.
  - The column should optimize the disk layout for tree traversal.

The API.

type NodeAddress = u64;
type Children = Vec<NodeAddress>;

NodeAddress is an opaque ID that allows for quick node retrieval. These IDs are defined by the database and will most likely be just Table::Address.

The Operation enum is extended with two additional tree operations.

pub enum Operation<Key, Value> {
    InsertTree { root: Key, node: NewNode },
    RemoveTree { root: Key },
    //...
}

pub enum NodeRef {
    New(NewNode),
    Existing(NodeAddress),
}

struct NewNode {
    data: Vec<u8>,
    children: Vec<NodeRef>, // May reference nodes created in the same commit or existing nodes.
}
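
A minimal sketch, assuming the types above, of how a caller might assemble an InsertTree operation for a two-level tree (the payload bytes and the existing child address are made up for illustration):

fn build_insert<Key>(root_key: Key, existing_child: NodeAddress) -> Operation<Key, Vec<u8>> {
    // A new leaf created in this commit.
    let leaf = NewNode {
        data: b"leaf payload".to_vec(), // arbitrary node data, e.g. an encoded trie leaf
        children: vec![],
    };
    // The new root references both the in-commit leaf and a node already on disk.
    let root = NewNode {
        data: b"branch payload".to_vec(),
        children: vec![
            NodeRef::New(leaf),                // created in the same commit
            NodeRef::Existing(existing_child), // already stored in a value table
        ],
    };
    Operation::InsertTree { root: root_key, node: root }
}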

Querying a root:

fn get_root(col: ColId, key: Key) -> Result<Option<(Vec<u8>, Children)>>;

Querying non-root nodes:

fn get_node(col: ColId, node_address: NodeAddress) -> Result<Option<(Vec<u8>, Children)>>;
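
For illustration, a depth-first traversal sketch over this proposed API (ColId, Key and the Result alias are assumed to be the column and key types used above; error handling and the actual visiting logic are elided):

fn traverse(col: ColId, root_key: Key) -> Result<()> {
    let Some((_data, children)) = get_root(col, root_key)? else {
        return Ok(()); // unknown root key
    };
    // Children are opaque NodeAddresses, so no hashing or key lookup is needed below the root.
    let mut stack: Vec<NodeAddress> = children;
    while let Some(address) = stack.pop() {
        if let Some((_data, children)) = get_node(col, address)? {
            stack.extend(children); // push grandchildren for further traversal
        }
    }
    Ok(())
}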

Internal structure.

The hash index will be used for root nodes only. Nodes are stored in value tables. For each node we store its data together with the list of child node addresses.
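
An illustrative view of what a value-table entry for a tree node could contain under this design (the exact on-disk layout is not specified here; this struct is only a conceptual sketch):

struct StoredNode {
    data: Vec<u8>,               // the payload supplied in NewNode::data
    children: Vec<NodeAddress>,  // value-table addresses of the direct children
}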

Removing trees.

There are two possible approaches to implementing tree removal:

  1. Journaling. For each root the database records all added value NodeAddresses. On deletion these are dereferenced and the journal record is deleted as well. These records may be stored in a separate value table. This should be faster overall, but may waste space for large tree sets.
  2. Reference counting. For each node we store a counter of all the parent nodes that reference it directly. When a node is deleted, we read all direct children and dereference them. The downside here is that inserting or removing a node would require updating up to 16 child nodes, potentially blowing up the IO (a rough sketch of the recursion follows this list).
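
A rough sketch of option 2, assuming hypothetical helpers decrement_ref_count and remove_node for the per-node counter and the value-table entry (illustrative pseudocode for the recursion, not parity-db internals):

fn dereference(col: ColId, address: NodeAddress) -> Result<()> {
    let remaining = decrement_ref_count(col, address)?; // hypothetical counter update
    if remaining == 0 {
        // Last parent reference gone: drop the children first, then the node itself.
        if let Some((_data, children)) = get_node(col, address)? {
            for child in children {
                dereference(col, child)?;
            }
        }
        remove_node(col, address)?; // hypothetical value-table removal
    }
    Ok(())
}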

Background optimization.

The database will periodically resettle nodes within value tables to optimize the on-disk tree layout. The idea here is that iterating a recent trie root should result in sequential IO.

Potential issues:

Data duplication. The design as proposed here has no node index. If the same value node is inserted into two different paths of the same tree, it will be stored twice. If this turns out to be an issue, we would need to add each node to the index and use reference counting. We need to collect statistics for existing chains to estimate the impact of this.

cheme commented 1 year ago

About plugging triedb into it: we currently don't always have this path: &[NodeKey]. This is mostly the case when reading a value with LookUp, but there we can assume that each sequential node query is for a child of the previous node, so there may be something simple to do at the substrate level without changing the trie db code base too much.

For the iterator, we have a stack of crumbs containing an optional hash, so it is not really doable to get &[NodeKey] memory-wise, but we could do something with a different API (an iterator or some trait over sized array access).

For triedbmut we should limit this to modifying operations, but there an access is not always a child of the previous access (when merging with another child, for instance), and we don't have an explicit stack (we use recursive calls instead). So in this case I feel it would need some changes (keeping a Vec along the recursive calls should be fine, especially as it would follow the same rules as the NibbleFullkey vec that is used to keep track of the prefix for rocksdb).

So in the end I think the easiest might be to just extend the struct that is used to get the current node prefix for rocksdb with a stack of node hashes (at least for a quick and simple implementation).
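
A rough sketch of that idea with made-up names (the actual prefix-tracking struct in the trie/substrate code is assumed; only the added field matters here):

struct NodePathTracker<H> {
    prefix_nibbles: Vec<u8>, // existing: nibble prefix tracked for rocksdb-style keys
    parent_hashes: Vec<H>,   // proposed addition: hashes of the ancestors, root first
}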

While it seems pretty clear to me how to access the db, updating the state is currently not as clear (generally what I am thinking about is ordered node payloads, so that we can write in order while filling in the addresses of a stack of parents (same for removal)). This would be easier to implement by not using triedbmut at all and just running a diff in a single iteration (I did write this in the past; it was doable, though a bit tricky with node splitting and merging). Or we could just address nodes with their path added to the hash (as when accessing with get_node), but this sounds like something heavy to keep track of in the changeset sent from the state-machine to the client.

arkpar commented 1 year ago

Sounds like quite a few changes will be required at the trie level. So it looks like it makes sense to have the node index at least initially, so that the path won't be required:

fn get_node(col: ColId, node_key: &NodeKey) -> Result<Option<(Vec<u8>, Children)>>;

Would that work? We'd still need to rebuild the trie node from Vec<u8> data and Children. How involved would that be?

Regarding commits, the order of nodes in InsertTree passed to commit is not that important. The database may reorder internally. The order in which the nodes end up on disk initially is also not that important, as we want to rearrange stuff in the background anyway. What's important is that we get parent/child references from TrieDBMut and pass them to commit. For each inserted node, TrieDBMut or substrate would need to be able to split out the children list and the rest of the node data.

cheme commented 1 year ago

We'd still need to rebuild the trie node from Vec<u8> data and Children. How involved would that be?

Not really involved I think, and we could even reuse the buffer since it would be decoded just after (I mean encoding would allow triedb to read the expected data, but triedb should ultimately just instantiate OwnedNode from the Vec and Children directly). For proof building/recording, similarly, the proof could use a different format that is closer to the data and abstracts away the codec, and building the trie node would only be done on proof checking (such as https://github.com/w3f/PSPs/pull/56).

One small thing missing here is the ability to pass the state version of the trie with the Operation: it needs to be stored on a per-node basis, but it would just be part of "the type of the node". Actually I think it is the only type information needed, as 0 children means a leaf, anything else is a branch (value presence should be cheap to check), and the empty node is a leaf without a value at depth 0. Storing the depth per node may be an idea, but I don't see it really needed here.

Another question is whether the partial key should be stored aligned, e.g. in case we have a branch with nibbles "abc" and a leaf at "d" with no partial. Do we store "abc" as "abc0", or do we pad as in the codec, "0abc"? I would tend to think it is better to pad according to where the node sits in the trie: "abc0" when the depth is even and "0d" when the depth is odd. But that is just a detail (and not really useful since queries would initially go through the encoded form of the node).

arkpar commented 1 year ago

One small thing missing here is the ability to pass the state version of the trie with the Operation: it needs to be stored on a per-node basis, but it would just be part of "the type of the node". Actually I think it is the only type information needed, as 0 children means a leaf, anything else is a branch (value presence should be cheap to check), and the empty node is a leaf without a value at depth 0.

This can be stored in NewNode::data I guess. Each node has arbitrary data attached to it.
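
A minimal sketch of that idea, assuming the state version is prefixed as a single byte inside NewNode::data (the tag values and helper name are made up for illustration):

fn encode_node_data(state_version: u8, encoded_node: &[u8]) -> Vec<u8> {
    let mut data = Vec::with_capacity(1 + encoded_node.len());
    data.push(state_version); // e.g. 0 for state version V0, 1 for V1 (assumed encoding)
    data.extend_from_slice(encoded_node);
    data
}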

Another question is whether the partial key should be stored aligned

This is not really a paritydb concern, as partial key encoding will be done at the trie level. The partial key would go in the data field as well. It is important, however, that the same NodeKey corresponds to the same node (data + children). So imagine the same branch node is used at two different branches of the trie, one at an even height and one at an odd height. The trie node encoding will be the same in both places, and the DB content should also be the same.

arkpar commented 1 year ago

Updated the description with a simpler design. Instead of NodeKey, we can expose database addresses in the API, which makes things cleaner and easier. No need for the address cache.

kogeler commented 8 months ago

@arkpar Are there any updates about this issue?

cheme commented 8 months ago

Some progress in https://github.com/paritytech/polkadot-sdk/pull/3386/ (I still have to clean it up and add tests before moving it out of draft, but the referenced paritydb patch is master (the trie patch also has substantial unmerged changes)).