arkpar opened 1 year ago
About plugging triedb into it: we currently don't always have this path as `&[NodeKey]`. This is mostly the case when reading a value with `Lookup`, but there we can assume all sequential node queries are for the next child, so there may be something simple to do at the substrate level without changing too much of the trie-db code base.
For the iterator, we keep a stack of crumbs containing an optional hash, so producing a `&[NodeKey]` is not really doable memory-wise, but we could do something with a different API (an iterator, or some trait over sized array access).
For triedbmut we should limit this to the modifying operations, but there an access is not always a child of the previous access (when merging with another child, for instance), and we don't have an explicit stack (we use recursive calls instead). So in this case I feel like it would need some changes (it just means keeping a Vec of parent node hashes alongside the recursion).
So in the end I think the easiest might be to just extend the struct that is used to get the current node prefix for rocksdb with a stack of node hashes (at least for a quick and simple implementation).
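A minimal sketch of that idea; `AccessTracker` and its field names are illustrative, not actual substrate or trie-db types:

```rust
/// Hypothetical sketch: extend the struct that currently tracks the node
/// prefix for rocksdb so it also records the hashes of the nodes on the
/// path from the root down to the node being accessed.
struct AccessTracker<H> {
    /// Nibble prefix of the node currently being accessed.
    prefix: Vec<u8>,
    /// Stack of node hashes from the root down to the parent node;
    /// pushed when descending to a child, popped when backtracking.
    parents: Vec<H>,
}
```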
While it seems pretty clear to me how to access the db, updating the state is currently not as clear (generally all I am thinking about is ordered node payloads, to be able to write in order while filling addresses through a stack of parents (same for removal)). This would be easier to implement by not using triedbmut at all and just running a diff in a single iteration (I did write this in the past; it was a bit tricky with node splitting and merging, but doable). Or we could just address nodes by their path added to the hash (as when accessing with get_node), but this sounds like something heavy to keep track of in the changeset sent from the state-machine to the client.
Sounds like quite a few changes will be required at the trie level. So it looks like it makes sense to have the node index at least initially, so the path won't be required:
```rust
fn get_node(col: ColId, node_key: &NodeKey) -> Result<Option<(Vec<u8>, Children)>>;
```
Would that work? We'd still need to rebuild the trie node from the `Vec<u8>` data and `Children`. How involved would that be?
Regarding commits, the order of nodes in `InsertTree` passed to `commit` is not that important. The database may reorder internally. The order in which the nodes end up on disk initially is also not that important, as we want to rearrange stuff in the background anyway. What's important is that we get parent/child references from `TrieDBMut` and pass them to `commit`. For each inserted node, `TrieDBMut` or substrate would need to be able to split out the children list and the rest of the node data.
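A sketch of what that split could look like, reusing the `NewNode` name mentioned later in this thread; the `NodeRef` enum, the field names, and the `NodeAddress` alias are assumptions, not a published parity-db API:

```rust
/// Placeholder for the opaque database-defined address type (see below).
type NodeAddress = u64;

/// Hypothetical form of an inserted node as passed to commit(): the node
/// payload with its child references split out into a separate list.
struct NewNode {
    /// Node data minus child references (partial key, value, etc.).
    data: Vec<u8>,
    /// References to this node's children.
    children: Vec<NodeRef>,
}

/// A child is either another node inserted in the same commit, or a node
/// that already exists in the database.
enum NodeRef {
    /// Index into the list of new nodes submitted with this commit.
    New(usize),
    /// Address of an existing node.
    Existing(NodeAddress),
}
```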
> We'd still need to rebuild the trie node from the `Vec<u8>` data and `Children`. How involved would that be?
Not really involved I think, and we could even reuse the buffer since it would be decoded just after (I mean encoding would allow triedb to read the expected data, but triedb should ultimately just instantiate an OwnedNode from the Vec<u8>).
One small thing missing here is the ability to pass, with the Operation, the state version of the trie: it needs to be stored on a per-node basis, but would just be part of "the type of the node". Actually I think it is the only type information needed, as 0 children means a leaf and anything else is a branch (value presence should be cheap to check), and the empty node is a leaf without value at depth 0. Storing the depth per node may be an idea, but I don't see it really needed here.
Another question would be whether the partial key should be stored aligned, e.g. in case we have a branch with nibbles "abc" and a leaf at "d" with no partial. Do we store "abc" as "abc0", or do we pad as in the codec, "0abc"? I would tend to think it is better to pad where it is in the trie: "abc0" when depth is even and "0d" when depth is odd. But that is just a detail (and not really useful, since queries initially would go by the encoded form of the node).
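To make the two padding options concrete on that example (illustrative byte values only, nibbles packed two per byte):

```rust
fn main() {
    // Branch partial "abc" (3 nibbles) starting at even depth; its child
    // leaf has partial "d" (1 nibble) starting at odd depth.
    // Padding "where it is in the trie" keeps nibbles at their in-trie
    // offsets:
    let branch_in_trie: [u8; 2] = [0xab, 0xc0]; // "abc" stored as "abc0"
    let leaf_in_trie: [u8; 1] = [0x0d];         // "d" stored as "0d"
    // Padding as in the codec instead left-pads every odd-length partial:
    let branch_codec: [u8; 2] = [0x0a, 0xbc];   // "abc" stored as "0abc"
    println!("{branch_in_trie:02x?} {leaf_in_trie:02x?} {branch_codec:02x?}");
}
```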
> One small thing missing here is the ability to pass, with the Operation, the state version of the trie: it needs to be stored on a per-node basis, but would just be part of "the type of the node". Actually I think it is the only type information needed, as 0 children means a leaf and anything else is a branch (value presence should be cheap to check), and the empty node is a leaf without value at depth 0.
This can be stored in `NewNode::data`, I guess. Each node has arbitrary data attached to it.
> Another question would be whether the partial key should be stored aligned
This is not really a paritydb concern, as partial key encoding will be done at the trie level. The partial key would go in the `data` field as well. It is important however that the same `NodeKey` corresponds to the same node (data + children). So imagine the same branch node used at two different branches of the trie, one at even height and one at odd. The trie node encoding will be the same in both places, and the DB content should also be the same.
Updated the description for a simpler design. Instead of `NodeKey` we can expose database addresses in the API, and it will make things cleaner and easier. No need for the address cache.
@arkpar Are there any updates about this issue?
Some progress in https://github.com/paritytech/polkadot-sdk/pull/3386/ (I still have to clean up and add tests before moving it out of draft; the referenced paritydb patch is master, and the trie patch also has substantial unmerged changes).
Introduce a new column type: `MultiTree`, intended for more efficient storage of blockchain state tries.

Design goals:
Allow storing arbitrary trie structures in the database. Each tree is identified by a root key. There may be an arbitrary number of roots. The column should optimize disk layout for tree traversal.
The API.

`NodeAddress` is an opaque ID that allows for quick node retrieval. These IDs are defined by the database and will most likely be just `Table::Address`.
The `Operation` enum is extended with two additional tree operations.

Querying a root:
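A possible shape for this query, mirroring the `get_node` signature from the discussion above; the name and signature are a sketch, not the actual parity-db API:

```rust
/// Hypothetical: roots are looked up by their hash (NodeKey) through the
/// hash index; the result is the node data plus its children's addresses.
fn get_tree_root(col: ColId, root: &NodeKey) -> Result<Option<(Vec<u8>, Children)>>;
```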
Querying non-root nodes:
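A possible shape, assuming non-root nodes are fetched directly by the opaque `NodeAddress` obtained from the parent's children list, skipping the hash index (again a sketch, not the actual API):

```rust
/// Hypothetical: fetch a non-root node directly by its database address.
fn get_tree_node(col: ColId, address: NodeAddress) -> Result<Option<(Vec<u8>, Children)>>;
```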
Internal structure.

The hash index will be used for root nodes only. Nodes are stored in value tables. For each node we store `N` pairs of `(NodeKey, NodeAddress)`, where `N` is a compile-time constant = 16; a sketch of this per-node layout follows.
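As a rough illustration of that layout; the struct and the `NodeKey`/`NodeAddress` aliases are assumptions for the sketch, with unused child slots left empty:

```rust
/// Number of child entries stored per node (compile-time constant).
const N: usize = 16;

type NodeKey = [u8; 32]; // placeholder: child node hash
type NodeAddress = u64;  // placeholder: opaque value-table address

/// Hypothetical on-disk shape of a stored node: its data plus up to N
/// child entries pairing each child's hash with its database address.
struct StoredNode {
    data: Vec<u8>,
    children: [Option<(NodeKey, NodeAddress)>; N],
}
```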
Removing trees.
There are two possible approaches to implementing tree removal: traversing the tree and dereferencing each node, or keeping a journal record per tree of its `NodeAddresses`. With the journal approach, on deletion these addresses are dereferenced and the journal record is deleted as well. These records may be stored in a separate value table. This should be faster overall, but potentially wastes space for large tree sets.

Background optimization.
The database will periodically resettle nodes within value tables to optimize the on-disk tree layout. The idea here is that iterating a recent trie root should result in sequential IO.
Potential issues:
Data duplication. The design as proposed here has no node index. If the same value node is inserted into 2 different paths of the same tree, it will be stored twice. If this turns out to be an issue, we'd need to add each node to the index and use reference counting. We need to collect statistics for existing chains to estimate the impact of this.