w3f / polkadot-spec

The Polkadot Protocol Specification
https://spec.polkadot.network
Creative Commons Attribution Share Alike 4.0 International
180 stars 70 forks source link

Fixup merkle-value #698

Closed bkchr closed 12 months ago

bkchr commented 12 months ago

I assume the merkle value follows how we declare an inline node in the trie crate? If yes, then the size of the node value can be up to 32 bytes.

tjjfvi commented 12 months ago

Where is this defined in the source? I only found this, which would suggest that it's only inline if len < 32, like in the original text.

bkchr commented 12 months ago

https://github.com/paritytech/trie/blob/c4be095a88d32cdfcfbf2e3740d6a6764b36207a/trie-db/src/triedbmut.rs#L119

bkchr commented 12 months ago

And threshold being 33.

tjjfvi commented 12 months ago

As far as I can tell, the code you linked relates to whether or not a node has a hashed subvalue (see this, which links the usage of that threshold constant to the state version, and this, which shows that the Value type linked is not related to children).

The section of the spec edited in this PR does not relate to the hashed subvalue, but rather the merkle value of a trie node.

(The format of the hashed subvalue is, except for how it changes the header, almost entirely undocumented in the spec; there is no mention of the threshold you link, nor even how the format of the subvalue changes.)

bkchr commented 11 months ago

I honestly have no idea what the "merkle value" is used for. Our trie crate doesn't use this term nor does it cares. However, I would be surprised why the threshold that decides on inline child nodes should be different to the value that decides if the merkle value is a hash or the node value.

Maybe @cheme can shine some light here on what merkle value is used for. I can not recall on anything outside of the encoding that should care if something is a hash or a value.

cheme commented 11 months ago

Yes I am not really using "merkle value" wording usually, but it is used in the rpc spec https://github.com/paritytech/json-rpc-interface-spec/pull/37. In this proposal the goal is to subscribe on this to get alerted if content starting with this node prefix do change (hash or inline node will be different). Actually for sake of simplicity we could use directly hash of inline node to, so we only return hash nodes, but it would be a bit more costly on size. Having same threshold for inline node and value nodes comes from the same reasoning: saving space by avoiding a 32 byte hash storage when storing directly node content or value would take less or equal size.

Actually for inline node the limit would have been better at 31 (since for 32 byte we encode the size and use 33 bytes in the end). Also 33 in code means start using value node from 33 bytes (so up to 32 bytes, inline values are used). So it kind of a lower value threshold for inline value which can be a bit confusing (probably could change to a threshold for node value which would be 32).

tjjfvi commented 11 months ago

I can not recall on anything outside of the encoding that should care if something is a hash or a value.

If one has only partial information about the trie (i.e. only has the preimages of some of the node hashes, as in the case of a light client), when attempting to read from the trie, one must determine if a child is inlined or hashed. The only way to do so is to check the length – if it's <= 31, it's inline; otherwise, it's hashed.

(For values, on the other hand, whether it's inline or hashed is explicitly stored in the node header.)


If nodes are in fact inlined when len < 32, this PR should be reverted.

cheme commented 11 months ago

It is when node < 33 (<= 32). I was just mentioning < 32 would have been better.

tjjfvi commented 11 months ago

@cheme Then can you link to where in the source that is defined/handled?

What I linked strongly indicates < 32, and what @bkchr linked does not relate to inline nodes, but rather inline values.

cheme commented 11 months ago

yes you're right, I got bad memory. Inline node are actually from 0 to 31 encoded len, then one byte is used to encode the len in the node encoding which result to 32 byte so good. (you can see more in triedbmut and lookup (same thing look for comparison againt LENGTH from hasher (32 for blake)). For inline value, there is though this issue that a 32 byte value is inline encoded to 33 byte (with the additional 1 byte). So the threshold is different between both. (for code look for comparison against MAX_INLINE_VALUE (33 in polkadot-sdk/primitive/trie crate)). For inline value, it means that the many node that only store a hash will not use an external node (so little cost when accessing node but not value, but overall might be better given we avoid splitting there).

bkchr commented 11 months ago

For inline value, there is though this issue that a 32 byte value is inline encoded to 33 byte (with the additional 1 byte). So the threshold is different between both. (for code look for comparison against MAX_INLINE_VALUE (33 in polkadot-sdk/primitive/trie crate)).

W00t.... Why do we have done this..?

cheme commented 11 months ago

For inline value, there is though this issue that a 32 byte value is inline encoded to 33 byte (with the additional 1 byte). So the threshold is different between both. (for code look for comparison against MAX_INLINE_VALUE (33 in polkadot-sdk/primitive/trie crate)).

W00t.... Why do we have done this..?

tbh partially an overlook, but at the same time I cannot say if putting cursor for using a value node at 32 or 33 or even 40 is the better choice. strictly for proof size 32 looks better, but this also means that hash storage (which is very commont) will always use an external node value in db. also considering we currently put proof as a sequence of Vec where all vec are scale encoded we have an additional size byte for the value node too. I mean a single node trie with a 32 byte value will encode as node to 3 + 1 + 32 = 36 bytes then as proof (Vec<Vec>): 1 + 1 + 36 = 38. If the value node started at 32 byte and not 33, it would encode to 3 + 32 = 35 bytes (1 less), but as this is two different nodes the Vec<Vec> will encode to 1 + 1 + 1 + 35 = 38.

So unless we do some smarter proof encoding, this is not really impacting at the time (only impacting in a good way).

bkchr commented 11 months ago

IMO it would just have been easier to consume if inline node and inline value use the same threshold. Deciding what exactly this threshold should be, would be some extra step. Just for understanding the entire structure it would have been easier. But yeah, we are now too late :P