Closed bnewbold closed 6 years ago
Ugh, I made a super detailed response to this that got lost somehow (prob my fault). Here we go again:
InflatedEntry
is an internal implementation detail, to work around an issue where the optional values of the Entry
message would be encoded in the payload. This is small optimization to work around that. I wouldn't put that in the spec.inflated
field is a seq pointer to the last message where the optional InflatedEntry
fields are populated.contentFeed
is an optional field containing the public key of a feed you want to attach to store big values in. The idea being you want to store lots of data in the db, but don't want to read that data out when walking the trie, and potentially want to chunk the data. The usecase comes from hyperdrive but plenty of applications outside which is why it's here. You should only set this once and keep copying it in the InflatedEntry
message so it can be found quickly using the inflated
pointer.contentFeed
out of the header message and just have it in the latest InflatedFeed
instead. This makes the header more static and only contain things like version, protocol etc. The motivation for moving it into the InflatedEntry
was that then you only need to read that one message.Yes, so we wanna encode pairs of (index, value, feed, seq)
efficiently. Index
is the trie index , i
above, value
is the bit value (0-4). All unsigned integers.
The way we do it is this. First encode the index
as a varint. Then write a bitfield where the bits reflect with values are present for this index (right to left). For example bitfield 0b1
means the value 0 is present. 0b11
that 0 and 1 are present. 0b10
that only ` is present, etc.
After the bitfield comes the (feed, seq) pairs. So for each value there can be more than one pair (forks, hash collisions). Usually it's only 1 though. To pack this we add one bit (the more
one) to donate whether the next feed,seq pair corresponds to the same value. The more bit is packed into the feed seq using feed << 1 | more
because the feed value is usually a very small number and therefore has bits to spare.
@bnewbold reopen if more questions
@mafintosh sorry for the super long delay in coming back to this, but i'm still confused about trie encoding.
My assumption (and what I have written in the DEP) is that the trie
field for any given Entry (protobuf field) in the field has only the trie
entries for it's "branch of the trie", based on that entry's path array; it doesn't have a copy of the entire trie
. Each bucket in the trie might have pointers for up to 4 branches/children total (because there are 4 possible values at each node of the trie), unless there are hash collisions (handled by the "more" bit).
Given that, i'm confused what the "trie index" is that you mention (index into the trie array? why would this be prefixed to the encoding? the length of the array? the index of the entry containing this trie array?). Also confused by "pairs of (index, value, feed, seq)"; did you mean "pairs of (feed, seq)"?
I'm going to write up my best guess in DEP form and send to you for review.
I had some coffee and thought about it more, and I think I understand now:
A "bucket" is an entry in the trie array, corresponding to a node in the trie. Each node can have up to 4 "children" (2-bit value), so a bucket can have up to 4 pointer groups (though in practice I think only 3 would be encoded, because we don't encode for our own "value"/branch. The i
is the index of the bucket; there might be multiple bucket entries. The bitfield is short, at most 4 bits (eg, 0b10
could be written 0b0010
), and is referencing the possible 2-bit values.
I was confused earlier because I thought the bitfield was referencing buckets (aka, could have length up to the length of the trie array). I also hadn't really understood that the trie is indeed a trie and has up to 4 children at a given node (not just a single (feed, entry) pair in a given bucket).
IRC follow up:
03:30 <mafintosh> It's actually 5 values in trie
03:31 <mafintosh> 4 (last one) is special
03:31 <mafintosh> And means END_OF_ENTRY
03:32 <mafintosh> Has to be the last entry in your path
[snip]
03:50 <bnewbold> right, I guess it's possible for the trie to branch on a slot
that has the special END_OF_ENTRY value
03:50 <bnewbold> should probably add an example with that
03:51 <bnewbold> (I like to demonstrate edge cases, not just mention them)
04:01 <mafintosh> It would branch on EOE on hash conlissions or some conflict
scenarios
04:04 <bnewbold> I guess hash collisions always/only are on EOE
04:04 <bnewbold> are there ever multiple ('more') pointers/branches where there
*isn't* a hash collision?
04:05 <bnewbold> I guess I should mention in there that even if the full
trie/path matches it's important to check in the last bucket
to see if there are collision pointers
04:05 <bnewbold> i'm glad somebody generated a hash collision that we can
include in tests
I've been updating the hyperdb and multi-writer specs to track changes from the past week in hyperdb v3. A few questions that I haven't been able to glean from the source:
Why bother with a separate
InflatedEntry
protobuf message type, and how does the reader know which message type a given hypercore entry is? Eg, why not just always use InflatedEntry and leave thefeeds
andcontentFeed
fields blank if they aren't needed?Perhaps related to the above, what does the
inflated
field (type varint) indicate, and how is it used?How does the
trie
binary encoding work? I expected arepeated
array of{feed, index}
messages, but it's byte-packed. This one comment intrie-encoding.js
is extremely terse and should be expanded upon:My guess is that
i
is a varint of the length of the trie array,bitfield
is bytes (or a varint of bits?) indicating whether the given array index has an element (val
) or is empty (is this bits or bytes? left-packed or right-packed?). There are Nval
entries following, one for each high-bit inbitfield
. The format is two varints: the first is the feed index, right-shifted one bit (with the low bit indicating... "more"? more what?) and the second varint being the sequence.contentField
is not described. I assume it's used for hyperdrive-like applications. Is there more to it than just including in some early (first?) InflatedEntry message? Can subsequent InflatedEntry messages update or override? That seems complicated; a peer would need to sync the entire feed to find what the contentFeed key is to replicate it... maybe inflated is a pointer back to the most recent inflated entry index?Is the use of contentFeed to indicated that a special metadata entry should not be included as the first message in a hyperdrive+hyperdb feed, containing the public key of the "content" feed?