waku-org / research

Waku Protocol Research
MIT License

Prolly Trees under Waku synchronization process. #71

Open ABresting opened 7 months ago

ABresting commented 7 months ago

Prolly Trees

The Prolly Tree, an amalgamation of Merkle and B-Tree architectures, is devised to streamline data searching and classification. This structure is pivotal in swiftly discerning discrepancies between two data stores, rendering it particularly advantageous for synchronization applications. The Prolly Tree operates by hashing input data (mostly key-value pairs) and generating a tree-like structure. It ensures that hash contents conforming to a specific pattern are clustered within the same subtree branch. The tree accommodates adjustable branching, controlled through a quotient parameter (Q) applied over the span of possible hash values. Q is used to calculate a threshold value. If the hash values are 8 bytes in size, i.e. they span a 2^64 value space, then 2^Q acts as the chunking factor.

Threshold = max_Hash_Size/Chunking_Factor.

For example, if the hash value space is of size 2^32 and the chunking factor is 2^8, i.e. Q = 8, then the threshold is 2^24, i.e. 16777216. Any new hash value below this threshold triggers the formation of a new chunk boundary; subsequent values that do not fall under the threshold become part of the same chunk.
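As an illustration, the boundary rule above can be sketched as follows. This is a minimal sketch, not Waku's implementation: it assumes SHA-256 truncated to 32 bits as the hash function and the example's Q = 8, and all names are hypothetical.

```python
import hashlib

# Hypothetical parameters matching the example above:
# a 32-bit hash space with Q = 8, so threshold = 2**32 / 2**8 = 2**24.
HASH_SPACE_BITS = 32
Q = 8
THRESHOLD = 2 ** (HASH_SPACE_BITS - Q)  # 16777216

def chunk_keys(keys):
    """Group keys into chunks; a key whose hash value falls below the
    threshold closes the current chunk (forms a chunk boundary)."""
    chunks, current = [], []
    for key in keys:
        digest = hashlib.sha256(key.encode()).digest()
        # Take the top 32 bits of the digest as the hash value.
        value = int.from_bytes(digest[:4], "big")
        current.append(key)
        if value < THRESHOLD:  # boundary: expected roughly every 2**Q keys
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

print(chunk_keys([f"msg-{i}" for i in range(20)]))
```

Because the boundary decision depends only on the content being hashed, two peers that hold the same keys derive the same chunking, which is what makes the structure comparable across nodes.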

A Prolly Tree can be constructed through various methods, including insert-left, insert-right, and insert-random, but the keys are eventually ordered regardless.

When constructing a Prolly Tree the process of comparing diffs between two nodes can be depicted as follows:

  1. Identical Prolly-tree Boundary Scenario: In this scenario depicted below, both Node 1 and Node 2 exhibit Prolly Trees with an identical foundational data structure, as illustrated in the corresponding diagram. This shared basis, whether delineated by message timestamps or messageHashes, facilitates a streamlined synchronization process. The advantage here lies in the feasibility of a top-down search approach, wherein the comparison of diffs initiates from the root nodes, progressing downwards. This method is inherently efficient due to the commonality in the tree structures.

(diagram: boundary_usecase.drawio)

  2. Overlapping Boundaries Scenario: This scenario involves a more intricate synchronization process, as depicted in the accompanying diagram. Here, Node 2's Prolly Tree constitutes a subset of Node 1's tree. The synchronization challenge arises from the need to accurately locate the starting point for diff comparison within this subset. The initial step involves identifying the lowest-level subset of leaf nodes within Node 1's Prolly tree that aligns with Node 2's Prolly tree data range. Following this, the synchronization process targets the uppermost right-hand segment of Node 1's root, from where the diff calculation commences. This approach is necessitated by the partial overlap of the data ranges, requiring a focused comparison within the specific overlapping subtree.

(diagram: subset_tree_usecase.drawio)

  3. No Common Intermediate Nodes Scenario: In situations where Node 1 and Node 2 have Prolly Trees with no shared intermediate nodes, the synchronization process becomes more comprehensive. As indicated in the provided diagram, this scenario represents a less favorable case, necessitating the synchronization of the entire range of leaf node contents. This is due to the absence of commonality at the intermediate levels, which significantly impacts the diff process. A notable instance of this is when a singular missing transaction alters the hash at level 1 in each tree, compelling the exchange of all leaf-level messageHash data to achieve synchronization. This scenario exemplifies the complexity and data-intensive nature of synchronization in the absence of shared intermediate structures.

(diagram: worst_case.drawio)
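The top-down diff of the identical-boundary scenario can be sketched as follows. This is a minimal sketch under stated assumptions, not Waku's implementation: it assumes aligned chunk boundaries between the two trees and a hypothetical in-memory `Node` representation; the overlapping and no-common-node scenarios would need extra handling for misaligned or disjoint subtrees.

```python
import hashlib

# Hypothetical in-memory node: a leaf holds sorted keys, an internal
# node holds children; every node carries a Merkle-style hash.
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []
        payload = b"".join(k.encode() for k in self.keys)
        payload += b"".join(c.hash for c in self.children)
        self.hash = hashlib.sha256(payload).digest()

def _leaves(n):
    return [n] if not n.children else [l for c in n.children for l in _leaves(c)]

def diff(a, b):
    """Top-down diff: identical subtrees are skipped entirely.
    Returns keys present in `a` but missing from `b`."""
    if a.hash == b.hash:                      # common subtree, nothing to sync
        return []
    if not a.children or not b.children:      # hit the leaf level: compare keys
        a_keys = {k for leaf in _leaves(a) for k in leaf.keys}
        b_keys = {k for leaf in _leaves(b) for k in leaf.keys}
        return sorted(a_keys - b_keys)
    missing = []
    for ca, cb in zip(a.children, b.children):  # assumes aligned boundaries
        missing += diff(ca, cb)
    for ca in a.children[len(b.children):]:     # subtrees only `a` has
        missing += sorted({k for leaf in _leaves(ca) for k in leaf.keys})
    return missing

t1 = Node(children=[Node(keys=["a", "b"]), Node(keys=["c", "d"])])
t2 = Node(children=[Node(keys=["a", "b"]), Node(keys=["c"])])
print(diff(t1, t2))  # ['d']
```

The efficiency claim of scenario 1 shows up in the first branch: whole subtrees with equal hashes are pruned from the comparison without touching their leaves.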

Issues:

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway. Although it can be mitigated to some degree using the following approaches:

  1. Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf chunk buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.

  2. Optimization with Flat Binary Search: Further refinement can be achieved using a flat binary search. By utilizing timestamps and other chronological ordering criteria, this search method can expedite the identification and access of specific data points within the tree.

  3. Employing Bloom Filters for Efficient Filtering: Utilizing Bloom filters presents another expedient method for identifying missing entries within the Prolly Tree. While inherently probabilistic, Bloom filters can significantly streamline the process of locating missing entries. Their probabilistic nature implies a trade-off between accuracy and efficiency, but they can provide substantial benefits in swiftly narrowing down the search space for potential missing or altered data points in the tree structure.
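As a sketch of the third mitigation, a peer could advertise a Bloom filter of its messageHashes; the other side can then cheaply determine which of its own hashes the peer is definitely missing (Bloom filters have no false negatives, only occasional false positives). The filter size, hash count, and names below are illustrative assumptions, not Waku's implementation.

```python
import hashlib

# Minimal Bloom filter sketch (illustrative parameters).
class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # bit array stored as a Python big int

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False -> definitely absent; True -> probably present.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

local = ["hash-a", "hash-b", "hash-c"]
remote_filter = BloomFilter()
for h in ["hash-a", "hash-c"]:
    remote_filter.add(h)

# Entries the remote peer is definitely missing (no false negatives).
missing = [h for h in local if not remote_filter.might_contain(h)]
print(missing)
```

This narrows the search space before any tree traversal; entries that pass the filter (probable matches, including rare false positives) would still need confirmation via the tree diff itself.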

alrevuelta commented 7 months ago

Thanks!

Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity

Could you elaborate more on this? Does this refer to some kind of grouping? Eg grouping leafs (messages) by eg hour?

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions.

I haven't thought this through, but these properties of waku store might be relevant:

Just in case these properties help.

SionoiS commented 7 months ago

A Prolly Tree can be constructed through various methods, including insert-left, insert-right, and insert-random. In scenarios involving insert-right and insert-left, chronological ordering, such as timestamp sequencing of data or messages, can be preserved, leading to the formation of what may be termed an "ordered Prolly Tree."

A bit misleading Prolly trees are always ordered collections (Key only or Key-Value) if not then it's not a Prolly tree. As for the inserting I'm not sure what you mean.

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway.

I'm not sure what you mean here either, Prolly trees were designed specifically for this purpose.

Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.

It would be much simpler to have 2 trees one for key=timestamp value=hashes, one for key=hashes value=none.
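The two-index idea could be sketched as follows. This is illustrative only: a flat sorted list stands in for the timestamp-keyed tree and a plain set for the hash-keyed one; in practice both would be Prolly trees, and all names are hypothetical.

```python
import bisect

# Sketch of the two-index idea:
# - a time index mapping timestamp -> messageHash, kept sorted so
#   range queries are cheap
# - a set of all hashes (key=hash, value=none) for membership checks
time_index = []   # sorted list of (timestamp_ns, message_hash)
hash_set = set()

def insert(ts, h):
    bisect.insort(time_index, (ts, h))
    hash_set.add(h)

def range_query(start_ts, end_ts):
    """All hashes with start_ts <= timestamp < end_ts."""
    lo = bisect.bisect_left(time_index, (start_ts, ""))
    hi = bisect.bisect_left(time_index, (end_ts, ""))
    return [h for _, h in time_index[lo:hi]]

insert(100, "h1"); insert(250, "h2"); insert(300, "h3")
print(range_query(100, 300))  # ['h1', 'h2']
```

The time index serves "sync me between X and Y" queries directly, while the hash set answers "do I already have this message" without scanning by time.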

Optimization with Flat Binary Search: Further refinement can be achieved using a flat binary search. By utilizing timestamps and other chronological ordering criteria, this search method can expedite the identification and access of specific data points within the tree.

💯 Binary search is always used since keys are ordered.

Employing Bloom Filters for Efficient Filtering: Utilizing Bloom filters presents another expedient method for identifying missing entries within the Prolly Tree. While inherently probabilistic, Bloom filters can significantly streamline the process of locating missing entries. Their probabilistic nature implies a trade-off between accuracy and efficiency, but they can provide substantial benefits in swiftly narrowing down the search space for potential missing or altered data points in the tree structure.

No need IMO, Prolly trees are plenty fast.

jm-clius commented 7 months ago

Thank you for this summary.

I have a couple of questions:

linearly or chronologically ordered message hashes

Does this imply we need some natural sorting order for message hashes? Even though all messages have timestamps, we do not generate epoch-sortable hashes. Perhaps we should or perhaps we need to include the timestamp as significant part of the key? Or am I missing something?

Implementing timestamp boundaries within the leaf buckets

What is the tradeoff here? Do inserts become more expensive? Does tree depth become more unpredictable?

Not for this post: In general I think a good next step is to contextualise this within the Store/Relay domain - considering our existing hashing mechanism, timestamping and other simplifying assumptions that @alrevuelta listed above: no modifications, no deletions, almost chronological inserts (we only allow up to 20s variance in timestamps in either direction for inserted messages), etc. Where would the prolly tree fit into our current Store <-> Archive (DB) model? Some middleware? Should the sync mechanism be separate from the Store protocol (I'm thinking more and more, yes, but it would be good to understand if anything we do can have/should have an effect on the Archive/DB structure, the Store protocol itself, etc.)

jm-clius commented 7 months ago

It would be much simpler to have 2 trees one for key=timestamp value=hashes, one for key=hashes value=none.

Interesting, although I may not quite get this idea yet. Do you mean that you have a tree of timestamp:hash_of_all_message_hashes_with_this_timestamp? Why wouldn't a tree built of key=time_sortable_hash value=none be more effective?

SionoiS commented 7 months ago

Interesting, although I may not quite get this idea yet. Do you mean that you have a tree of timestamp:hash_of_all_message_hashes_with_this_timestamp? Why wouldn't a tree built of key=time_sortable_hash value=none be more effective?

One tree would represent a table with keys of type int representing message timestamps, the values would be the hashes of messages.

The other tree would be a set of all the message hashes. :thinking: May not be that useful since range queries would be weird. Give me all hashes greater than X but less than Y !?!?

jm-clius commented 7 months ago

The other tree would be a set of all the message hashes. 🤔 May not be that useful since range queries would be weird. Give me all hashes greater than X but less than Y !?!?

Right, so I don't think I see the use of the second tree here. As long as the number of hashes per timestamp remains relatively low (we expect number of messages per second to be < 100), perhaps it may be okay-ish to just compare the hashes on each timestamp sequentially? However, this tree would indeed have modifiable values (new messages arrive for previous timestamp), conflicts, etc.

Give me all hashes greater than X but less than Y !?!?

Which again seems to me to imply some natural sorting order of hashes based on encoded timestamp.

Still not sure I see the advantage of this approach vs. some kind of key=timestamp,hash value=null

SionoiS commented 7 months ago

As long as the number of hashes per timestamp remains relatively low (we expect number of messages per second to be < 100), perhaps it may be okay-ish to just compare the hashes on each timestamp sequentially? However, this tree would indeed have modifiable values (new messages arrive for previous timestamp), conflicts, etc.

Timestamps should be unique, don't we use nanosecond precision? In my mind, one timestamp, one hash, edge cases be damned. Otherwise the system increases in complexity real fast!

We can store multiple hashes per key in the tree but it's annoying.

We could use the timestamp of the reception of messages too. Nope we can't. The tree would differ.

jm-clius commented 7 months ago

Timestamps should be unique, don't we use nanosecond precision?

Well, we have nanosecond-precision timestamp fields, but we definitely suggest timestamps tick on coarse boundaries (by design), otherwise timing attacks become pretty trivial. We should expect timestamps around 500ms boundaries.

ABresting commented 7 months ago

Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity

Could you elaborate more on this? Does this refer to some kind of grouping? Eg grouping leafs (messages) by eg hour?

Sure. Leaf nodes are actual chunks of relevant data grouped by some bucketing mechanism, i.e. a chunk boundary forms when the rolling hash of the preceding entries falls under a category (e.g. the aggregated hash is less than some threshold value). So the grouping works like that; it is not based on time-lined buckets (although those would be great).

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions.

I haven't thought this through, but these properties of waku store might be relevant:

  • Modifications will be rare or even non-existent. Once a message is added to the store protocol it shouldn't be modified.

Indeed, but just documenting it here for broader understanding.

  • The same applies to deletion. A valid message is always stored and shall not be deleted. The only use case where a message is deleted is when "prunning" the node, eg you want to just store the last 7 days of history. If chronologically sorted, that would be "left hand" running.

  • Most of the time insertions will be chronological. Meaning that new timestamps will be higher. So if the tree is sorted chronologically, "right hand" insertions will be predominant.

Yes, in the ideal case there will be no more insertions (except sync replies) if we construct the Prolly tree on entries older than 20 seconds (the limit for updated messages).

ABresting commented 7 months ago

A bit misleading Prolly trees are always ordered collections (Key only or Key-Value) if not then it's not a Prolly tree. As for the inserting I'm not sure what you mean.

Oh yes, this statement needs more clarity; I will amend it. What I mean here is the flow of incoming events/messages: "random" means the flow of keys can be out of order at creation time but is eventually ordered. In our case, we take entries 20 seconds or older (the outdated-message limit), so the order will be predefined (except for missing entries filled by the Sync process), i.e. insert from the right.

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway.

I'm not sure what you mean here either, Prolly trees were designed specifically for this purpose.

I am talking about a case where the leaf entries are just messageHashes, with no key-value pairs hashed together. In that case, we get a level-0 chunk/bucket that can contain 5 leaf nodes or 50000 (based on the parameter/input set). To avoid resending the whole chunk/bucket, we can find the starting messageHash that is causing the upper-level mismatch. This process might take time to find the starting point of the range.

Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.

It would be much simpler to have 2 trees one for key=timestamp value=hashes, one for key=hashes value=none.

But I think there is an assumption here that a single timestamp can have multiple messages, so I don't know how this timestamp-based key-value store will work. Imagine a user says "sync me between timestamps X and Y": to get X we need to hash it with its value, but the values can differ, so there is a collision.

ABresting commented 7 months ago

Timestamps should be unique, don't we use nanosecond precision?

Well, we have nanosecond-precision timestamp fields, but we definitely suggest timestamps to tick on coarse boundaries (by design), otherwise timing attacks becoming pretty trivial. We should expect timestamps around 500ms boundaries.

@jm-clius @SionoiS So basically I am trying to avoid having two Prolly trees. If, using some clever trick, we can track down the first differing messageHash in a level-0 leaf chunk/bucket, then answering range queries can be handled at the store/archive level itself. For example, given a root hash and the range "sync me between timestamp X and timestamp Y", even in the "one timestamp, multiple hashes" case we can take the messageHash of the first occurrence of timestamp X and the last occurrence of timestamp Y. Syncing between those should give us an optimal set of messageHashes to sync.

SionoiS commented 7 months ago

But I think here is an assumption that for a timestamp there can be multiple messages, so idk how this timestamp-based key-value store will work. Imagine user said sync me between X and Y time stamp, so to get X we need to hash it with it's value, but values can be different, so collision.

For a read you don't need to hash anything, only inserts/deletes to determine "chunk" boundaries.

If we have multiple values per key, Sync is less efficient because you lose granularity.

SionoiS commented 7 months ago

Had a good discussion with @ABresting. We probably need to allow multiple values per key in the case of a time index, but we lose sync efficiency as time precision decreases.

Also, we could limit the type of queries by not building every possible index, OR do the opposite and build every index and allow arbitrary queries. I'm not sure of the advantages of either.

Lastly, the chunking algorithm could be based on rolling hashes but can be changed later; the structure of the tree is agnostic to this.

SionoiS commented 7 months ago

I just had an idea. If we need multiple values per key, then we could sort the values too. That way, if we choose a tree shape with big leaves (chunks), searching a chunk is more efficient (only works if values can be sorted) and deterministic.
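The sorted-values idea might look like this. A hypothetical sketch only: a timestamp key maps to a sorted list of hashes, so binary search works inside a large chunk and the serialization (hence the chunk hash) stays deterministic.

```python
import bisect

# Hypothetical multi-value leaf: key=timestamp -> sorted messageHashes.
leaf = {}

def insert(ts, h):
    # Keep each value list sorted on insert (deterministic layout).
    bisect.insort(leaf.setdefault(ts, []), h)

def contains(ts, h):
    # Binary search within the sorted value list: O(log n) per key.
    values = leaf.get(ts, [])
    i = bisect.bisect_left(values, h)
    return i < len(values) and values[i] == h

insert(500, "bb"); insert(500, "aa"); insert(500, "cc")
print(leaf[500])  # ['aa', 'bb', 'cc']
```

Because insertion order no longer affects the stored layout, two peers holding the same messages hash the chunk identically regardless of arrival order.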

ABresting commented 7 months ago

I just had an idea. If we need multiple values per key, then we could sort the values too. That way, if we choose a tree shape with big leaves (chunks), searching a chunk is more efficient (only works if values can be sorted) and deterministic.

So a B-Tree inside the multi-value key? Regarding the sorting order, could size or chronological ordering based on the first 5 bytes of the messageHash be highly deterministic?