waku-org / research

Waku Protocol Research
MIT License
3 stars 0 forks source link

Prolly Trees under Waku synchronization process. #70

Closed ABresting closed 10 months ago

ABresting commented 10 months ago

Prolly Trees

The Prolly Tree, an amalgamation of Merkle and B-Tree architectures, is devised to streamline data searching and classification. This structure is pivotal in swiftly discerning discrepancies between two data stores, rendering it particularly advantageous for synchronization applications. The Prolly Tree operationalizes by hashing input data and generating a tree-like structure. It ensures that hash contents conforming to a specific pattern are clustered within the same subtree branch. The tree accommodates adjustable branching, controlled through a quotient parameter (Q) applied over the span of possible hash values. A node is designated as the first child in its parental branch if the initial segment of its hash, when interpreted as an unsigned 32-bit integer, falls below the ratio of the hash space to Q.

A Prolly Tree can be constructed through various methods, including insert-left, insert-right, and insert-random. In scenarios involving insert-right and insert-left, chronological ordering, such as timestamp sequencing of data or messages, can be preserved, leading to the formation of what may be termed an "ordered Prolly Tree."

When constructing a Prolly Tree using linearly or chronologically ordered message hashes for synchronization purposes, the process of comparing diffs between two nodes can be depicted as follows:

  1. Identical Prolly-tree Boundary Scenario: In this scenario depicted below, both Node 1 and Node 2 exhibit Prolly Trees with an identical foundational data structure, as illustrated in the corresponding diagram. This shared basis, whether delineated by message timestamps or messageHashes, facilitates a streamlined synchronization process. The advantage here lies in the feasibility of a top-down search approach, wherein the comparison of diffs initiates from the root nodes, progressing downwards. This method is inherently efficient due to the commonality in the tree structures.

perfectly_aligned_usecase drawio

  1. Overlapping Boundaries Scenario: This scenario involves a more intricate synchronization process, as depicted in the accompanying diagram. Here, Node 2's Prolly Tree constitutes a subset of Node 1's tree. The synchronization challenge arises from the need to accurately locate the starting point for diff comparison within this subset. The initial step involves identifying the lowest-level subset of leaf nodes within Node 1's Prolly tree that aligns with Node 2's Prolly tree data range. Following this, the synchronization process targets the uppermost right-hand segment of Node 1's root, from where the diff calculation commences. This approach is necessitated by the partial overlap of the data ranges, requiring a focused comparison within the specific overlapping subtree.

subset_tree_usecase drawio

  1. No Common Intermediate Nodes Scenario: In situations where Node 1 and Node 2 have Prolly Trees with no shared intermediate nodes, the synchronization process becomes more comprehensive. As indicated in the provided diagram, this scenario represents a less favorable case, necessitating the synchronization of the entire range of leaf node contents. This is due to the absence of commonality at the intermediate levels, which significantly impacts the diff process. A notable instance of this is when a singular missing transaction alters the hash at level 1 in each tree, compelling the exchange of all leaf-level messageHash data to achieve synchronization. This scenario exemplifies the complexity and data-intensive nature of synchronization in the absence of shared intermediate structures.

worst_case drawio

Issues:

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway. Although upto some degree it can be mitigated Using the following ways:

  1. Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.

  2. Optimization with Flat Binary Search: Further refinement can be achieved using a flat binary search. By utilizing timestamps and other chronological ordering criteria, this search method can expedite the identification and access of specific data points within the tree.

  3. Employing Bloom Filters for Efficient Filtering: Utilizing Bloom filters presents another expedient method for identifying missing entries within the Prolly Tree. While inherently probabilistic, Bloom filters can significantly streamline the process of locating missing entries. Their probabilistic nature implies a trade-off between accuracy and efficiency, but they can provide substantial benefits in swiftly narrowing down the search space for potential missing or altered data points in the tree structure.

SionoiS commented 10 months ago

A Prolly Tree can be constructed through various methods, including insert-left, insert-right, and insert-random. In scenarios involving insert-right and insert-left, chronological ordering, such as timestamp sequencing of data or messages, can be preserved, leading to the formation of what may be termed an "ordered Prolly Tree."

A bit misleading Prolly trees are always ordered collections (Key only or Key-Value) if not then it's not a Prolly tree. As for the inserting I'm not sure what you mean.

A notable limitation arises with linear or chronological data management in Prolly trees, especially when random edits are involved - such as modifications, insertions, or deletions. Locating specific data for these operations lacks a direct, expedited pathway.

I'm not sure what you mean here either, Prolly trees were designed specifically for this purpose.

Timestamp Boundaries in Leaf Buckets: Implementing timestamp boundaries within the leaf buckets, which hold ordered messageHashes, enhances navigational efficiency. This structure enables more rapid access to the pertinent bucket, leveraging logarithmic complexity.

It would be much simpler to have 2 trees one for key=timestamp value=hashes, one for key=hashes value=none.

Optimization with Flat Binary Search: Further refinement can be achieved using a flat binary search. By utilizing timestamps and other chronological ordering criteria, this search method can expedite the identification and access of specific data points within the tree.

:100: Binary search is always used since keys are ordered.

Employing Bloom Filters for Efficient Filtering: Utilizing Bloom filters presents another expedient method for identifying missing entries within the Prolly Tree. While inherently probabilistic, Bloom filters can significantly streamline the process of locating missing entries. Their probabilistic nature implies a trade-off between accuracy and efficiency, but they can provide substantial benefits in swiftly narrowing down the search space for potential missing or altered data points in the tree structure.

No need IMO, Prolly trees are plenty fast.

SionoiS commented 10 months ago

duplicate of #71