
Synchronization Integration with Waku Protocol #73

Open ABresting opened 11 months ago

ABresting commented 11 months ago

Synchronization Integration with Waku Protocol

The implementation of Prolly trees constitutes a critical component of the synchronization layer and must be built on specific attribute values. These values, notably timestamp and messageHash, serve as key-value pairs, which fits naturally with the Waku protocol: the messageHash uniquely identifies a message, while the timestamp attached to Waku messages provides approximate chronological ordering and makes messages well suited to range queries.
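A minimal sketch of this key-value view, assuming entries keyed by timestamp with the messageHash as value (names and types are illustrative, not existing Waku structures):

```python
# Hedged sketch: a timestamp-ordered index of (timestamp, messageHash) entries.
# The timestamp key gives approximate chronological ordering and supports range
# queries; the messageHash value uniquely identifies the message.
import bisect

class SyncIndex:
    def __init__(self):
        self._entries = []  # sorted list of (timestamp_ns, message_hash) tuples

    def insert(self, timestamp_ns: int, message_hash: bytes) -> None:
        bisect.insort(self._entries, (timestamp_ns, message_hash))

    def range_query(self, start_ns: int, end_ns: int) -> list[bytes]:
        # Return the hashes of all messages whose timestamp lies in [start_ns, end_ns).
        lo = bisect.bisect_left(self._entries, (start_ns, b""))
        hi = bisect.bisect_left(self._entries, (end_ns, b""))
        return [h for _, h in self._entries[lo:hi]]
```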

Indexing of Prolly trees:

Our objective is to structure the Prolly trees around the primary synchronization use cases, which predominantly involve synchronization by pubsub topic and content topic. Within this framework, the Prolly trees act as an overarching structure inside which specific trees can be constructed using distinct key-value parameters such as timestamp and messageHash, among possible others. It is important to note, however, that the storedAt attribute is a locally generated timestamp, does not meet the requirements for a Prolly tree index, and is therefore excluded from this indexing strategy.
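As an illustration of the topic-scoped indexing, a hedged sketch (reusing the SyncIndex sketch above; names are illustrative) that keeps one index per pubsub/content topic pair and deliberately leaves storedAt out of the key:

```python
from collections import defaultdict

class TopicScopedIndex:
    """One index (a stand-in for a per-topic Prolly tree) per (pubsubTopic, contentTopic)."""

    def __init__(self):
        self._indexes = defaultdict(SyncIndex)

    def insert(self, pubsub_topic: str, content_topic: str,
               timestamp_ns: int, message_hash: bytes) -> None:
        # storedAt is intentionally not part of the key: it is generated locally
        # and would differ from node to node, breaking cross-node comparability.
        self._indexes[(pubsub_topic, content_topic)].insert(timestamp_ns, message_hash)
```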

Integration of Prolly Trees with Waku Protocol:

The synchronization protocol is envisaged to be developed as a middleware layer, leveraging the capabilities of the Waku Store and Archive protocols. This integration plays a pivotal role in the efficient handling and retrieval of data for synchronization purposes.

Key Steps in Prolly Tree Construction:

Data Retrieval from Waku Archive: The initial step involves fetching data from the Waku Archive, mostly freshly stored messages. This process is crucial for accessing historical and real-time data necessary for the construction of the Prolly tree.

In-Memory Key-Value Store Utilization: Once the data is retrieved, it is stored in an in-memory key-value store. This store acts as an index repository for data before it is structured into the Prolly tree.

Bottom-Up Construction of Prolly Tree: Utilizing the data from the key-value store, the Prolly tree is constructed in a bottom-up manner. This approach ensures efficient organization and retrieval of data.
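A minimal sketch of these three steps, assuming a hypothetical archive.query interface that yields (timestamp, messageHash) pairs and using content-defined chunk boundaries for the bottom-up build (this is not an existing nwaku API):

```python
import hashlib

BOUNDARY_MASK = 0x0F  # boundary when the low 4 bits of a hash are zero -> average fanout ~16

def node_hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_level(hashes: list[bytes]) -> list[bytes]:
    # Group one level of node hashes into parents at probabilistic, content-defined boundaries.
    parents, chunk = [], []
    for h in hashes:
        chunk.append(h)
        if h[-1] & BOUNDARY_MASK == 0:
            parents.append(node_hash(b"".join(chunk)))
            chunk = []
    if chunk:  # close the trailing partial chunk
        parents.append(node_hash(b"".join(chunk)))
    return parents

def build_prolly_tree_root(archive, start_ns: int, end_ns: int) -> bytes | None:
    # 1. Data retrieval from the Waku Archive (archive.query is a hypothetical interface).
    rows = archive.query(start_ns, end_ns)

    # 2. In-memory key-value store acting as the index repository.
    kv_store = dict(rows)

    # 3. Bottom-up construction: hash the sorted leaves, then repeatedly derive
    #    parent levels until a single root hash remains.
    level = [node_hash(ts.to_bytes(8, "big") + mh) for ts, mh in sorted(kv_store.items())]
    while len(level) > 1:
        level = build_level(level)
    return level[0] if level else None
```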

Considerations for Key-Value Store Properties:

Message Acceptance Threshold: The key-value store should only include messages that are at least 20 seconds old, establishing a threshold for message acceptance.

Minimization of Random Modifications: Random alterations within the Prolly tree should be kept to a minimum, except in cases where the local Prolly tree is being populated with synchronized entries.

Serial Insertions: New entries should be inserted into the Prolly tree serially, starting from the right side. This method ensures consistency and order in the data structure.

Deletion Strategy: The removal of old entries should be executed from the left side of the tree. This can be based on a fixed time-bound criterion or by pruning older entries, maintaining the tree's efficiency and relevance.

It is important to note that both insertion and deletion operations are designed to have a negligible impact on the overall structure of the Prolly tree, thereby ensuring its stability and efficiency over time. This careful consideration in the design allows for effective synchronization and data management within the Waku protocol framework.
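A minimal sketch of these key-value store properties, assuming a 20-second acceptance threshold and a fixed retention window (both values are placeholders, not settled parameters):

```python
import time
from collections import OrderedDict

ACCEPTANCE_THRESHOLD_NS = 20 * 1_000_000_000  # proposed 20 s acceptance threshold
RETENTION_WINDOW_NS = 3_600 * 1_000_000_000   # assumed 1-hour retention window

class SyncStore:
    """Illustrative in-memory key-value store feeding the Prolly tree."""

    def __init__(self):
        # With serial right-side insertions, insertion order matches timestamp order.
        self.entries = OrderedDict()  # timestamp_ns -> messageHash

    def insert(self, timestamp_ns: int, message_hash: bytes, now_ns: int | None = None) -> bool:
        now_ns = now_ns or time.time_ns()
        # Message acceptance threshold: only accept messages at least 20 s old.
        if now_ns - timestamp_ns < ACCEPTANCE_THRESHOLD_NS:
            return False
        # Serial insertion from the right: new entries carry the newest timestamps.
        self.entries[timestamp_ns] = message_hash
        return True

    def prune(self, now_ns: int | None = None) -> None:
        # Deletion from the left: drop entries older than the retention window.
        now_ns = now_ns or time.time_ns()
        cutoff = now_ns - RETENTION_WINDOW_NS
        while self.entries and next(iter(self.entries)) < cutoff:
            self.entries.popitem(last=False)
```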

Waku Store and Synchronization Integration:

The integration of the Waku store protocol with the synchronization process is instrumental in facilitating several key tasks:

Retrieval of Waku Messages: The Waku store protocol is pivotal in accessing Waku messages. This process ensures that the necessary data for tree construction is readily available.

Interfacing with Peer Nodes: It also plays a crucial role in interfacing with synchronization-related tasks from other peer nodes, thereby enabling efficient data exchange and synchronization across the network.

It is imperative to maintain a clear distinction between the Store protocol and the Synchronization protocol. This modular abstraction not only simplifies maintenance but also enhances the scalability and extensibility of the system.
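As a purely hypothetical illustration of this modular boundary (none of these method names are existing Waku APIs), the two layers might be separated roughly as follows:

```python
from abc import ABC, abstractmethod

class StoreProtocol(ABC):
    """Retrieval of Waku messages (existing responsibility)."""

    @abstractmethod
    def get_messages(self, message_hashes: list[bytes]) -> list[bytes]: ...

class SyncProtocol(ABC):
    """Reconciliation of Prolly tree state with peer nodes (separate, new layer)."""

    @abstractmethod
    def get_root(self, peer_id: str) -> bytes: ...

    @abstractmethod
    def missing_hashes(self, peer_id: str, start_ns: int, end_ns: int) -> list[bytes]: ...
```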

Waku Archive and Synchronization:

The use of the Waku archive in conjunction with Prolly trees is a straightforward approach to accessing stored messages and their attributes at the local node level. Key considerations include:

Interface Implementation: An interface should be implemented to abstract away database-level details. This abstraction allows the Synchronization layer to access archive messages without being constrained by the specifics of the underlying database. Currently, we can leverage existing archive functions to support Prolly trees.

Database Agnosticism: Ensuring that the Synchronization layer can interact with archive messages irrespective of the database in use is crucial. This agnosticism gives the system flexibility and adaptability, allowing seamless integration with various database technologies, including but not limited to SQL, NoSQL, and Codex.
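A minimal sketch of such a database-agnostic interface, assuming hypothetical method names that existing archive functions could back:

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class ArchiveReader(ABC):
    """What the Synchronization layer needs from the archive, independent of
    whether the backend is SQL, NoSQL, Codex, or something else."""

    @abstractmethod
    def messages_in_range(self, start_ns: int, end_ns: int) -> Iterator[Tuple[int, bytes]]:
        """Yield (timestamp, messageHash) pairs for messages in the given time range."""

    @abstractmethod
    def get_message(self, message_hash: bytes) -> bytes:
        """Return the full stored message for a given messageHash."""
```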

In summary, the integration of Waku Store and Archive with the synchronization process, through well-defined interfaces and modular abstractions, is essential for efficient data retrieval, synchronization, and overall system scalability. This approach not only streamlines the synchronization process but also ensures that the system remains robust and adaptable to evolving technological landscapes.

jm-clius commented 11 months ago

Thanks for this. A couple of initial comments:

Data Retrieval from Waku Archive: The initial step involves fetching data from the Waku Archive

Not sure I understand this step. Why wouldn't we simply insert keys into the prolly tree in real time as we insert new messages into the archive? In other words, we assume the archive and the prolly tree view of current keys are populated from empty.

Once the data is retrieved, it is stored in an in-memory key-value store.

Presumably this is something we can test with a POC, but since we have fixed hash and timestamp sizes it should be possible to calculate roughly the memory impact this will have for x amount of messages. Making some assumptions on message rates (see e.g. https://github.com/waku-org/research/issues/22) we can then come up with some approximate memory-per-hour figure that will help us dimension the sync service. May be good to understand if there are non-memory alternatives here - an archive/db backend on disc that is compatible with prolly tree structure? No need for in-depth study, but part of understanding the moving parts at play here.

Message Acceptance Threshold: The key-value store should only include messages that are at least 20 seconds old

I may be missing a part of the picture here. Is this in order to avoid out-of-order insertions? Or is it rather a way to achieve simpler root comparisons without the churn of "real time" messages. It seems to me that out of order insertions will be unavoidable in any case when we synchronise stores and insert missing hashes. In terms of real-time inserts, we should indeed only see unordered messages for the past 20s (moving window) as we discard messages arriving at the node older than this. To me the simplest model seems to be inserting messages as they are processed by the relay node, noting that the rightmost 20s of the tree may have some out of order insertions. If comparison is easier within the past (probably because we believe nodes have a better chance of being synced up to that point), does this imply some cache where we keep "real time" messages in an ordered list until they're old enough to be included in the tree?

maintain a clear distinction between the Store protocol and the Synchronization protocol

In other words, this suggests a new protocol. It may be good to be concrete about what is expected here in a follow-up to this post. Do we expect the Store protocol to undergo any changes? Something like: "The synchronisation protocol is responsible for communicating between ... and ..." (listing the actors and the types of interactions we'd expect), and "The Store protocol needs to be expanded to support ... functionality."

ABresting commented 11 months ago

Not sure I understand this step. Why wouldn't we simply insert keys into the prolly tree in real time as we insert new messages into the archive? In other words, we assume the archive and the prolly tree view of current keys are populated from empty.

  1. Real-time insertion into the PTree can be done assuming the relayed messages are valid (the archive stores the valid messages); this is possible on Store nodes. I am not sure whether we want to give light/filter nodes the Sync protocol; if so, we need to make sure the relayed messages are validated and then compute the messageHash. AFAIK, nodes other than Store do not store data even temporarily.
  2. We need messageHash as the value of the key-value pair, and it is available from the archive/DB.
ABresting commented 11 months ago

Presumably this is something we can test with a POC, but since we have fixed hash and timestamp sizes it should be possible to calculate roughly the memory impact this will have for x amount of messages. Making some assumptions on message rates (see e.g. #22) we can then come up with some approximate memory-per-hour figure that will help us dimension the sync service.

Yes, this is something we will definitely do, reusing some of the traffic stats from RLN message rates. From our earlier conversations, a default 1-hour PTree is fine, since the timestamp and messageHash are fixed-size and at least an order of magnitude smaller than the whole/original message, and log(N) will be the average height of the PTree over N messages.
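As a rough back-of-envelope sketch of the memory impact, assuming a 32-byte messageHash, an 8-byte (int64 nanosecond) timestamp, and a placeholder message rate to be replaced with the RLN traffic stats:

```python
# All numbers below are assumptions for illustration, not measured figures.
ENTRY_BYTES = 32 + 8     # messageHash (32 B) + timestamp (8 B) per leaf entry
MSGS_PER_SEC = 10        # assumed message rate, to be refined from RLN stats
WINDOW_SECONDS = 3600    # default 1-hour PTree window

leaf_bytes = ENTRY_BYTES * MSGS_PER_SEC * WINDOW_SECONDS
# Internal nodes add roughly 1/(fanout - 1) overhead; assume an average fanout of 32.
total_bytes = leaf_bytes * (1 + 1 / 31)

print(f"~{leaf_bytes / 1024:.0f} KiB of leaves, ~{total_bytes / 1024:.0f} KiB total per hour")
# -> ~1406 KiB of leaves, ~1452 KiB total per hour at 10 msg/s
```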

ABresting commented 11 months ago

Message Acceptance Threshold: The key-value store should only include messages that are at least 20 seconds old

I may be missing a part of the picture here. Is this in order to avoid out-of-order insertions? Or is it rather a way to achieve simpler root comparisons without the churn of "real time" messages. It seems to me that out of order insertions will be unavoidable in any case when we synchronise stores and insert missing hashes. In terms of real-time inserts, we should indeed only see unordered messages for the past 20s (moving window) as we discard messages arriving at the node older than this. To me the simplest model seems to be inserting messages as they are processed by the relay node, noting that the rightmost 20s of the tree may have some out of order insertions. If comparison is easier within the past (probably because we believe nodes have a better chance of being synced up to that point), does this imply some cache where we keep "real time" messages in an ordered list until they're old enough to be included in the tree?

Out-of-order messages are fine since, as you mention, out-of-order insertion will happen upon sync anyway. A buffer time of, say, 20 s (open to adopting a different number) is there so that the eventual delivery of messages is expected (possibly in order). I am not sure yet about keeping an ordered cache vs. querying the archive store; I will open an issue, though it is not a priority.