vacp2p / research

Thinking in code
MIT License
62 stars 4 forks source link

Fault-Tolerant Waku Store Protocol #64

Open staheri14 opened 3 years ago

staheri14 commented 3 years ago

Problem

The current design of store protocol is not fault-tolerant that is when a peer goes offline, it will have to risk missing all the messages transmitted in the network during its off time. This also means that other peers who rely on that store peer are subject to this message history gap. We would like to make store protocol more fault-tolerant in such a way that store peers can safely go offline and be guaranteed that when they get back online they would be able to retrieve the history of messages from other store peers.

The current issue outlines a minimal solution with some rough implementation details. Further issues will be followed as the solution grows into a more mature state.

Problem structure

We can break this problem into three phases:

  1. Failure/history gap detection: Peers may fail to capture the intact message history and become inconsistent with other store nodes due to various reasons, for example, a peer may miss some of the messages due to a poor network connection, the network topology, a sudden hardware failure, or just simply by going offline. The peer needs to detect its failure and recover from that incomplete state.
  2. Discovery: Once a peer becomes aware of a gap in its message history, it needs to find other peers of the store protocol who have the missing messages.
  3. Transmission: This phase is to fetch the missed messages from another store node(s).

Initial solution: time window query

In the initial solution, we consider

  1. that a peer intentionally goes offline (so no crash or sudden failure).
  2. A basic discovery method where a peer has a hardcoded list of store peers to query from. This can also be an input to the protocol.
  3. Messages are transmitted via a direct connection between the two peers.

The high-level idea is as follows: A peer has connections to a set L = {P1,...PN} of other store peers (let's call them backup peers). A peer P in the store protocol keeps track of its on and off time. Let's say P knows he has been off for the time window W. P loops through all Pi in L:

Assumptions

In this solution we assume that store nodes do not lie about the history of messages (no censorship and no message injection).

The implementation path

Based on what I can see in the store protocol, we can build the time window query, more specifically the data transmission part, relying on the pagination feature which is already implemented. Just we need to take into account that:

In order to capture the off time window, we can follow two different approaches:

  1. whenever a node mounts the store protocol, it looks at the Unix timestamp of the most recent message in its DB, and considers it as the last time it has been on (thus queries all the messages after that time).
  2. the node keeps a persistent local time variable to keep track fo the last time it has been on. This second approach shrinks the size of the queried window so the chances of finding another peer who has been online during that entire window will be higher.

Potential problems

In this minimal solution, as the discovery is manual, we cannot guarantee 100% data availability. It is because the backup peers may also not have the messages the store peer is looking for. As such, the backup peers must be selected deliberately let's say based on their availability or geographical distribution.

Immediate improvements to the time window query

Future Steps and Consideration

The current issue is inline with https://github.com/vacp2p/research/issues/38

staheri14 commented 3 years ago

@cammellos Looking forward to your comments and thoughts on this!

cammellos commented 3 years ago

@staheri14 thanks, it looks like a solid foundation and very clear explanation! I don't have much to add as you seem to have covered all the fundamentals, only some observations that might be helpful to shape the future strategy (or not :) ).

Regarding clocks, in waku/whisper clocks are pseudo-synchronized (+/- 20 seconds), by only allow message flowing between peers that are within this span. The side effect of this is that temporal queries sort of works, as any message is +/- 20s of the wall clock, which makes querying by time effective (we always extend the range in the same way you described). Is there any similar mechanism in waku2? (The assumption is that we use sender time)

Another issue that waku/whisper addresses is injection of messages in the stream. Basically you need to "trust" history nodes as generally you would not accept messages older/newer than +/- 20s of your clock, but with history nodes that's necessary, as you'd be receiving older messages. This effectively makes history nodes kind of special nodes and not just any nodes.

It might be that these are not actual concerns, as it could be that waku-2 handles things quite differently, but worth pointing it out.

But looks all good in general, nice work!

staheri14 commented 3 years ago

Thanks a lot @cammellos for your feedback and comments!

Is there any similar mechanism in waku2? (The assumption is that we use sender time)

In general we do not timestamp messages in waku v2 so there is no filter on the coming messages based on their timestamp (@oskarth is that right?), We only use timestamp in the store protocol and for the pagination sake. Nevertheless, there the timestamp is generated by the receiver and not the sender.

staheri14 commented 3 years ago

Another issue that waku/whisper addresses is injection of messages in the stream. Basically you need to "trust" history nodes as generally you would not accept messages older/newer than +/- 20s of your clock, but with history nodes that's necessary, as you'd be receiving older messages. This effectively makes history nodes kind of special nodes and not just any nodes.

I agree with your point! we need to be aware of malicious nodes and take measures. This is also included in the "future step" part of the current issue and will certainly address it as the implementation grows to a more mature state. Thanks!

staheri14 commented 3 years ago

Another issue that waku/whisper addresses is injection of messages in the stream. Basically you need to "trust" history nodes as generally you would not accept messages older/newer than +/- 20s of your clock, but with history nodes that's necessary, as you'd be receiving older messages. This effectively makes history nodes kind of special nodes and not just any nodes.

Now I see your point (I had a different interpretation in my prior comment). Malicious store nodes can compromise the message history by injecting messages at a later time, you are absolutely right, this should not be allowed. I also took note of this in the current issue.

In the future steps, we are going to cover data data consistency across store nodes. I think in that phase we should/would be able to enforce/verify the integrity of the message history.

oskarth commented 3 years ago

l we do not timestamp messages in waku v2 so there is no filter on the coming messages based on their timestamp (@oskarth is that right?)

Correct. This can easily be added to https://rfc.vac.dev/spec/14/ though.

We could introduce it but make it a soft requirement initially w.r.t. store nodes and being pseudo clock synchronized. (This also plays into feedback from store nodes in terms of if a message is accepted or not (being too out of sync), which is a feature I assume the Core app may need.)

oskarth commented 3 years ago

@staheri14 what is missing in terms of research/specs for this?