waku-org / pm

Project management, admin, misc

To research: gossipsub message publish feedback and guarantees #16

Open richard-ramos opened 1 year ago

richard-ramos commented 1 year ago

Messages broadcast via gossipsub do not receive an ACK confirming that peers have received them. This is problematic in the following situation:

  1. You have a relay node connected to some peers.
  2. The node is completely disconnected from the internet (for example, for 1 hour).
  3. You attempt to send a message.

In go-waku (and, I imagine, nwaku too), the message is reported as sent successfully. Pubsub uses separate inbound and outbound streams for RPC messages, and the receiver does not acknowledge that it has received a message. This means that until a TCP write timeout happens, we think messages were sent successfully (in the meantime they are probably being buffered at the OS level).

In https://github.com/libp2p/specs/tree/master/pubsub#the-rpc we can see that communication happens by passing RPC messages, but the spec does not mention any ACK for these messages (which could be as simple as passing a single byte back to indicate that the message was received).

The write timeout can take several minutes to trigger, so we instead rely on the keep-alive loop, which pings the peers and treats a failed ping as a sign that the peer is disconnected. This can take ~40s, so there is a window during which messages are incorrectly marked as sent.
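To make that window concrete, here is a minimal sketch (not go-waku code; `SentMessage` and `suspect_messages` are hypothetical names) of how a node could flag which "sent" messages are suspect once a keep-alive ping fails. The assumption is that anything written after the last successful ping may still be sitting in an OS buffer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentMessage:
    msg_id: str
    sent_at: float  # seconds on a monotonic clock


def suspect_messages(sent: List[SentMessage], last_ping_ok_at: float) -> List[str]:
    """Return IDs of messages written after the last successful keep-alive ping.

    Assumption: a successful ping proves the connection was alive at that time,
    so only later writes are in doubt when a subsequent ping fails.
    """
    return [m.msg_id for m in sent if m.sent_at > last_ping_ok_at]


# Example: last good ping at t=30s, ping failure noticed around t=70s.
history = [SentMessage("a", 10.0), SentMessage("b", 35.0), SentMessage("c", 50.0)]
needs_review = suspect_messages(history, last_ping_ok_at=30.0)
```

Here `needs_review` contains "b" and "c": both were marked as sent during the detection window and would have to be downgraded to "unconfirmed".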

This issue was reported for mobile (which ideally should run the filter and lightpush protocols instead), but network connectivity issues can happen on desktop too, although less frequently.

Related issue: https://github.com/status-im/status-mobile/issues/14797

cc: @LNSD @kaiserd @Menduist @jm-clius @cammellos

cammellos commented 1 year ago

Just to add, from past experience, this is something that caused some headaches: not knowing whether a message has been published (especially on mobile networks) can result in message loss and a perception of unreliability. Desktop might be different since the connection is more stable, but it's worth thinking about.

Menduist commented 1 year ago

Sending an ACK for each message is going to be a bit heavy, as you'll at least have to send the message ID, which is 32 bytes (not heavy per se, but that is for each message). Also, you may lose the ACK but not the original message, in which case the original sender will think the message didn't get published even though it did, and will probably send a duplicate on the network later.

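Maybe instead:

  1. Keep a list of the last N sent messages
  2. When you get back online, query the store for these messages (need some form of unique ID)
  3. If some are missing, re-send

Or something like that?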
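A minimal sketch of the "remember the last N sent messages, check the store after reconnecting, re-send what is missing" idea. Everything here is hypothetical: `RecentlySent`, `query_store`, and `publish` are illustrative names, not the Waku Store or Relay API, and `query_store` is assumed to return the subset of IDs the store node already holds.

```python
from collections import deque
from typing import Callable, Deque, List, Set, Tuple


class RecentlySent:
    """Remembers the last `capacity` published messages (hypothetical helper)."""

    def __init__(self, capacity: int = 32) -> None:
        # A bounded deque drops the oldest entry automatically once it is full.
        self._recent: Deque[Tuple[str, bytes]] = deque(maxlen=capacity)

    def record(self, msg_id: str, payload: bytes) -> None:
        self._recent.append((msg_id, payload))

    def resend_missing(
        self,
        query_store: Callable[[List[str]], Set[str]],
        publish: Callable[[bytes], None],
    ) -> List[str]:
        """Call once connectivity is back; re-publishes whatever the store lacks."""
        ids = [msg_id for msg_id, _ in self._recent]
        found = query_store(ids)  # assumed: subset of `ids` known to the store
        resent = []
        for msg_id, payload in self._recent:
            if msg_id not in found:
                publish(payload)
                resent.append(msg_id)
        return resent
```

Note that this only detects a loss after the node is back online and has queried a store node, so it trades feedback latency for bandwidth.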

cammellos commented 1 year ago

> Also, you may lose the ACK but not the original message, and the original sender will think it didn't get published, even though it did. And will probably send a duplicate later on the network

Generally, from the app perspective, this is preferred: we'd rather re-transmit than drop.

Quick question though: if I re-transmit, the ID will be the same. Would the message be propagated in this case, or is there a cache kept by the relayer to avoid relaying duplicate messages?

In waku-1, that was a setting between two nodes: a node could tell the other whether it wanted confirmations, so it wasn't enabled across the whole network. That leaks metadata but saves some bandwidth.

We could also lower the 32-byte requirement and send back, say, only 4 bytes. That should be good enough since clashes are unlikely, but of course the toll on the network is still higher than with no ACK at all.
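As a rough check on "clashes are unlikely", the sketch below estimates the chance that two of the sender's own unacknowledged messages share the same truncated ID, using the standard birthday approximation. It assumes IDs are uniformly distributed hashes and that a short ACK only needs to be unambiguous among the sender's outstanding messages; `short_ack` and `collision_probability` are hypothetical helpers.

```python
import math


def short_ack(msg_id: bytes, ack_bytes: int = 4) -> bytes:
    """Truncate a full message ID to the prefix that would be sent back."""
    return msg_id[:ack_bytes]


def collision_probability(outstanding: int, id_bytes: int) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / (2 * 2^bits))."""
    space = 2 ** (8 * id_bytes)
    # -expm1(x) equals 1 - exp(x) and stays accurate for very small probabilities.
    return -math.expm1(-outstanding * (outstanding - 1) / (2 * space))


p_short = collision_probability(outstanding=100, id_bytes=4)
p_full = collision_probability(outstanding=100, id_bytes=32)
```

With 100 messages awaiting an ACK, `p_short` is on the order of one in a million, so a 4-byte prefix looks adequate under these assumptions.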

> Keep a list of the last N sent messages
> When you get back online, query the store for these messages (need some form of unique ID)
> If some are missing, re-send

I think this would be too slow from a UX perspective. I am going to consider mobile for now, although I understand that we'll use filter etc. there; this is just to give a baseline, and desktop might be different since connections are more stable.

User A sends a message to B. Say the message is important, and connection is flaky.

We want to make sure the user sees the message as not being dispatched, so they know that some action is required on their side (resend, make sure they are connected to the internet, etc.), and they don't just put the phone back in their pocket.

If we only pulled the next time we came online, we would have three options. We could show the messages as "not dispatched" until an online/offline event happens, which would be frustrating. We could query the store node after sending, which would increase load on store nodes. Or we could mark the message as "dispatched" and revert it to "not dispatched" when we are online again. But in that last case, the user would not realize that the message wasn't dispatched and would put the phone back in their pocket, while the message was never sent.

Sorry for the blurb, not sure it's clear :)

Menduist commented 1 year ago

> Quick question though, if I re-transmit, the id will be the same, would the message be propagated in this case?

My understanding is that you would need a new timestamp, which would mean the messages are not deduplicated, since they are no longer equal (hence retransmission can cause issues).

Good points though, will think more about it

LNSD commented 1 year ago

> My understanding is that you would need a new timestamp, which would mean that the messages are not deduplicated since they are not equal at this point (hence the fact that retransmission can cause issues)

Note that the Message Unique ID schema I proposed does not consider the timestamp field (it is not hashed). So it supports retransmissions without affecting the ID.
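To illustrate why leaving the timestamp out of the hash matters for retransmission, here is a minimal sketch. The exact fields of the proposed Message Unique ID are not given in this thread, so the choice below (pubsub topic, content topic, payload) is an assumption, and `message_id` and `SeenCache` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    content_topic: str
    payload: bytes
    timestamp_ns: int


def message_id(pubsub_topic: str, msg: Message) -> bytes:
    """Hypothetical ID: hashes topics and payload, deliberately NOT the timestamp."""
    h = hashlib.sha256()
    for part in (pubsub_topic.encode(), msg.content_topic.encode(), msg.payload):
        h.update(len(part).to_bytes(4, "big"))  # length prefix keeps fields unambiguous
        h.update(part)
    return h.digest()


class SeenCache:
    """Simplified duplicate filter; real relays typically also expire old entries."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, mid: bytes) -> bool:
        if mid in self._seen:
            return False
        self._seen.add(mid)
        return True


original = Message("/app/1/chat/proto", b"hello", timestamp_ns=1_000)
retry = Message("/app/1/chat/proto", b"hello", timestamp_ns=2_000)
same_id = message_id("/waku/2/default", original) == message_id("/waku/2/default", retry)
```

Under these assumptions `same_id` is true, so a relayer that already saw the first copy would drop the retry as a duplicate, while a relayer that never received it would forward it.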

LNSD commented 1 year ago

I think libp2p Gossipsub was designed with the "ideal" scenario in mind: cloud nodes with high availability and bandwidth. If we aim to increase the resiliency and robustness of Waku Relay, we need to spend some time understanding the scenarios and their requirements (e.g., a laptop connected to a Wi-Fi AP can be considered a mobile device).

Here I see two things:

  1. Publishing a message in libp2p Pubsub/Gossipsub is a "fire-and-forget" action. This lack of feedback is problematic: it prevents the caller from getting fast feedback on the publish action, so the caller cannot react to it (e.g., it is impossible to set a publish timeout).
  2. What should an application do when the publish fails? This is something the application should decide, depending on its domain (e.g., a messaging app, a blockchain node, a real-time online game, etc.). IMO, as a library, we should not decide. Waku should provide a competitive solution, and make decisions based on what we claim that Waku Relay adds on top of libp2p's Gossipsub.
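As a sketch of what the two points could look like from the caller's side: if the library exposed a publish call that completes only once some delivery signal arrives, the application could put its own timeout around it and decide what to do next. This is an assumption about a possible API shape, not something libp2p or Waku provides today; `publish_and_wait` and `PublishResult` are hypothetical.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class PublishResult(Enum):
    CONFIRMED = "confirmed"
    TIMED_OUT = "timed_out"


async def publish_with_timeout(
    publish_and_wait: Callable[[bytes], Awaitable[None]],
    payload: bytes,
    timeout_s: float,
) -> PublishResult:
    """Wrap a (hypothetical) awaitable publish so the caller gets fast feedback."""
    try:
        await asyncio.wait_for(publish_and_wait(payload), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The library only reports the outcome; the application decides whether
        # to retry, show "not dispatched", or drop the message.
        return PublishResult.TIMED_OUT
    return PublishResult.CONFIRMED
```

A messaging app might map `TIMED_OUT` to a "not dispatched" marker, while a real-time game might simply drop the message.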

Failure recovery, robustness and reliability often require extra bandwidth consumption. It is a toll to pay to have reliability. But different optimization strategies could be implemented (e.g., Gossipsub is already using RPC message piggybacking, NACKs, etc.).

P.S.: Adding an ACK is just a naive suggestion @richard-ramos and I discussed. Other strategies could be studied too.

LNSD commented 1 year ago

Regarding this:

> I am going to consider mobile for now, although I understand that we'll use filter/etc, but just to give a baseline, desktop might be different as more stable connections.

Note that there is a significant risk in using Waku Filter (a "server-push" protocol) in the mobile implementation for the same reasons we are discussing here (and many more). I already raised this concern in other forums.

> When you get back online, query the store for these messages (need some form of unique ID)

The next evolution of the Waku Store and Waku Archive (based on the Message Unique ID) aims to provide a "client-pull" polling alternative solution to the Waku Filter protocol.

cammellos commented 1 year ago

> Regarding this:
>
> > I am going to consider mobile for now, although I understand that we'll use filter/etc, but just to give a baseline, desktop might be different as more stable connections.
>
> Note that there is a significant risk in using Waku Filter (a "server-push" protocol) in the mobile implementation for the same reasons we are discussing here (and many more). I already raised this concern in other forums.
>
> > When you get back online, query the store for these messages (need some form of unique ID)
>
> The next evolution of the Waku Store and Waku Archive (based on the Message Unique ID) aims to provide a "client-pull" polling alternative solution to the Waku Filter protocol.

I'd be interested in knowing a bit more about the concerns. Do you have some links? Otherwise, maybe one day we can have a chat, if you don't mind. Thanks!

fryorcraken commented 1 year ago

Adding an ACK to libp2p-gossipsub seems very heavy indeed. As mentioned above, one could "request an ACK", but this clearly reveals that they are the original sender of the message and removes any privacy preservation Waku Relay may have.

What about just expecting to "see" the message within a given time frame?

For example, let's say we have a mesh with 8 peers. When sending a message, only send to 6 peers from the mesh.

Then, expect that within 5 seconds, one of the two other peers should forward you the message.

If it does not happen, then it would be safe to assume there was a transmission error and re-transmit.
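A minimal sketch of this "hold back two peers and wait for the echo" idea, with the 8/6/5-second numbers from the example as defaults. `EchoConfirmation` and its methods are hypothetical names, peer selection is random here as an assumption, and the caller is expected to feed in mesh peers, received messages, and the current time.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


class EchoConfirmation:
    """Tracks published messages until a held-back mesh peer forwards them back."""

    def __init__(self, witnesses: int = 2, window_s: float = 5.0,
                 rng: Optional[random.Random] = None) -> None:
        self.witnesses = witnesses
        self.window_s = window_s
        self._rng = rng or random.Random()
        # msg_id -> (peers we did NOT send to, deadline for seeing the echo)
        self._pending: Dict[str, Tuple[Set[str], float]] = {}

    def plan_publish(self, msg_id: str, mesh_peers: List[str], now: float) -> List[str]:
        """Pick the peers to send to; the rest act as witnesses for the echo."""
        count = min(self.witnesses, len(mesh_peers))
        held_back = set(self._rng.sample(sorted(mesh_peers), count))
        self._pending[msg_id] = (held_back, now + self.window_s)
        return [p for p in mesh_peers if p not in held_back]

    def on_message(self, msg_id: str, from_peer: str) -> None:
        """Call for every incoming message; an echo from a witness confirms it."""
        entry = self._pending.get(msg_id)
        if entry is not None and from_peer in entry[0]:
            del self._pending[msg_id]

    def overdue(self, now: float) -> List[str]:
        """IDs whose echo window has elapsed and that should be re-transmitted."""
        return [m for m, (_, deadline) in self._pending.items() if now >= deadline]
```

The application would call `overdue` periodically and re-publish (or flag) whatever it returns. Since no peer is asked for a confirmation, the sender does not reveal itself through an ACK request.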

fryorcraken commented 1 year ago

Notes from 21 Feb call.