vacp2p / rfc

Modular p2p messaging stack, with a focus on secure messaging.
https://rfc.vac.dev/

Protocol request: Direct group communication protocol for low-latency applications (<100ms) #446

Closed: oskarth closed this issue 3 months ago

oskarth commented 3 years ago

Problem

Some applications require lower-latency direct communication as a group. This can be due to a (soft) real-time communication requirement, for example video chat.

This can be either 1:1 or between a group of N participants.

Relay/Gossip latency

From https://research.protocol.ai/publications/gossipsub-v1.1-evaluation-report/

Gossipsub-v1.1 achieves timely delivery. In our test network, with 1k honest peers and connection RTTs of 100 ms, we have not found a case where the v1.1 protocol experienced delivery delays higher than 1.6 sec for the 99th percentile of the latency distribution, even in scenarios with a Sybil:honest connection ratio as high as 40:1. The maximum latency observed was about 5 s, but that affected a few messages while the system was recovering from an attack.

This is what we are working with. More benchmarking etc. can be done, but gossiping over multiple hops in an open network will always add some latency.

Example usage

Sketch

Basically, we want to trade off some metadata protection and flexibility for latency in a specific negotiated context.

We can use the relay protocol to discover peers to talk to, then negotiate a separate group context where all nodes can dial each other. Communication then happens directly within that context.

The simplest version would be, say, a 1:1 direct voice chat: initially via WebSockets, though WebRTC (or possibly QUIC?) would be useful for things like video chat in a browser. A rough signaling sketch follows below.
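A minimal sketch of what the offering side of that 1:1 negotiation could look like, assuming Python with aiortc for the WebRTC part; `signaling_send` and `signaling_recv` are hypothetical callbacks standing in for the relay/gossip channel used to exchange SDP blobs:

```python
from aiortc import RTCPeerConnection, RTCSessionDescription

async def start_direct_session(signaling_send, signaling_recv):
    """Negotiate a direct WebRTC connection with one discovered peer."""
    pc = RTCPeerConnection()
    # A data channel stands in here for future voice/video media tracks.
    channel = pc.createDataChannel("direct-chat")

    offer = await pc.createOffer()
    await pc.setLocalDescription(offer)
    await signaling_send(pc.localDescription.sdp)  # offer travels over relay

    answer_sdp = await signaling_recv()            # answer comes back over relay
    await pc.setRemoteDescription(
        RTCSessionDescription(sdp=answer_sdp, type="answer"))
    return pc, channel
```

The answering side would mirror this with `createAnswer()`; the point is that once peers have found each other over relay, the remaining negotiation is just a small SDP exchange.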

There may be some more infrastructure work needed on libp2p to make this suitable for voice/video; cc @dryajov re this.

The 100 ms target is based on general response-time limits (https://www.nngroup.com/articles/response-times-3-important-limits/) as well as intuition about things like FPS gaming for a "real-time feel".

Acceptance criteria

  1. An issue with a more limited scope for a PoC

  2. A better understanding of hard requirements and required work / reduced uncertainty on things like:


^ @D4nte @jm-clius @arnetheduck @staheri14 FYI

staheri14 commented 3 years ago

If the nodes in the limited context are assumed to be trusted and to have no churn, then, for a more advanced solution, you may want to consider a Kademlia routing overlay, which features lower storage overhead (logarithmic instead of linear) and logarithmic routing complexity. @oskarth

Update: I had another look at the issue; I think Kademlia might not be very relevant.

D4nte commented 2 years ago

Draft notes for potential bounty

Outcome sketch:

Could possibly split this up

User story: As a user of Waku, I should be able to find other nodes (e.g. in a chat) and then establish a direct WebRTC connection.

zah commented 2 years ago

Nimbus also has a use case for this: we would allow a group of Nimbus beacon nodes to work together in a way that ensures there is no single point of failure in the system. Low latency is key to ensuring that all validator actions are performed in time (so that validator rewards don't suffer as a result of latency), and Vac/Waku seems useful in that it may allow the group to be formed with almost zero network configuration. The nodes can form groups automatically based on the validator identities, and the user wouldn't have to deal with things such as public/private IP addresses, port forwarding, VPNs, etc.

D4nte commented 2 years ago

Thanks for the input @zah.

What is the current roadmap? Is this something you would like us to explore further?

fryorcraken commented 1 year ago

I wonder if the best way forward would be to create a nwaku PoC.

According to the requirements above and https://notes.status.im/waku-vac-devcon-2022#:

Nimbus is interested in an easy way for a group of beacon nodes to establish a direct P2P connection with low latency (e.g. it could be a WebRTC connection). The primary goal of operating such a group of nodes is to increase the resilience of the system (to remove single points of failure) and to address some concerns regarding possible DDoS attacks against a single beacon node. Since all nodes in the group will be owned by the same operator, there are no privacy concerns regarding the communication channel and having extremely low latency is the most desirable property. To get maximum safety, the nodes may be located in different data centers within a single city. Similarly, solo stakers may choose to run nodes from their homes, offices, etc, so another desirable property of the system is having zero configuration. This would be similar to a VPN network where the nodes can find each other regardless of the current physical network they are attached to. In Vac/Waku parlance, this could be considered a group channel with an automatically derived name (e.g. the name can be derived from the private key of the operated validators). In other words, the nodes would use the automatically determined channel to find each other to orchestrate the establishing of a full P2P mesh between them (potentially performing UDP hole punching in the process).

It seems that we still need NAT traversal/hole punching in nwaku/nim-libp2p for that first. @jm-clius, what is the status of this, and which issues are tracking it?

Some design assumptions:

  1. Symmetric-key encryption based on the validator's private key (e.g. a double hash of the private key)
  2. Content topics based on the validator's keys (e.g. a double hash of the public key)
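A minimal sketch of both derivations, assuming SHA-256 as the double hash and placeholder key bytes (the real values would come from the validator's keypair):

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

# Placeholder key material; in practice this comes from the validator keypair.
validator_privkey = b"\x11" * 32
validator_pubkey = b"\x22" * 33

sym_key = double_sha256(validator_privkey)           # assumption 1: encryption key
pubkey_hash = double_sha256(validator_pubkey).hex()  # assumption 2: topic component

disc_topic = f"/nimbus/0/disc/{pubkey_hash}/proto"
tx_topic = f"/nimbus/0/tx/{pubkey_hash}/proto"
```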

Possible protocol (Alice and Bob are different nodes run on behalf of the same validator, as described above); a rough code sketch follows the list:

  1. Alice connects to the Waku network
  2. Alice discovers her external IP address/port, e.g. via libp2p identify
  3. Alice broadcasts her ENR on the discovery content topic /nimbus/0/disc/<pubkey hash>/proto
  4. Bob connects to the Waku network and retrieves ENRs from the discovery content topic /nimbus/0/disc/<pubkey hash>/proto
  5. Bob connects directly to Alice
  6. Bob and Alice add each other as Waku Relay direct peers, to ensure they are in each other's gossipsub meshes (OR, light push is used for even more resilience)
  7. Alice and Bob use other content topics to exchange messages, i.e. /nimbus/0/tx/<pubkey hash>/proto
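A rough sketch of steps 1-5, using a deliberately hypothetical minimal client interface (`connect`, `publish`, `query`, and `dial` are stand-ins, not actual nwaku/js-waku API names):

```python
async def discover_and_connect(waku, my_enr: str, disc_topic: str):
    """Steps 1-5: join Waku, advertise our ENR, fetch the group's ENRs, dial direct.

    `waku` is a hypothetical client object; `my_enr` is this node's ENR,
    already populated with the external address found via libp2p identify.
    """
    await waku.connect()                       # steps 1/4: join the Waku network
    await waku.publish(disc_topic, my_enr)     # step 3: broadcast our ENR
    for enr in await waku.query(disc_topic):   # step 4: retrieve the group's ENRs
        if enr != my_enr:
            await waku.dial(enr)               # step 5: connect directly
    # Step 6 would then register these peers as Waku Relay direct peers.
```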

Other ideas:

cskiraly commented 1 year ago

A similar topic was going under the name "Application-Layer Multicast" some time ago. Focusing on low latency, I could point to deadline-based schedulers (Abeni, L., Kiraly, C., Lo Cigno, R. (2009). On the Optimal Scheduling of Streaming Applications in Unstructured Meshes) and some other work we did in low-latency video distribution. Basically, what you want to achieve is fast initial diffusion at the individual message level, plus some peer/chunk selection policy that cuts the tail of the delay distribution by making lagging messages "catch up".

This means both fast diffusion of each fresh message and a catch-up mechanism for lagging ones. These together can nicely reduce the overall latency distribution.
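A toy illustration of the deadline-based idea, assuming each chunk carries a playout deadline; all names here are illustrative, not taken from the cited paper:

```python
import heapq
import time

class DeadlineScheduler:
    """Earliest-deadline-first chunk scheduling: chunks closest to their
    deadline are sent first (so lagging messages catch up), and chunks past
    their deadline are dropped, cutting the tail of the delay distribution."""

    def __init__(self):
        self._queue = []  # heap of (deadline, chunk_id)

    def enqueue(self, chunk_id, deadline):
        heapq.heappush(self._queue, (deadline, chunk_id))

    def next_chunk(self, now=None):
        now = time.monotonic() if now is None else now
        while self._queue:
            deadline, chunk_id = heapq.heappop(self._queue)
            if deadline > now:   # still useful: send it next
                return chunk_id
            # already past its deadline: drop instead of wasting bandwidth
        return None
```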

zah commented 1 year ago

The usefulness of the proposed Nimbus setup increases dramatically when there are at least 3 nodes in the group (you would then use 2 out of 3 threshold signing to allow one of the nodes to be offline without disrupting the system). The ideal setup would involve 5 nodes configured with 3 out of 5 threshold signing.

Using the public key hash in the topic name is not an ideal solution as this would allow other nodes on the network to speculatively monitor all public keys to discover the ENRs of the participating nodes, but this is just a detail for which we'll surely find an appropriate solution.

A setup with more nodes won't improve reliability further, but latency will increase, so I think for our use case we care about group sizes of up to 5 nodes. Because of this, I think a full mesh would be the most appropriate topology (every node sends its own messages to all other nodes); see the quick arithmetic below.
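For scale, the full-mesh cost at these group sizes is tiny; simple arithmetic, assuming only that every node dials every other node:

```python
for n in (3, 5):
    links = n * (n - 1) // 2  # undirected connections in the full mesh
    sends = n - 1             # copies each node sends per own message
    print(f"n={n}: {links} links in total, {sends} sends per message per node")
# n=3: 3 links in total, 2 sends per message per node
# n=5: 10 links in total, 4 sends per message per node
```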

jm-clius commented 1 year ago

It seems that we still need NAT traversal/hole punching in nwaku/nim-libp2p for that first. @jm-clius, what is the status of this, and which issues are tracking it?

This is tracked as medium-to-high priority (my interpretation) in the nim-libp2p roadmap: https://github.com/status-im/nim-libp2p/issues/777

Menduist commented 1 year ago

nim-libp2p can already be used as a hole-punching server (AutoNAT & relay are available), but cannot hole punch itself (it is missing DCUtR for that).

fryorcraken commented 1 year ago

Such an enhancement would also be interesting for larger data transfers.

jimstir commented 3 months ago

Issue moved here