libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io
1.58k stars 275 forks source link

Gossipsub mesh does not converge #215

Closed aschmahmann closed 3 years ago

aschmahmann commented 5 years ago

The rules for gossipsub's GRAFT and PRUNE messages are:

If the mesh is a star with a single node connected to D_high + 1 nodes then the center node will be fighting to PRUNE nodes while the edge nodes GRAFT. This will go on indefinitely.

While it's not the end of the world to send these extra messages 1/second (current heartbeat), it's definitely a waste of bandwidth for basically no reason. The "correct" answer is probably to have some tree forming algorithm as part of gossipsub. However, for now it'd be pretty straightforward to implement a backoff system where if A sends GRAFT to B and B sends PRUNE to A that A waits a while before trying to send GRAFT again.

Thoughts @raulk @Stebalien @vyzo ?

raulk commented 5 years ago

My off-the-cuff thought here is that it's difficult to create peer-to-peer gossip algorithms that operate optimally for all possible topologies.

In this particular case, the hub is deliberately acting like a broker, and therefore you may want to increase D for it (it should be configurable). Ideally you'd maintain D in lockstep with the peer count the hub observes, but I don't know if it's adjustable during runtime.

Essentially what you're getting by using gossipsub in this design is the topic-based messaging functionality. You're not interested in the routing, because that's always static.

aschmahmann commented 5 years ago

it's difficult to create peer-to-peer gossip algorithms that operate optimally for all possible topologies.

Sure, although if you implemented a backoff so peers stopped GRAFTing peers that just PRUNEd them that would at least give some convergence.

I don't know if it's adjustable during runtime

In Go it's a global variable, so it's situational (e.g. I can't run two gossipsub topics, or even instances, with different D)

You're not interested in the routing, because that's always static.

I suspect many users will not know their exact network topology ahead of time and will rely on various libp2p Discovery mechanisms to find peers to connect to. The chaotic nature of just discovering peers instead of carefully GRAFTing them into an optimal topology can certainly result in this GRAFT-PRUNE loop emerging (especially when you take into account NAT'd nodes connecting to the rest of the network).

vyzo commented 4 years ago

This is resolved by gossipsub v1.1, which is the first iteration moving to episub. See https://github.com/libp2p/go-libp2p-pubsub/pull/234

aschmahmann commented 4 years ago

@vyzo this is fixed now, right?

vyzo commented 4 years ago

Yes, if you enable peer exchange (WithPeerExchange(true)), then the mesh will converge.

mxinden commented 3 years ago

Considering this fixed with https://github.com/libp2p/go-libp2p-pubsub/pull/234 and gossipsub v1.1. Closing. Please comment in case I am missing something.