libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io
1.58k stars 275 forks source link

Gossipsub extension for Epidemic Meshes (v1.2.0) #413

Open AgeManning opened 2 years ago

AgeManning commented 2 years ago

This is a minimal extension aiming to minimize duplicates and latency on the mesh networks in gossipsub v1.1.0.

This is a working draft at the moment. Simulations and implementations will be added as we go.

All feedback welcome.

cskiraly commented 2 years ago

Hey, nice idea with CHOKE/UNCHOKE! Duplicates are a real pain and will become more of so with larger degrees and/or larger messages.

Let me use this thread to say that I was playing with two different ideas to reduce duplicates. Let me know if you think they make sense and can be somehow be combined with yours.

vyzo commented 2 years ago

It should be added that this is based on a sync session we had with Andrian, so it is not coming out of the blue.

Menduist commented 2 years ago

Hi, nice addition! A few remarks:

If it's backward compatible, why advertise another Protocol Id? I don't think it's super necessary to know wether a peer handles the choking or not, so we could just send it optimistically, no?

I'm not super convinced about the unchoking mechanism, if we have to rely on IHAVEs it might already be "too late". Maybe we could do something similar to bittorrent, with period based choking, or optimistic unchoking?

Regarding simulation, it seems easy enough to try this algorithm on real networks, by "virtually choking" a peer, and compute the bandwidth gain / latency loss. It would also help us tune the unchoking mechanism

Last word regarding the spec format, this is starting to get a bit messy imo: there is gossipsub 1.1 which references gossipsub 1.0, which references pubsub. To have the full spec, I will soon need to open 4 spec. Versionning is also suprising, if it's backward compatible, why call it 2.0.0? It's also not consistent with v1.1 If we go the 2.0.0 route, I would rather make this spec "standalone", ie not relying on gossipsub 1.0 and 1.1 Otherwise, this spec could be a simple "addendum" to the 1.0 spec, but we should find a better way to version thoses

vyzo commented 2 years ago

we need the protocol id change to signify feature support.

vyzo commented 2 years ago

Pethaps it could be called v1.2.0 however.

AgeManning commented 2 years ago

Hey all. Thanks for the responses.

Sorry wasn't expecting it to get much attention. This is based on a call with @vyzo and this PR was mainly just a format to share a draft of the ideas @vyzo and I discussed that could be easily commented on.

The format, structure and versioning etc, we can change, I dont imagine the PR as described here will resemble the final result.

I'm indifferent about the versioning, but agree the 2.0.0 that I suggested isn't consistent.

The first one is to be able to set up a filter in your peers. This depends in message ID structure, but with a simple example you could set up some peers to send only ID == 0 mod 2, while others to send only ID ==1 mod 2. This would be a rather lightweight, and allow fine-grained control. Similarly to CHOKE, IHAVEs could be turned on, maybe with a flag. I think this can be defined as a relatively simple extension of your mechanism.

If I understand this, the application chooses (maybe dynamically) which messages nodes propagate on the mesh? It sounds like depending on the message you're segregating the mesh overlay into smaller overlays? I don't directly see the benefit to this vs just making the mesh size smaller? I may have misunderstood however.

The second is a much bigger change. It is based on the observation that most of the duplicates are generated towards the end of the diffusion graph of a single message (note that each message has its own diffusion graph). The main idea is to reduce the duplicates in this last phase of diffusion. Unfortunately, to do this the sender node would have to know somehow that it is in the last phase of diffusion. The idea, in short, is to introduce hop count and when hop count is above a certain threshold, do probabilistic forwarding. Unfortunately hop count goes against some of the privacy guarantees, so this idea obviously needs some more work :-)

Interesting idea. As you've pointed out the privacy issues in this model would not work for my particular use-case but it may work for others.

I'm not super convinced about the unchoking mechanism, if we have to rely on IHAVEs it might already be "too late". Maybe we could do something similar to bittorrent, with period based choking, or optimistic unchoking?

Good point. I'm also unsure how this will behave if we choke too aggressively. Ideally I was trying to hit a set of parameters that make it not so network dependent, such that standard default values could benefit most networks. There's a number of levers we can use to adjust the speed of unchoking based on gossip still however. For example, we could have very fast heartbeats (lots of gossip), and maybe have a different time scale at which we decide on unchoking based on the gossip. I put the thresholds in there as a placeholder for simulations, but potentially lowering those could also behave like optimistic unchoking. Also happy to put explicit optimistic unchoking in if we think its required.

I was hoping simulations could give us more insights on things lacking with this proposal.

cskiraly commented 2 years ago

The first one is to be able to set up a filter in your peers. This depends in message ID structure, but with a simple example you could set up some peers to send only ID == 0 mod 2, while others to send only ID ==1 mod 2. This would be a rather lightweight, and allow fine-grained control. Similarly to CHOKE, IHAVEs could be turned on, maybe with a flag. I think this can be defined as a relatively simple extension of your mechanism.

If I understand this, the application chooses (maybe dynamically) which messages nodes propagate on the mesh? It sounds like depending on the message you're segregating the mesh overlay into smaller overlays? I don't directly see the benefit to this vs just making the mesh size smaller? I may have misunderstood however.

OK, I think I wasn't clear on the idea. This would be just a local downlink filter, valid for only that specific link. Therefore, it is not splitting the mesh, just "specializes" some links to a subset of the traffic. You could have a mesh link on which you send all and asked to receive only id==1 mod 2, and another link where you were asked to send only id=0 mod 3. You might be right that you can look at this as having 6 different meshes (mod 6 in the above example), but I think you are amortizing large part of the maintenance cost while you can achieve some dynamic behaviour that allows going for larger D.

This could be useful because it provides a more fine-grained control than pure choking/unchoking. It could also fight one of the main reasons for duplicates and thus choking: propagation latency due to bottlenecks and thus queues. I hope it is clear now. The filter is just the protocol primitive, how it is regulated would indeed need lots of work.

The second is a much bigger change. It is based on the observation that most of the duplicates are generated towards the end of the diffusion graph of a single message (note that each message has its own diffusion graph). The main idea is to reduce the duplicates in this last phase of diffusion. Unfortunately, to do this the sender node would have to know somehow that it is in the last phase of diffusion. The idea, in short, is to introduce hop count and when hop count is above a certain threshold, do probabilistic forwarding. Unfortunately hop count goes against some of the privacy guarantees, so this idea obviously needs some more work :-)

Interesting idea. As you've pointed out the privacy issues in this model would not work for my particular use-case but it may work for others.

On this I've actually started simulations some time ago using the code from @vyzo. My changes are very rudimentary, but adds the hopcount and timestamps https://github.com/cskiraly/gerbil-simsub/commits/hopcount This is indeed for other uses that are not that privacy sensitive. However, the hopcount is still useful in simulations to understand the diffusion graph of messages better.

vyzo commented 2 years ago

So there has been some progress in the self-appointed Episub Working Group (if you are maintaining a gossipsub implementation and you are interested in implementing episub, you are welcome to join us).

The simsub simulator now supports preliminary episub, with a number of choking strategies. I have run a number of simulations, with the results here: https://gist.github.com/vyzo/77251c89950ef0d9ac422cfce8e5a09e

Note that the simulation are 100 node ones, because my computer is too slow for larger ones and begins to lag; feel free to run your own sims and post results. The gist is however, that the thing generally works -- we can get very significant reduction in bandwidth as far as data messages are concerned.

vyzo commented 2 years ago

An update: I have added a virtual time scheduler to simsub in https://github.com/vyzo/gerbil-simsub/pull/7, which should allow it to scale simulations to large number of nodes bound only by available memory. I should have some more results tomorrow.

vyzo commented 2 years ago

Here are the result from 250 and 500 node runs:

Update:

vyzo commented 1 year ago

We should target this to #560

AgeManning commented 1 year ago

Yep. I have been following these changes. Although I have a working rust version of this, it seems we should hold off until the other changes are added first which have better modelling done on them at the moment.