Open raulk opened 5 years ago
cc @whyrusleeping
Extend the signature of Publish() so it can take functional options to configure the reliability characteristics we want.
Note this is already available in #184
Make queue size configurable. It's currently hardcoded to 32, and cannot be changed.
woa I noticed this yesterday in a completely unrelated thing. What I saw is that hammering gossipsub results in messages seemingly not sent around (or dropped on receive).
TLDR: This the proposal does not fully solve this problem, but it's not clear it has to. If we want to more fully solve this problem there are more things we can talk through.
@hsanjuan, yep there are some things we can do here but there are some protocol level problems to tackle if we actually wanted to "solve" this problem (see: #197 for more about the problem).
For example, even if WaitUntilSent
was implemented such that it waited until the other party confirmed receipt of the data a couple things could still go wrong.
1) The receiver could crash, leaving the message effectively unsent
2) The receiver could not have WaitUntilSent
activated (or not with the same parameters) meaning that while your peers may receive the data their peers might not.
topicID = fn(topicName + options)
)As was mentioned in libp2p/go-libp2p#721 the current answer to dropped messages is to say "pubsub does not guarantee message delivery, if you want that layer something on top of pubsub", but that if we have enough users asking for guaranteed message delivery we should probably come up with something for them.
I don't think this response is super unreasonable and we already have some solutions (e.g. go-libp2p-pubsub-router) that handle the case of guaranteed message delivery. However, there are some protocol extensions we could look into like sending an ACK to the peer that sent us a message when once we've sent the message out or we failed/are out of peers.
Pubsub indeed does not guarantee message deliverability in terms of distributed behaviour, but in terms of local behaviour, it should absolutely guarantee that a message you intended to publish has actually left your box.
A lot of this problem can be solved by not hardcoding some channels' length to things like 32 (setting way higher, configurable defaults to give some space for bursts), and by complaining loudly when a message is dropped because channels are full. Also, more documentation about the implementation.
@raulk @aschmahmann I completely understand what we want to accomplish here. Please can I start working on the WaitUntilQueued
& WaitUntilSent
options ? Will share some notes/code soon.
@aarshkshah1992 fine by me, not sure if @raulk has any thoughts/concerns here.
Adding a thought about the WaitUntilQueuedALL
behaviour.
In the current PR, the WaitUntilQueuedALL
fails if we drop even a single message in the event loop. I think this is the ideal behaviour we want because in this case, it is hard for the publisher to set a expected count & stop when that target is achieved because the state of the mesh/fanout could change when we are actually writing to the outbound queues in the event loop.
However, this is very tricky to achieve if we don't want the writer/event loop to block or spin up a go-routine to do a blocking write because we could end up a dropping a "failure event" & the publish would wrongly conclude that everything is okay.
So, we might have to trade off this strong behaviour for something lighter if we want to avoid all forms of blocking.
@aschmahmann might have some opinions on this.
@raulk I'm running into some problems with this in go-libp2p-pubsub-router which builds a persistent value store on top of pubsub.
I'd like to solve the "messages can be dropped silently" by having pubsub emit an event that we dropped a message.
Aggregate RPCs waiting in the queue. Messages are associative and can be merged, in theory.
Sounds good, although it'd be really nice if the merge function was configurable instead of just concatenating the messages.
@vyzo this seem like a reasonable plan?
Currently if the outbound message queues for all peers in a mesh/fanout are full, gossipsub drops the message silently, which has already bitten us various times (in stress testing, and during actual usage).
This happens here: https://github.com/libp2p/go-libp2p-pubsub/blob/master/gossipsub.go#L359
Proposal
[ ] Extend the signature of
Publish()
so it can take functional options to configure the reliability characteristics we want.WaitUntilQueued(One | All | float64)
=> blocks the publish until the message has been queued to as many peers as the arg indicates. Can be implemented easily.WaitUntilSent(One | All | float64)
=> blocks the publish until the message has actually been dispatched to as many peers as the arg indicates. Requires deeper changes, but reliability is higher.[ ] Aggregate RPCs waiting in the queue. Messages are associative and can be merged, in theory.
[x] Make queue size configurable. It's currently hardcoded to 32, and cannot be changed.
Affected users
Filecoin, ETH2 gossipsub testing.