libp2p / go-libp2p-pubsub

The PubSub implementation for go-libp2p
https://github.com/libp2p/specs/tree/master/pubsub
Other
328 stars 186 forks source link

Published messages can be dropped silently #217

Open raulk opened 5 years ago

raulk commented 5 years ago

Currently if the outbound message queues for all peers in a mesh/fanout are full, gossipsub drops the message silently, which has already bitten us various times (in stress testing, and during actual usage).

This happens here: https://github.com/libp2p/go-libp2p-pubsub/blob/master/gossipsub.go#L359

Proposal

Affected users

Filecoin, ETH2 gossipsub testing.

raulk commented 5 years ago

cc @whyrusleeping

aschmahmann commented 5 years ago

Extend the signature of Publish() so it can take functional options to configure the reliability characteristics we want.

Note this is already available in #184

hsanjuan commented 5 years ago

Make queue size configurable. It's currently hardcoded to 32, and cannot be changed.

woa I noticed this yesterday in a completely unrelated thing. What I saw is that hammering gossipsub results in messages seemingly not sent around (or dropped on receive).

aschmahmann commented 5 years ago

TLDR: This the proposal does not fully solve this problem, but it's not clear it has to. If we want to more fully solve this problem there are more things we can talk through.

@hsanjuan, yep there are some things we can do here but there are some protocol level problems to tackle if we actually wanted to "solve" this problem (see: #197 for more about the problem).

For example, even if WaitUntilSent was implemented such that it waited until the other party confirmed receipt of the data a couple things could still go wrong.

1) The receiver could crash, leaving the message effectively unsent 2) The receiver could not have WaitUntilSent activated (or not with the same parameters) meaning that while your peers may receive the data their peers might not.

As was mentioned in libp2p/go-libp2p#721 the current answer to dropped messages is to say "pubsub does not guarantee message delivery, if you want that layer something on top of pubsub", but that if we have enough users asking for guaranteed message delivery we should probably come up with something for them.

I don't think this response is super unreasonable and we already have some solutions (e.g. go-libp2p-pubsub-router) that handle the case of guaranteed message delivery. However, there are some protocol extensions we could look into like sending an ACK to the peer that sent us a message when once we've sent the message out or we failed/are out of peers.

raulk commented 5 years ago

Pubsub indeed does not guarantee message deliverability in terms of distributed behaviour, but in terms of local behaviour, it should absolutely guarantee that a message you intended to publish has actually left your box.

hsanjuan commented 5 years ago

A lot of this problem can be solved by not hardcoding some channels' length to things like 32 (setting way higher, configurable defaults to give some space for bursts), and by complaining loudly when a message is dropped because channels are full. Also, more documentation about the implementation.

aarshkshah1992 commented 5 years ago

@raulk @aschmahmann I completely understand what we want to accomplish here. Please can I start working on the WaitUntilQueued & WaitUntilSent options ? Will share some notes/code soon.

aschmahmann commented 5 years ago

@aarshkshah1992 fine by me, not sure if @raulk has any thoughts/concerns here.

aarshkshah1992 commented 4 years ago

Adding a thought about the WaitUntilQueuedALL behaviour.

In the current PR, the WaitUntilQueuedALL fails if we drop even a single message in the event loop. I think this is the ideal behaviour we want because in this case, it is hard for the publisher to set a expected count & stop when that target is achieved because the state of the mesh/fanout could change when we are actually writing to the outbound queues in the event loop.

However, this is very tricky to achieve if we don't want the writer/event loop to block or spin up a go-routine to do a blocking write because we could end up a dropping a "failure event" & the publish would wrongly conclude that everything is okay.

So, we might have to trade off this strong behaviour for something lighter if we want to avoid all forms of blocking.

@aschmahmann might have some opinions on this.

aschmahmann commented 4 years ago

@raulk I'm running into some problems with this in go-libp2p-pubsub-router which builds a persistent value store on top of pubsub.

I'd like to solve the "messages can be dropped silently" by having pubsub emit an event that we dropped a message.

Aggregate RPCs waiting in the queue. Messages are associative and can be merged, in theory.

Sounds good, although it'd be really nice if the merge function was configurable instead of just concatenating the messages.

@vyzo this seem like a reasonable plan?