libp2p / go-libp2p-pubsub

The PubSub implementation for go-libp2p
https://github.com/libp2p/specs/tree/master/pubsub

Offloading messages for async validation #169

Open raulk opened 5 years ago

raulk commented 5 years ago

From @arnetheduck (Nimbus, ETH 2.0 client):

@raulk we discussed topic validation in libp2p as a way to prevent bad information from spreading across the gossipsub network. From what I can tell, though, the block propagation filtering method in libp2p that you pointed me to is synchronous (https://github.com/libp2p/go-libp2p-pubsub/blob/bfd65a2f6b810c5b4ad2cfe6bb9cc792fd7a0171/floodsub_test.go#L360). This might not sit well with block validation, where we might want to hold off on gossiping a block until we've verified it against data that we receive later. How would you recommend we approach this?

here's the scenario in detail:

  • over gossip, we receive a block whose parent we're missing
  • worst case, this means we cannot yet tell if it's a good / useful block or not
  • we don't want the block to be gossiped further until we've recovered its parent to ensure that it's sane. once we do know it's sane, we want to pass it on.

To summarise:

  1. Validation can be costly, or in some scenarios not feasible to perform synchronously.
  2. Is it feasible to consume the message, do validation offline, then republish it? How does that affect message caches and duplicate detection across the network (e.g. if we send the message to peers who had already seen it -- and possibly even propagated it, if they had more complete data than us)? Do we generate a new message ID?
  3. What are the differences on the wire between publishing a message afresh, and spreading a gossiped message?

In a nutshell: is it possible to offload a message from the pubsub router for async validation, then resume its gossiping conditionally?

vyzo commented 5 years ago

Validation is run asynchronously in a background goroutine.

raulk commented 5 years ago

@vyzo the concern is not with blocking the gossip thread. The use case is that validation of message M is co-dependent on other messages M’ that may have arrived previously, but may not have. If they didn’t, the client can pull them from the network. That process can stretch the validation of message M to seconds or more. All the while, gossipsub has a 150ms validation timeout, as well as a throttling mechanism for concurrent validations.

Would you mind addressing the questions above so we can all gain more clarity on this scenario? Thanks.
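
For context on the timeout and throttle mentioned above, here is a minimal sketch of how a topic validator is registered against the current API. RegisterTopicValidator, WithValidatorTimeout and WithValidatorConcurrency exist in go-libp2p-pubsub, but exact signatures and import paths vary between versions, and validateBlock is a hypothetical application helper:

package example

import (
    "context"
    "time"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    "github.com/libp2p/go-libp2p-core/peer"
)

// validateBlock is a stand-in for application-specific logic; it would fail
// if, for example, the block's parent is unknown.
func validateBlock(data []byte) bool {
    return len(data) > 0
}

func registerBlockValidator(ps *pubsub.PubSub) error {
    return ps.RegisterTopicValidator("blocks",
        func(ctx context.Context, from peer.ID, msg *pubsub.Message) bool {
            // Must return within the validation deadline; there is no room
            // here to fetch a missing parent from the network.
            return validateBlock(msg.Data)
        },
        pubsub.WithValidatorTimeout(150*time.Millisecond), // per-message deadline
        pubsub.WithValidatorConcurrency(32),               // throttles concurrent validations
    )
}

The validator has to return within the configured deadline, which is exactly why it cannot recover a missing parent from the network while the message is held by the router.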

vyzo commented 5 years ago

With the current implementation it's not possible. With quite a bit of work it may be possible.

raulk commented 5 years ago

Ok, so the validator would have to fail when it enters the non-deterministic scenario. We’d need a callback for failed validations, so that those messages can be processed separately.

Once we’re able to validate the message, we’d have to republish it. What’s the trade-off in terms of amplification and dedup? (It’s still the same message)

vyzo commented 5 years ago

It's a rather complex change to implement. The trade-off is that message propagation would be very slow, as a message wouldn't be forwarded until it could be validated.

raulk commented 5 years ago

I think that tradeoff is known and accepted. They basically want nodes to forward only messages whose correctness can be verified against past state (e.g. one block depends on its parent). Since they’re async and eventually consistent, it’s possible that gossiped stuff arrives out of order. Also it’s possible that gossips never arrive, correct?

That’s OK. I’m more worried about the extra amplification: the seen cache could’ve slid before the message is republished, so it could traverse the entire network again because gossipsub wouldn’t dedup it; they’d have to dedup in their own logic.

When you publish a message, can you force the original message ID?
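
One way to make the ID survive a republish is to derive it from the payload rather than from the sender and sequence number. go-libp2p-pubsub exposes this via the WithMessageIdFn option (in versions that have it); a minimal sketch, with contentID as a hypothetical helper:

package example

import (
    "context"
    "crypto/sha256"
    "encoding/hex"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    pb "github.com/libp2p/go-libp2p-pubsub/pb"
    "github.com/libp2p/go-libp2p-core/host"
)

// contentID hashes the payload, so the same block always maps to the same ID,
// regardless of who published it or when.
func contentID(pmsg *pb.Message) string {
    h := sha256.Sum256(pmsg.GetData())
    return hex.EncodeToString(h[:])
}

func newRouter(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
    return pubsub.NewGossipSub(ctx, h, pubsub.WithMessageIdFn(contentID))
}

With content-derived IDs, a republished block carries the same ID it had on first arrival, so peers that already saw it would treat it as a duplicate (within the seen-cache window).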

arnetheduck commented 5 years ago

Re dedup, I don't think any sane eth2 client will rely on libp2p-level dedup - we have a block merkle root by which we identify the payload, both when requesting them and when receiving them from the network - this root is persistent across sessions.

I'd regard that part of the protocol as a nice-to-have optimization, nothing else. In fact, I find it hard to imagine an application that relies on once-only ordered delivery on top of a gossip setting and is correct at the same time.

Perhaps the right thing to do here is simply not to broadcast the message again. It's kind of natural that broadcasts are ephemeral, and trying to get that behavior from a gossip network goes against its grain somewhat.

It does raise an interesting question: how would a sat-link connection with high latency affect the system? How is the cache timeout tuned? The problem can happen naturally, in the wild, as well.

raulk commented 5 years ago

I’m talking about dedup insofar as controlling amplification is concerned, @arnetheduck. This is important to prevent cycling.

raulk commented 5 years ago

(Of course apps should ensure idempotency when relying on pubsub.)

arnetheduck commented 5 years ago

I’m talking about dedup insofar as controlling amplification is concerned, @arnetheduck. This is important to prevent cycling.

yeah, sorry for being unclear there: that's what I was alluding to with the sat-link question - how is the anti-cycling tuned with respect to high-latency links?

raulk commented 5 years ago

Right now it's not adaptive. We should explore this case together ;-) @arnetheduck

raulk commented 5 years ago

Copying over from the ethresearch/p2p Gitter thread:

Kevin Mai-Husan Chia @mhchia 12:07
We can use a Validator to validate received content and return a boolean telling the pubsub router whether to relay it. IMO, in simple cases the current structure is enough for our usage. However, as pointed out in the discussion, later blocks might be received before their parent blocks. The Validators run for those "orphan blocks" will then block and eventually time out. Even without the timeout, the number of concurrent Validators might grow too large.

Raúl Kripalani @raulk 12:11
@mhchia thanks for rescuing that thread! A change to make validation async would be welcome. It wouldn't be too difficult. There's already an abstraction for datastores, so you would inject the datastore into the pubsub router, have it persist messages it is unable to validate instantaneously, then spawn the validation job and report the result to the router later. We'd need some form of GC to drop persisted messages after a grace period, if the validation result never arrived.

raulk commented 5 years ago

By popular demand, we need to take this up; see #172. I have a design in mind which I’ll post later, as I’m on mobile now.

raulk commented 5 years ago

An async validator feature could look like this:

type AsyncValidationResult struct {
    Msg    *pubsub.Message // the message that was queued for validation
    Result error           // nil if validation succeeded; the failure reason otherwise
}

type AsyncValidator interface {
    // Queue queues a message for future validation. If the returned error is nil, the
    // implementation promises to validate the message and deliver the result on the
    // supplied channel at a later time.
    //
    // The async validator is responsible for offloading the message from memory when
    // appropriate. It can use a Datastore or some other medium for this.
    Queue(ctx context.Context, msg *pubsub.Message, resp chan<- AsyncValidationResult) error
}

We'd need to work out how offloading a message would impact message caches and sliding windows.
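
Purely to illustrate the shape of the proposal, a toy in-memory implementation of the hypothetical AsyncValidator interface above could look like the fragment below (assuming the types above plus the standard context and pubsub imports); a real implementation would offload pending messages to a Datastore and GC them after a grace period:

type inMemoryAsyncValidator struct {
    validate func(*pubsub.Message) error // application-supplied check; may block for a long time
}

func (v *inMemoryAsyncValidator) Queue(ctx context.Context, msg *pubsub.Message, resp chan<- AsyncValidationResult) error {
    go func() {
        // Run the (possibly long) validation off the router's hot path.
        err := v.validate(msg)
        select {
        case resp <- AsyncValidationResult{Msg: msg, Result: err}:
        case <-ctx.Done():
            // The router gave up waiting; drop the result.
        }
    }()
    return nil
}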

vyzo commented 5 years ago

The seen cache would be most severely impacted, as messages could be rebroadcast into the network well after the 120s cache duration has elapsed. We need to consider the effects of this.

vyzo commented 5 years ago

In terms of structure, we can add an API for forwarding prepared messages (i.e. messages published by someone else, already signed). This way we can offload the message for async validation; when the validator has completed, it can forward the message using the new API.
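
No such API exists yet; a hypothetical usage sketch, with ForwardPrepared as an invented name and signature purely for illustration, might look like this:

// Hypothetical: called when the async validator reports a result. ForwardPrepared
// is not part of go-libp2p-pubsub as of this thread; it would re-inject the
// already-signed message, preserving the original sender, seqno and signature.
func onValidated(ctx context.Context, ps *pubsub.PubSub, res AsyncValidationResult) {
    if res.Result != nil {
        return // validation failed; drop the message instead of forwarding it
    }
    _ = ps.ForwardPrepared(ctx, res.Msg) // hypothetical API
}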

vyzo commented 5 years ago

#176 supports long-running validators in the simplest possible manner:

It removes the default (short) timeout and allows validators to run arbitrarily long, without any need for API changes or complex contraptions.

vyzo commented 5 years ago

Note that you need to adjust the time cache duration accordingly.

On the other hand there is still a use case for completely offline validators, which could take days to complete.
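
On the point about adjusting the time cache: the seen-message window is a package-level variable in go-libp2p-pubsub (TimeCacheDuration, defaulting to 120s). A minimal sketch of widening it before constructing the router, assuming the library still exposes it this way (newer versions may offer an Option instead, and import paths may differ):

package example

import (
    "context"
    "time"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    "github.com/libp2p/go-libp2p-core/host"
)

func newRouterWithLongerCache(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
    // Must be set before the PubSub instance is constructed.
    pubsub.TimeCacheDuration = 10 * time.Minute
    return pubsub.NewGossipSub(ctx, h)
}

Widening the window trades additional memory in the seen cache for a longer dedup horizon, which matters when validation can delay forwarding well past the default 120s.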