Open raulk opened 5 years ago
Validation is run asynchronously in a background goroutine.
@vyzo the concern is not with blocking the gossip thread. The use case is that validation of message M is co-dependent on other messages M’ that could’ve arrived previously, but may not have. If they didn’t, the client can pull them from the network. That process can stretch validation of message M to seconds or more. All the while, Gossipsub has a 150ms validation timeout, and also a throttling gadget.
Would you mind addressing the questions above so we can all gain more clarity on this scenario? Thanks.
With the current implementation it's not possible. With quite a bit of work it may be possible.
Ok, so the validator would have to fail when it enters the non-deterministic scenario. We’d need a callback for failed validations, so that those messages can be processed separately.
Once we’re able to validate the message, we’d have to republish it. What’s the trade-off in terms of amplification and dedup? (It’s still the same message)
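To make the "fail, park, retry later" flow concrete, here is a rough Go sketch. Everything in it is hypothetical: `Block`, `node`, `ErrMissingParent` and the `onDeferred` callback are illustration names, not part of go-libp2p-pubsub, and the "forward" decision is reduced to a boolean.

```go
package main

import (
	"errors"
	"fmt"
)

// Block is a hypothetical application payload; Parent names the block it builds on.
type Block struct {
	ID, Parent string
}

// ErrMissingParent marks the undecidable case: the block may well be valid,
// but the state needed to check it has not arrived yet.
var ErrMissingParent = errors.New("parent not yet known")

// node is a toy client. onDeferred is the hypothetical callback for
// messages whose validation could not be decided.
type node struct {
	known      map[string]bool
	onDeferred func(Block)
}

// validate accepts a block only if its parent (if any) is already known.
func (n *node) validate(b Block) error {
	if b.Parent != "" && !n.known[b.Parent] {
		return ErrMissingParent
	}
	n.known[b.ID] = true
	return nil
}

// handle reports whether the block should be forwarded right now.
// Undecidable blocks are handed to onDeferred and not forwarded.
func (n *node) handle(b Block) bool {
	err := n.validate(b)
	if errors.Is(err, ErrMissingParent) {
		n.onDeferred(b)
		return false
	}
	return err == nil
}

func main() {
	var parked []Block
	n := &node{
		known:      map[string]bool{},
		onDeferred: func(b Block) { parked = append(parked, b) },
	}
	fmt.Println(n.handle(Block{ID: "b2", Parent: "b1"})) // false: parent missing, block is parked
	fmt.Println(n.handle(Block{ID: "b1"}))               // true: no parent required
	fmt.Println(n.handle(parked[0]))                     // true: parent has arrived in the meantime
}
```

The last call stands in for the republish step; whether that republish gets deduplicated by the network is exactly the open question.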
It's a rather complex change to implement. The trade off is that the message propagation would be very slow, as it wouldn't be forwarded until it could be validated.
I think that tradeoff is known and accepted. They basically want nodes to forward only messages whose correctness can be verified against past state (e.g. one block depends on its parent). Since they’re async and eventually consistent, it’s possible that gossiped stuff arrives out of order. Also it’s possible that gossips never arrive, correct?
That’s ok. I’m more worried about the extra amplification: the message cache could’ve slid before the message is republished, so it could reach the entire network again. Gossipsub wouldn’t dedup it, so they’d have to dedup in their own logic.
When you publish a message, can you force the original message ID?
Re dedup, I don't think any sane eth2 client will rely on libp2p-level dedup - we have a block merkle root by which we identify the payload, both when requesting them and when receiving them from the network - this root is persistent across sessions.
I'd regard that part of the protocol as a nice-to-have optimization, nothing else. In fact, I find it hard to imagine an application that relies on once-only ordered delivery on top of a gossip setting and is correct at the same time.
Perhaps the right thing to do here is simply not to broadcast the message again. It's kind of natural that broadcasts are ephemeral, and trying to get that behavior from a gossip network goes against its grain somewhat.
It does raise an interesting question: how would a sat-link connection with high latency affect the system? How is the cache timeout tuned? The problem can happen naturally, in the wild, as well.
I’m talking about dedup insofar as controlling amplification is concerned @arnetheduck. This is important to prevent cycling.
(Of course apps should ensure idempotency when relying on pubsub.)
yeah, sorry for being unclear there: that's what I was alluding to with the sat-link question - how is the anti-cycling tuned with respect to high-latency links?
Right now it's not adaptive. We should explore this case together ;-) @arnetheduck
Copying over from the ethresearch/p2p Gitter thread:
Kevin Mai-Husan Chia @mhchia 12:07
We can use Validator to validate received content and return a boolean to tell
the pubsub to relay it or not. IMO in the simple cases the current structure
is enough for our usage. However, as the situation pointed out in the
discussion, later blocks might be received before the previous blocks. Then
the Validator run for those "orphan blocks" will be blocked, and the
Validators will time out. Even without the timeout, the number of
concurrent Validators might grow too large.
Raúl Kripalani @raulk 12:11
@mhchia thanks for rescuing that thread! a change to make validation async
would be welcome. it wouldn’t be too difficult. there’s already an
abstraction for datastores, so you would inject the datastore into the
pubsub router, and have it persist messages it is unable to validate
instantaneously, then spawn the validation job and report the result to
the router later. We’d need some form of GC to drop persisted messages
after a grace period, if the validation result never arrived.
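A rough sketch of the GC part, to show what "drop after a grace period" could mean in practice. Assumptions: an in-memory map stands in for the datastore, the 30s `grace` value is made up, and `pendingStore`, `Park`, `Resolve`, `Sweep` and `at` are hypothetical names. Time is passed in explicitly so the behaviour is easy to check.

```go
package main

import (
	"fmt"
	"time"
)

// grace is a made-up grace period for this sketch.
const grace = 30 * time.Second

// pendingStore tracks messages parked while their validation is running.
// A map stands in for the datastore.
type pendingStore struct {
	grace  time.Duration
	parked map[string]time.Time // message ID -> time it was parked
}

func newPendingStore(grace time.Duration) *pendingStore {
	return &pendingStore{grace: grace, parked: make(map[string]time.Time)}
}

// Park records that the message with this ID is awaiting a validation result.
func (s *pendingStore) Park(id string, now time.Time) { s.parked[id] = now }

// Resolve removes id and reports whether it was still pending, i.e. whether
// the validation result arrived before GC dropped the message.
func (s *pendingStore) Resolve(id string) bool {
	_, ok := s.parked[id]
	delete(s.parked, id)
	return ok
}

// Sweep drops every message parked for longer than the grace period and
// returns how many were dropped.
func (s *pendingStore) Sweep(now time.Time) int {
	dropped := 0
	for id, t := range s.parked {
		if now.Sub(t) > s.grace {
			delete(s.parked, id)
			dropped++
		}
	}
	return dropped
}

// at returns a fixed instant sec seconds after the Unix epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	s := newPendingStore(grace)
	s.Park("m1", at(0))
	s.Park("m2", at(20))
	fmt.Println(s.Sweep(at(40)))                  // 1: only m1 exceeded the grace period
	fmt.Println(s.Resolve("m1"), s.Resolve("m2")) // false true
}
```

A late result for `m1` is simply ignored here; the router would treat that message as never validated.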
By popular petition, we need to take this up, see #172. I have a design in mind which I’ll post later as I’m on mobile now.
An async validator feature could look like this:
type AsyncValidationResult struct {
	msg    *pubsub.Message
	result error
}

type AsyncValidator interface {
	// Queue queues a message for future validation. If error is nil, the implementation promises to
	// validate the message and return the result in the supplied channel at a later time.
	//
	// The async validator is responsible for offloading the message from memory when
	// appropriate. It can use a Datastore or some other medium for this.
	Queue(ctx context.Context, msg *pubsub.Message, resp chan<- AsyncValidationResult) error
}
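For illustration, one way an implementation of this interface might look. This is only a sketch under assumptions: `Message` is a local stand-in for `pubsub.Message` so the example compiles on its own, the result fields are exported for the same reason, and `boundedValidator`, `ErrBusy`, `nonEmpty` and `validateOnce` are hypothetical names. It keeps everything in memory and caps the number of in-flight validations, which speaks to the earlier concern about too many concurrent validators; offloading to a datastore is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Message is a stand-in for pubsub.Message.
type Message struct {
	ID   string
	Data []byte
}

type AsyncValidationResult struct {
	Msg    *Message
	Result error
}

type AsyncValidator interface {
	Queue(ctx context.Context, msg *Message, resp chan<- AsyncValidationResult) error
}

// ErrBusy is returned when the in-flight limit is reached.
var ErrBusy = errors.New("too many validations in flight")

// boundedValidator runs each validation in its own goroutine, with a cap
// on how many may run at once.
type boundedValidator struct {
	slots    chan struct{}
	validate func(context.Context, *Message) error
}

func newBoundedValidator(max int, validate func(context.Context, *Message) error) *boundedValidator {
	return &boundedValidator{slots: make(chan struct{}, max), validate: validate}
}

func (v *boundedValidator) Queue(ctx context.Context, msg *Message, resp chan<- AsyncValidationResult) error {
	select {
	case v.slots <- struct{}{}: // reserve a slot
	default:
		return ErrBusy // no promise is made for this message
	}
	go func() {
		err := v.validate(ctx, msg)
		<-v.slots // free the slot before reporting
		select {
		case resp <- AsyncValidationResult{Msg: msg, Result: err}:
		case <-ctx.Done(): // receiver gave up; drop the result
		}
	}()
	return nil
}

// nonEmpty is a toy validation rule.
func nonEmpty(_ context.Context, m *Message) error {
	if len(m.Data) == 0 {
		return errors.New("empty payload")
	}
	return nil
}

// validateOnce queues msg and blocks for its result. A router would
// instead keep reading resp from its event loop.
func validateOnce(v AsyncValidator, msg *Message) error {
	resp := make(chan AsyncValidationResult, 1)
	if err := v.Queue(context.Background(), msg, resp); err != nil {
		return err
	}
	return (<-resp).Result
}

func main() {
	v := newBoundedValidator(1, nonEmpty)
	fmt.Println(validateOnce(v, &Message{ID: "m1", Data: []byte("block")})) // <nil>
	fmt.Println(validateOnce(v, &Message{ID: "m2"}))                        // empty payload
}
```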
We'd need to work out how offloading a message would impact message caches and sliding windows.
The seen cache would be most severely impacted, as messages can be rebroadcast into the network way after the 120s cache duration. We need to consider the effects of this.
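To see why the seen cache is the weak spot, here is a toy model of a TTL-based seen cache. It is not the go-libp2p-pubsub implementation; `seenCache`, `Add` and `at` are hypothetical names, and only the 120s figure comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

// seenTTL mirrors the 120s cache duration mentioned above.
const seenTTL = 120 * time.Second

// seenCache remembers message IDs for a fixed window.
type seenCache struct {
	ttl  time.Duration
	seen map[string]time.Time // message ID -> time it was recorded
}

func newSeenCache(ttl time.Duration) *seenCache {
	return &seenCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// Add records id and reports whether it counts as new, meaning it was
// either never seen or its previous entry has expired.
func (c *seenCache) Add(id string, now time.Time) bool {
	if t, ok := c.seen[id]; ok && now.Sub(t) < c.ttl {
		return false
	}
	c.seen[id] = now
	return true
}

// at returns a fixed instant sec seconds after the Unix epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	c := newSeenCache(seenTTL)
	fmt.Println(c.Add("m1", at(0)))   // true: first arrival
	fmt.Println(c.Add("m1", at(60)))  // false: duplicate inside the window
	fmt.Println(c.Add("m1", at(300))) // true: the window has slid, so the message looks new again
}
```

The third call is the rebroadcast case: a message forwarded after a long validation is accepted and relayed by peers as if it were fresh.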
In terms of structure, we can add an API for forwarding prepared messages (i.e. messages published by someone else, already signed). This way we can offload the message for async validation. When the validator has completed, it can forward the message using the new API.
It removes the default (short) timeout and allows the validators to run arbitrarily long without any need for api changes or complex contraptions.
Note that you need to adjust the time cache duration accordingly.
On the other hand there is still a use case for completely offline validators, which could take days to complete.
From @arnetheduck (Nimbus, ETH 2.0 client):
To summarise:
In a nutshell: is it possible to offload a message from the pubsub router for async validation, then resume its gossiping conditionally?