Open AgeManning opened 2 years ago
Is there a metric or log entry on the length of these queues? It might be nice to see stats on the max length over the last 24 hours and the max length over the node's lifetime.
Not inside gossipsub. We've only just introduced some more advanced metrics in gossipsub but these queues are not part of it.
Lighthouse has a bunch of metrics. According to those and our code, we send in a timely fashion, but we're seeing different results on the network, which leads me to believe it's the lower-layer libp2p that we may need to adjust.
Is this something we still want to do now that we have found the reason behind the late messages?
Description
Although in practice this is a code modification for the rust-libp2p repo, it's useful to discuss and track here as the ramifications will affect Lighthouse.
If a node has limited outbound bandwidth, in principle we could build up enormous queues in the peer `Handler` inside gossipsub. Messages could be queued for a while and, when eventually sent, could be sent very late. This could be one reason we are seeing nodes sending late messages: their outbound bandwidth is limited and we just queue for long periods of time.
There are two solutions I was contemplating:
1. When a message is added to the queue (`send_queue` in handler.rs), we tag it with a timestamp or duration within which it must be sent; otherwise it expires and is dropped. I imagine this could be configurable as a default message timeout, or we could be fancy and implement some system that does it per message. I think it's simpler to have a single timeout for all messages, comparable to a few gossipsub heartbeats. Any messages that can't be processed in this time expire and are dropped, and we can emit an event to the end user indicating this is occurring.

A useful thing to note is that these queues exist per peer, so some peers can be slower than others. The timeout approach in 1. makes expirations uniform regardless of individual peers' connection speeds.
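
To make the timeout idea in 1. concrete, here is a minimal sketch of a per-peer queue whose entries carry a send deadline. It is not the rust-libp2p implementation: `ExpiringQueue`, its methods, and the `dropped` counter are hypothetical names made up for illustration, and the counter only stands in for the event that would be emitted to the end user. The current time is passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical per-peer send queue whose entries expire after a fixed timeout.
/// (Illustrative only; not a type from rust-libp2p.)
pub struct ExpiringQueue<T> {
    /// Single timeout applied to every message, e.g. a few heartbeats.
    timeout: Duration,
    /// Messages paired with the deadline by which they must be sent.
    items: VecDeque<(Instant, T)>,
    /// Number of messages dropped because they expired; a stand-in for
    /// emitting an event to the end user.
    dropped: u64,
}

impl<T> ExpiringQueue<T> {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, items: VecDeque::new(), dropped: 0 }
    }

    /// Queue a message, tagging it with its send deadline (`now + timeout`).
    pub fn push(&mut self, msg: T, now: Instant) {
        self.items.push_back((now + self.timeout, msg));
    }

    /// Return the next message that is still within its deadline,
    /// discarding and counting any expired messages ahead of it.
    pub fn pop(&mut self, now: Instant) -> Option<T> {
        while let Some((deadline, msg)) = self.items.pop_front() {
            if now <= deadline {
                return Some(msg);
            }
            self.dropped += 1;
        }
        None
    }

    /// How many messages have expired and been dropped so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }

    /// Current queue length, including entries that may already be expired.
    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

Because every peer's queue uses the same timeout, a slow peer simply sees more of its messages expire, rather than receiving them arbitrarily late. Expired entries are only discarded lazily in `pop`, so `len` can overcount until the next send attempt; a real implementation might also prune on each heartbeat.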