sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Gossipsub message queuing #2989

Open AgeManning opened 2 years ago

AgeManning commented 2 years ago

Description

Although in practice this is a code modification for the rust-libp2p repo, it's useful to discuss and track here as the ramifications will affect Lighthouse.

If a node has limited outbound bandwidth, in principle we could build up enormous queues in the peer Handler inside gossipsub. Messages could sit in the queue for a long time and, when eventually sent, go out very late.

This could be one reason we are seeing nodes sending late messages: their outbound bandwidth is limited and messages simply sit in the queue for long periods of time.

There are two solutions I was contemplating:

  1. For each message in a handler's message queue (called send_queue in handler.rs) we tag it with a timestamp or a duration within which it must be sent; otherwise it expires and is dropped. I imagine this could be configurable as a default message timeout, or we could be fancy and implement some system that does it per-message. I think it's simpler to have a single timeout for all messages, comparable to a few gossipsub heartbeats. Any messages that can't be processed in this time expire and are dropped, and we can emit an event to the end-user indicating that this is occurring.
  2. We bound the handler's outbound message queue. Overflowing this bound would result in dropping new messages. It's a simpler approach than 1., but I think it's in our best interest (and gossipsub's in general) to prioritise lively messages (i.e. option 1) rather than prioritise by when a message was published (option 2).

A useful thing to note is that these queues exist per-peer. So some peers can be slower than others, and the timeout approach in 1. makes expirations uniform regardless of individual peers' connection speeds.
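
To make the two options concrete, here is a rough sketch of a per-peer queue that combines them. It is not the actual rust-libp2p handler code: `SendQueue`, `push`, `pop_live` and the `timeout`/`capacity` fields are hypothetical names used only for illustration, and the caller passes in the current time so the behaviour is easy to reason about.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical per-peer outbound queue (not the real gossipsub handler type).
pub struct SendQueue<M> {
    /// Each message is stored with the instant at which it was queued.
    queue: VecDeque<(Instant, M)>,
    /// Option 1: maximum time a message may wait before it is dropped.
    timeout: Duration,
    /// Option 2: maximum number of queued messages.
    capacity: usize,
}

impl<M> SendQueue<M> {
    pub fn new(timeout: Duration, capacity: usize) -> Self {
        Self { queue: VecDeque::new(), timeout, capacity }
    }

    /// Queue a message stamped with `now`.
    /// If the queue is full, the new message is rejected and handed back (option 2).
    pub fn push(&mut self, msg: M, now: Instant) -> Result<(), M> {
        if self.queue.len() >= self.capacity {
            return Err(msg);
        }
        self.queue.push_back((now, msg));
        Ok(())
    }

    /// Return the oldest message that is still within the timeout (option 1).
    /// Messages that waited too long are moved into `expired`, so the caller
    /// can report them (e.g. by emitting an event to the end-user).
    pub fn pop_live(&mut self, now: Instant, expired: &mut Vec<M>) -> Option<M> {
        while let Some((queued_at, msg)) = self.queue.pop_front() {
            if now.saturating_duration_since(queued_at) > self.timeout {
                expired.push(msg);
            } else {
                return Some(msg);
            }
        }
        None
    }

    /// Number of messages currently queued.
    pub fn len(&self) -> usize {
        self.queue.len()
    }
}
```

Because the timeout is checked against the time each message was queued, a slow peer and a fast peer drop messages of the same age, which is the uniformity property described above. The capacity check alone would instead keep the oldest messages and reject the newest ones.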

winksaville commented 2 years ago

Is there a metric or log entry on the length of these queues? It might be nice to see stats on the max length over the last 24 hours and max length life time.
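
As a rough illustration of the kind of stat being asked about, the sketch below tracks a lifetime maximum and a sliding-window maximum of a queue length. Nothing like this exists in gossipsub according to this thread; `QueueLenStats` and its methods are hypothetical names, and the caller is assumed to call `record` whenever the queue length changes.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical tracker for the maximum observed queue length.
pub struct QueueLenStats {
    lifetime_max: usize,
    /// Length of the sliding window, e.g. 24 hours.
    window: Duration,
    /// Samples with strictly decreasing lengths; the front is the window maximum.
    samples: VecDeque<(Instant, usize)>,
}

impl QueueLenStats {
    pub fn new(window: Duration) -> Self {
        Self { lifetime_max: 0, window, samples: VecDeque::new() }
    }

    /// Record the current queue length observed at `now`.
    pub fn record(&mut self, len: usize, now: Instant) {
        self.lifetime_max = self.lifetime_max.max(len);
        // Older samples that are not larger than `len` can never be the maximum again.
        while matches!(self.samples.back(), Some(&(_, l)) if l <= len) {
            self.samples.pop_back();
        }
        self.samples.push_back((now, len));
        self.evict(now);
    }

    /// Drop samples that have fallen out of the window.
    fn evict(&mut self, now: Instant) {
        while matches!(self.samples.front(),
            Some(&(t, _)) if now.saturating_duration_since(t) > self.window)
        {
            self.samples.pop_front();
        }
    }

    /// Maximum length recorded within the window ending at `now` (0 if none).
    pub fn window_max(&mut self, now: Instant) -> usize {
        self.evict(now);
        self.samples.front().map_or(0, |&(_, l)| l)
    }

    /// Maximum length recorded since the tracker was created.
    pub fn lifetime_max(&self) -> usize {
        self.lifetime_max
    }
}
```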

AgeManning commented 2 years ago

Not inside gossipsub. We've only just introduced some more advanced metrics in gossipsub but these queues are not part of it.

Lighthouse has a bunch of metrics, and according to those and our code, we send in a timely fashion. However, we're seeing different results on the network, leading me to believe it's the lower-layer libp2p that we may need to adjust.

divagant-martian commented 2 years ago

Is this something we still want to do now that we have found the reason behind the late messages?