paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0

Add support for bandwidth monitoring #570

Closed mattsse closed 1 year ago

mattsse commented 1 year ago

Describe the feature

In order to provide bandwidth metrics, we need to monitor all incoming and outgoing packets, something like:

use std::sync::atomic::AtomicU64;

/// Running totals of bytes read from and written to the network.
pub struct Bandwidth {
    inbound: AtomicU64,
    outbound: AtomicU64,
}

The easiest way is probably to instrument the TcpStream, or rather wrap it behind AsyncRead and AsyncWrite bounds, and simply increment the counters.
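
For illustration, here's a minimal sketch of such a wrapper over tokio's AsyncRead/AsyncWrite traits (the MeteredStream name and shape are hypothetical, not actual reth types; it only shows the counting technique):

use std::io;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};

/// Shared byte counters (the `Bandwidth` struct from above, repeated so the sketch is self-contained).
#[derive(Default)]
pub struct Bandwidth {
    inbound: AtomicU64,
    outbound: AtomicU64,
}

/// Wraps any stream and counts the bytes that pass through it.
pub struct MeteredStream<S> {
    inner: S,
    bandwidth: Arc<Bandwidth>,
}

impl<S: AsyncRead + Unpin> AsyncRead for MeteredStream<S> {
    fn poll_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<()>> {
        let before = buf.filled().len();
        let res = Pin::new(&mut self.inner).poll_read(cx, buf);
        if let Poll::Ready(Ok(())) = &res {
            // Count only the bytes this poll actually filled.
            let read = (buf.filled().len() - before) as u64;
            self.bandwidth.inbound.fetch_add(read, Ordering::Relaxed);
        }
        res
    }
}

impl<S: AsyncWrite + Unpin> AsyncWrite for MeteredStream<S> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        let res = Pin::new(&mut self.inner).poll_write(cx, buf);
        if let Poll::Ready(Ok(written)) = &res {
            self.bandwidth.outbound.fetch_add(*written as u64, Ordering::Relaxed);
        }
        res
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }
}

Because the wrapper is generic over AsyncRead + AsyncWrite rather than TcpStream specifically, it also works for any encrypted or multiplexed stream layered on top of the socket.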

Additional context

No response

akirillo commented 1 year ago

Looking into this, taking inspiration from https://github.com/libp2p/rust-libp2p/blob/76abab9e20c29d03f787f39565de38dff42b0f80/src/bandwidth.rs#L135-L141

akirillo commented 1 year ago

Update on this: taking some time to read about futures and async programming in Rust generally, here: https://rust-lang.github.io/async-book/01_getting_started/01_chapter.html

akirillo commented 1 year ago

In rust-libp2p, they have a StreamMuxer abstraction that enables opening "substreams." Instrumenting the StreamMuxer with bandwidth monitoring is, I believe, how they get total bandwidth monitoring across all streams.

Do we want to pursue a similar pattern to get a sense of the node's total bandwidth usage? I'm not sure how much refactoring this implies; something to the tune of attaching all of the network's streams to such a "muxer" struct.

If not, and we're okay with instrumenting individual streams with bandwidth monitoring, how do we envision these metrics getting consumed? Output via Prometheus? Log? Should they be summed somewhere?

gakonst commented 1 year ago

Do we want to pursue a similar pattern to get a sense of the node's total bandwidth usage? I'm not sure how much refactoring this implies; something to the tune of attaching all of the network's streams to such a "muxer" struct.

I think for total bandwidth usage we can use common network debugging tools like iftop.

If not, and we're okay with instrumenting individual streams with bandwidth monitoring, how do we envision these metrics getting consumed? Output via Prometheus? Log? Should they be summed somewhere?

I think probably Prometheus, right? We'd basically like to get outputs similar to iftop, but on a per-stream basis? It's also possible that if we can observe the bandwidth inside our code, we can also modify peers' reputation that way?

akirillo commented 1 year ago

I think probably Prometheus, right? We'd basically like to get outputs similar to iftop, but on a per-stream basis? It's also possible that if we can observe the bandwidth inside our code, we can also modify peers' reputation that way?

Sounds good - I'll integrate my current code with the metrics crate / derives, etc.

akirillo commented 1 year ago

Hmm, @mattsse, does this mean the struct definition should look like:

#[derive(Metrics)]
#[metrics(scope = "bandwidth")]
pub struct BandwidthMetrics {
    inbound: Counter,
    outbound: Counter,
}

Seems like how we use metrics, but I have two concerns:

  1. Will this be able to account for multiple instances of the struct (i.e. one per instrumented stream)?
  2. I see that CounterFn is implemented for AtomicU64, but how do we make sure that's used under the hood by Counter?

onbjerg commented 1 year ago

Ref https://github.com/paradigmxyz/reth/issues/140

mattsse commented 1 year ago

The metrics will be separate and should be a Gauge that we then periodically set based on the total bytes read/written.

So all sessions update the bytes read/written in the shared atomics, and the metrics are updated with these numbers:

Current throughput:

throughput = (total_bytes_now - total_bytes_last) / interval
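
For illustration, a minimal sketch of such an update task (assuming the shared Bandwidth atomics from above and already-registered metrics::Gauge handles; the function name and signature are hypothetical):

use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::time::Duration;

/// Periodically converts the shared byte totals into throughput gauges
/// (bytes per second), following (total_bytes_now - total_bytes_last) / interval.
async fn update_bandwidth_metrics(
    bandwidth: Arc<Bandwidth>,
    inbound_throughput: metrics::Gauge,
    outbound_throughput: metrics::Gauge,
    interval: Duration,
) {
    let mut last_in = 0u64;
    let mut last_out = 0u64;
    let mut ticker = tokio::time::interval(interval);
    loop {
        ticker.tick().await;
        let now_in = bandwidth.inbound.load(Ordering::Relaxed);
        let now_out = bandwidth.outbound.load(Ordering::Relaxed);
        // Difference since the last tick, scaled to bytes per second.
        inbound_throughput.set((now_in - last_in) as f64 / interval.as_secs_f64());
        outbound_throughput.set((now_out - last_out) as f64 / interval.as_secs_f64());
        last_in = now_in;
        last_out = now_out;
    }
}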

akirillo commented 1 year ago

Gotcha, that can be handled separately in #140 then (thanks for referencing that!)

akirillo commented 1 year ago

Would we need a separate Gauge for each stream we want to meter (as we will have separate BandwidthMeters for each stream), or would this all bubble up to a top-level Gauge that meters the throughput across the whole node?

If it's separate Gauges, we need a way to describe() each instance differently, right?

akirillo commented 1 year ago

Either way, drafting an approach in #726

gakonst commented 1 year ago

Also assigning @onbjerg to work on this with you, since he was going to pick this up early next week once he's done with the PM scoping.