paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/
Apache License 2.0
3.85k stars 1.11k forks source link

Tracking: discovery quality metrics #3665

Open Rjected opened 1 year ago

Rjected commented 1 year ago

Describe the feature

I was very impressed by https://github.com/ethereum/go-ethereum/pull/27621 and thought it would be a good idea for us to replicate some of these metrics. This tracks what we would need to add in reth, and what we would need to add in the existing grafana dashboard. Each of these tasks are likely small, so anyone who would like to take one should let me know, and I'll create an issue to track the individual task!

Reth - Peer Discovery dashboard

Discovery

We would need an additional metric for nodes in kbuckets, and graph in grafana:

### Discovery Metrics
- [ ] `discover/bucket/{index}/count` - the number of nodes in bucket index
- [x] Dashboard for the kbuckets metric: https://reth.paradigm.xyz/d/fd2d69b5-ca32-45d0-946e-c00ddcd7052c/reth---peer-discovery?orgId=1&refresh=30s&from=1724829418332&to=1724833018332&viewPanel=198; https://reth.paradigm.xyz/d/fd2d69b5-ca32-45d0-946e-c00ddcd7052c/reth---peer-discovery?orgId=1&refresh=30s&from=1724829458983&to=1724833058983&viewPanel=201

Reth - Transaction Pool dashboard

eth handshake

We would need to create metrics for eth handshake errors that map to these:

### Eth protocol metrics
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/peer` - unexpected peer behavior
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/timeout` - handshake timeout
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/network` - wrong network id
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/version` - wrong protocol version
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/genesis` - wrong genesis
- [ ] `eth/protocols/eth/{ingress|egress}/handshake/error/forkid` - wrong forkid

p2p dials

We already have some error metrics: https://github.com/paradigmxyz/reth/blob/526f624e1cbfd659184873bffe12ec602678b57f/crates/net/network/src/metrics.rs#L64-L110

So we should make sure that the error metrics are in the grafana dashboard and named appropriately, and add the metrics which we don't already have:

### P2P dial metrics
- [ ] `p2p/dials/error/connection` - unable to initiate TCP connection to target
- [ ] `p2p/dials/error/saturated` - local client is already connected to its maximum number of peers
- [ ] `p2p/dials/error/known` - dialing an already connected peer
- [ ] `p2p/dials/error/self` - dialing the local node's id
- [ ] `p2p/dials/error/useless` - dialing a peer that doesn't share an capabilities with the local node
- [ ] `p2p/dials/error/id/unexpected` - dialed peer repsoned with different id than expected
- [ ] `p2p/dials/error/rlpx/enc` - error negotiating the rlpx encryption during dial
- [ ] `p2p/dials/error/rlpx/proto` - error during rlpx protocol handshake`
- [ ] https://github.com/paradigmxyz/reth/issues/3667

Out of these, I don't think we have a metric similar to p2p/dials/success.

Grafana

In addition to these, it would be nice to create a graph similar to the Dial Quality graph from the original PR: image

Additional context

No response

maxwolff commented 1 year ago

im down to help out with this.

gakonst commented 1 year ago

Go for it Max - and yes it is insane how high quality these geth metrics are

maxwolff commented 1 year ago

hey actually a bit confused. DisconnectMetrics gets incremented both for incoming and outgoing dials. "dial metrics" seems to imply it is only relevant for outgoing dials

maybe instead we should keep one set of disconnect metrics on network/src/manager.rs for outgoing, one for incoming, and then add the remaining metrics for outgoing dials (rlpx, connection)? otherwise its a bit logically inconsistent.

eg maybe

pub struct NetworkManager<C> {
..
    metrics: NetworkMetrics,
    incoming_disconnect_metrics: DisconnectMetrics,
    outgoing_disconnect_metrics: DisconnectMetrics,
    dial_metrics: DialMetrics

@Rjected

Rjected commented 8 months ago

hey actually a bit confused. DisconnectMetrics gets incremented both for incoming and outgoing dials. "dial metrics" seems to imply it is only relevant for outgoing dials

maybe instead we should keep one set of disconnect metrics on network/src/manager.rs for outgoing, one for incoming, and then add the remaining metrics for outgoing dials (rlpx, connection)? otherwise its a bit logically inconsistent.

eg maybe

pub struct NetworkManager<C> {
..
    metrics: NetworkMetrics,
    incoming_disconnect_metrics: DisconnectMetrics,
    outgoing_disconnect_metrics: DisconnectMetrics,
    dial_metrics: DialMetrics

@Rjected

That makes sense. BTW just unassigned, because having people assigned on larger issues like this seems to discourage others from working on them, so I'm taking a more FCFS approach to reviewing PRs

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 21 days with no activity.

emnul commented 1 month ago

I'm working on adding eth handshake error metrics. Looks like the changes made in https://github.com/paradigmxyz/reth/pull/3729 were never added to the etc/grafana/dashboards/overview.json file. I can open a separate PR to add those metrics.

emhane commented 1 month ago

I'm working on adding eth handshake error metrics. Looks like the changes made in #3729 were never added to the etc/grafana/dashboards/overview.json file. I can open a separate PR to add those metrics.

that is an RLPx metric! it needs to go in the tx pool dashboard not the discovery dashboard cc @Rjected

emhane commented 1 month ago

all issues here need to take into account https://github.com/paradigmxyz/reth/issues/8150

kuronosec commented 1 month ago

I am applying to this issue via OnlyDust platform.

My background and how it can be leveraged

Hello guys, I have an extensive experience on devops, security and monitoring. For instance, I am the lead developer of Tikuna (https://tikuna.io/) a security monitoring tool for Ethereum. I think I could help implementing one or several of the measurement metrics.

How I plan on tackling this issue

I could use as example the provided go-ethereum code and provide similar implementations on Reth plus the required Grafana dashboards.