status-im / telemetry

Opt-in message reliability metrics service

Discovery metrics #21

Open fryorcraken opened 1 week ago

fryorcraken commented 1 week ago
  1. Report number of nodes found, per discovery strategy (if possible)
  2. Report number of nodes successfully connected to
  3. Report number of nodes that were freshly discovered but connection failed
  4. Report whether own node is marked as discoverable. Knowing the node's own TCP port seems to be the best bet.
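
For illustration, the four points above could be carried in a single report like the one sketched below. This is only a sketch: the type name, field names, JSON keys, and the "peerExchange" strategy label are assumptions made for the example, not the telemetry service's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DiscoveryReport is a hypothetical payload covering the four points above.
// None of these names come from the telemetry service itself.
type DiscoveryReport struct {
	// Number of nodes found, keyed by discovery strategy (e.g. "discv5").
	NodesFound map[string]int `json:"nodesFound"`
	// Number of nodes successfully connected to.
	Connected int `json:"connected"`
	// Number of freshly discovered nodes whose connection attempt failed.
	FreshConnFailed int `json:"freshConnFailed"`
	// Whether the reporting node considers itself discoverable.
	Discoverable bool `json:"discoverable"`
}

// encodeReport serializes a report to JSON for submission.
func encodeReport(r DiscoveryReport) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeReport(DiscoveryReport{
		NodesFound:      map[string]int{"discv5": 12, "peerExchange": 4},
		Connected:       9,
		FreshConnFailed: 3,
		Discoverable:    true,
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```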
chaitanyaprem commented 1 week ago

Sounds good. We need to make sure that point 1 reports the number of unique nodes found, not the same ones counted repeatedly (which has been noticed with the current discv5).
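
A minimal sketch of counting unique nodes per strategy, assuming each discovered node can be identified by a stable string such as its peer ID. The `uniqueCounter` type and its methods are hypothetical names for this example, and the sketch is not safe for concurrent use without a lock.

```go
package main

import "fmt"

// uniqueCounter tracks the distinct peer IDs seen per discovery strategy,
// so a peer returned repeatedly by the same strategy is counted once.
type uniqueCounter struct {
	seen map[string]map[string]struct{}
}

func newUniqueCounter() *uniqueCounter {
	return &uniqueCounter{seen: make(map[string]map[string]struct{})}
}

// Observe records a discovered peer and reports whether it was new
// for the given strategy.
func (c *uniqueCounter) Observe(strategy, peerID string) bool {
	peers, ok := c.seen[strategy]
	if !ok {
		peers = make(map[string]struct{})
		c.seen[strategy] = peers
	}
	if _, dup := peers[peerID]; dup {
		return false
	}
	peers[peerID] = struct{}{}
	return true
}

// Count returns the number of unique peers seen for a strategy.
func (c *uniqueCounter) Count(strategy string) int {
	return len(c.seen[strategy])
}

func main() {
	c := newUniqueCounter()
	for _, id := range []string{"peerA", "peerB", "peerA"} {
		c.Observe("discv5", id)
	}
	fmt.Println(c.Count("discv5"))
}
```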

Regarding point 4: we can report this based on the response from AutoNAT, which is not exposed to status-go as of now. This could be included in go-waku as an API, or maybe we can access the libp2p API directly. @richard-ramos has a better idea about this.

richard-ramos commented 1 week ago

Report whether own node is marked as discoverable. Knowing the node's own TCP port seems to be the best bet.

There are different sources to know if your node is discoverable or not:

  1. Subscribing to reachability changes from go-libp2p: a node can be Private or Public, with Private nodes still being accessible through circuit relay. Public nodes have an external IP address.
  2. Having a circuit relay address in the ENR.

It's also worth taking into account that a node's reachability can change during runtime (for example, if you switch networks or if the circuit relay node goes offline).
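
A rough sketch of how the two sources could be combined and re-reported whenever the result changes. It deliberately does not import go-libp2p: the `Reachability` enum only mirrors the Unknown/Public/Private states, and `DiscoverabilityState`, `Tracker`, and the rule "Public, or a circuit relay address in the ENR" are assumptions made for this example.

```go
package main

import "fmt"

// Reachability mirrors the Unknown/Public/Private states described above.
// It is a local stand-in, not the go-libp2p type.
type Reachability int

const (
	ReachabilityUnknown Reachability = iota
	ReachabilityPublic
	ReachabilityPrivate
)

// DiscoverabilityState combines both sources mentioned above.
type DiscoverabilityState struct {
	Reachability      Reachability
	HasRelayAddrInENR bool
}

// Discoverable applies the assumed rule: Public nodes are discoverable,
// and so are nodes that advertise a circuit relay address.
func (s DiscoverabilityState) Discoverable() bool {
	return s.Reachability == ReachabilityPublic || s.HasRelayAddrInENR
}

// Tracker calls report only when the discoverable flag flips, which covers
// runtime changes such as switching networks or losing the relay node.
type Tracker struct {
	state  DiscoverabilityState
	report func(discoverable bool)
}

// Update stores the new state and reports if discoverability changed.
func (t *Tracker) Update(next DiscoverabilityState) {
	changed := next.Discoverable() != t.state.Discoverable()
	t.state = next
	if changed {
		t.report(next.Discoverable())
	}
}

func main() {
	tr := &Tracker{report: func(d bool) { fmt.Println("discoverable:", d) }}
	tr.Update(DiscoverabilityState{Reachability: ReachabilityPublic})
	// Switching networks: now behind NAT with no relay address.
	tr.Update(DiscoverabilityState{Reachability: ReachabilityPrivate})
	// A circuit relay address is added to the ENR.
	tr.Update(DiscoverabilityState{Reachability: ReachabilityPrivate, HasRelayAddrInENR: true})
}
```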

fryorcraken commented 1 week ago

I think it would be good to also add:

  1. whether a node was able to get an external port and IP via NAT negotiation
  2. whether hole punching worked via circuit relay
adklempner commented 5 days ago

An interesting idea brought up by @danisharora099 is that if a node has a bunch of discovered peers but cannot establish any connections, the telemetry service can also try connecting to those peers to determine if they are truly inaccessible, or if the node reporting the metrics is misconfigured, or if the peers are just not reachable from the node's environment

chaitanyaprem commented 2 days ago

An interesting idea brought up by @danisharora099 is that if a node has a bunch of discovered peers but cannot establish any connections, the telemetry service can also try connecting to those peers to determine if they are truly inaccessible, or if the node reporting the metrics is misconfigured, or if the peers are just not reachable from the node's environment

Maybe the telemetry service doesn't need to do this. Rather, it can just collect data from the various nodes that report the connectivity status of a peer and record that information, i.e. how many nodes were able to connect to it and how many failed. It is possible that the peer has reached its connection limit and is therefore dropping connections, or there may be some other reason. We can probably deduce such information by gathering it from other nodes rather than having the telemetry service do this itself. It may also be the case that the ip-colocation-limit has been reached, due to which the node is rejecting or dropping connections.

We have many safety checks like this to prevent a node from being targeted.
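
The aggregation idea could look roughly like the sketch below. The `ConnReport` and `PeerStats` types and the `aggregate` function are hypothetical, and keeping only the latest report per reporter and peer is an assumption of this example, made so that a retrying reporter is not counted twice.

```go
package main

import "fmt"

// ConnReport is one node's report about one connection attempt.
type ConnReport struct {
	Reporter string // node that attempted the connection
	Peer     string // peer it tried to reach
	Success  bool
}

// PeerStats counts how many distinct reporters could or could not connect.
type PeerStats struct {
	Succeeded int
	Failed    int
}

// aggregate keeps the latest report per (reporter, peer) pair and then
// tallies successes and failures for each peer.
func aggregate(reports []ConnReport) map[string]PeerStats {
	type pair struct{ reporter, peer string }
	latest := make(map[pair]bool)
	for _, r := range reports {
		latest[pair{r.Reporter, r.Peer}] = r.Success
	}
	stats := make(map[string]PeerStats)
	for k, ok := range latest {
		s := stats[k.peer]
		if ok {
			s.Succeeded++
		} else {
			s.Failed++
		}
		stats[k.peer] = s
	}
	return stats
}

func main() {
	stats := aggregate([]ConnReport{
		{Reporter: "node1", Peer: "peerX", Success: false},
		{Reporter: "node2", Peer: "peerX", Success: true},
		{Reporter: "node3", Peer: "peerY", Success: false},
	})
	for peer, s := range stats {
		fmt.Printf("%s: %d succeeded, %d failed\n", peer, s.Succeeded, s.Failed)
	}
}
```

A peer that fails for every reporter is more likely to be truly inaccessible, while a peer that fails for only one reporter points at that reporter's configuration or environment. A mixed result could also reflect connection limits on the peer's side.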