Towards better visibility, debuggability and diagnostics

raulk commented 5 years ago

The DHT is a pretty central element of the libp2p stack. As our adoption grows, users demand better visibility, debuggability and diagnostics. This issue pulls together ideas we've discussed.

Metrics

We need a way to collect and expose metrics on a per-query basis (and return a stats object as a third return value from methods), as well as global moving aggregates/accumulators that can be queried anytime (or dumped periodically through an exporter like Prometheus).

When looking up a value, how many peers did I query? How many queries were answered with a value vs. with closer peers? What were the min/avg/max RTTs?
When storing a value, how many peers did I store it in?
When looking up a peer, how many peers did I have to ask?
min/avg/max RPC times per message per operation.
failure counting.

Debuggability/diagnostics

Introspective queries like the following will provide better management and diagnostics of the DHT.

What addresses of mine are stored in the DHT? Are they as expected?
What DHT records do I currently hold? Who have I served them to?
When did a record get created? Which peer ID stored the record? When was it last queried?
What provider records do I hold? When do they expire? Are the nodes I'm pointing to still alive?
Dump the routing table. Trace routing table changes.

Some of these require additional bookkeeping. Some are too expensive/voluminous to track by default: they should be switched off OOTB, and users should opt-in explicitly knowing the implications.

Queriability

Collecting this wealth of information would be fruitless if we didn't expose it to the user via tooling. Unfortunately libp2p lacks an instrumentation/monitoring/management subsystem (for now) to serve as a sink for all this data. A transitory, simple solution is to expose these metrics via a local gRPC endpoint or similar, and develop a command line tool (similar to ipfs dht) that serves as a frontend.

anacrolix commented 5 years ago

I think this issue is too general.

What addresses of mine are stored in the DHT? Are they as expected?

This can be done with a query to the DHT to find out.

What DHT records do I currently hold? Who have I served them to?

This can be determined by looking at the appropriate data store.

When did a record get created? Which peer ID stored the record? When was it last queried?

I'm not sure about this. Why? It's a lot of metadata.

What provider records do I hold? When do they expire?

Datastore again.

Are the nodes I'm pointing to still alive?

Routing table.

Dump the routing table.

Routing table.

Trace routing table changes.

This is an interesting one, and very useful for debugging. A logger subsystem, or a few callbacks to allow users to interpret it how they wish would achieve this.
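
As a rough illustration of the callback approach, the sketch below uses a toy routing table, not the real one. `TracedTable`, `Change` and the callback signature are hypothetical names invented for this example; peers are plain strings and bucket indices are supplied by the caller to keep it self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// ChangeKind says whether a peer entered or left the table (hypothetical).
type ChangeKind string

const (
	PeerAdded   ChangeKind = "added"
	PeerRemoved ChangeKind = "removed"
)

// Change is one routing table event handed to the user's callback.
type Change struct {
	Kind   ChangeKind
	Peer   string
	Bucket int
	At     time.Time
}

// TracedTable is a toy stand-in for a routing table that reports every
// membership change through an optional callback.
type TracedTable struct {
	buckets  map[int]map[string]bool
	onChange func(Change)
}

func NewTracedTable(onChange func(Change)) *TracedTable {
	return &TracedTable{buckets: make(map[int]map[string]bool), onChange: onChange}
}

// Add inserts a peer and reports whether the table changed.
func (t *TracedTable) Add(bucket int, peer string) bool {
	if t.buckets[bucket][peer] {
		return false // already present: no event
	}
	if t.buckets[bucket] == nil {
		t.buckets[bucket] = make(map[string]bool)
	}
	t.buckets[bucket][peer] = true
	t.notify(PeerAdded, bucket, peer)
	return true
}

// Remove deletes a peer and reports whether the table changed.
func (t *TracedTable) Remove(bucket int, peer string) bool {
	if !t.buckets[bucket][peer] {
		return false // not present: no event
	}
	delete(t.buckets[bucket], peer)
	t.notify(PeerRemoved, bucket, peer)
	return true
}

func (t *TracedTable) notify(kind ChangeKind, bucket int, peer string) {
	if t.onChange != nil {
		t.onChange(Change{Kind: kind, Peer: peer, Bucket: bucket, At: time.Now()})
	}
}

func main() {
	// The callback decides what a "trace" means: here it just prints.
	rt := NewTracedTable(func(c Change) {
		fmt.Printf("%s bucket=%d peer=%s\n", c.Kind, c.Bucket, c.Peer)
	})
	rt.Add(0, "peerA")
	rt.Add(3, "peerB")
	rt.Remove(0, "peerA")
}
```

The same hook could feed a logger, a metrics counter, or an in-memory ring buffer, which is what makes callbacks more flexible than a fixed log format.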

Can we process the specifics and generate specific issues from this? We need to keep focused.

raulk commented 5 years ago

@anacrolix Sure, go ahead. If you don't mind, just add backlinks from the children issues into this one, so we can treat it as an epic.

Kubuxu commented 5 years ago

One debug metric I have wanted for a long time is the number of items in each k-bucket, exported as a metric. This would help debug or discover possible implementation errors.

raulk commented 5 years ago

@Kubuxu On that subject, take a look at this: https://github.com/libp2p/go-libp2p-kad-dht/issues/194. I can tell you the answer already: 7 furthest buckets are full, 8th is half full, the remaining 248 logical buckets are empty with an extremely high likelihood.

P.S.: But yeah, that metric makes sense as a digest of the full routing table dump.
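
The per-bucket count discussed above could be derived roughly as follows. This is a sketch under stated assumptions: keys are 32-byte values in the Kademlia keyspace, a peer's bucket is its common prefix length (CPL) with the local key, and `BucketCounts` is a hypothetical helper. A real routing table may fold high CPLs into its last bucket, which this sketch does not model.

```go
package main

import (
	"fmt"
	"math/bits"
)

// commonPrefixLen returns the number of leading bits a and b share,
// i.e. the number of leading zero bits of a XOR b.
func commonPrefixLen(a, b [32]byte) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return 256 // identical keys
}

// BucketCounts is a hypothetical helper: it maps each bucket index (CPL
// with the local key) to the number of peers that fall into it.
func BucketCounts(local [32]byte, peers [][32]byte) map[int]int {
	counts := make(map[int]int)
	for _, p := range peers {
		counts[commonPrefixLen(local, p)]++
	}
	return counts
}

func main() {
	var local [32]byte // all-zero local key, for illustration only
	far1 := [32]byte{0x80}
	far2 := [32]byte{0xff}
	near := [32]byte{0x00, 0x80}

	counts := BucketCounts(local, [][32]byte{far1, far2, near})
	for cpl, n := range counts {
		fmt.Printf("bucket %d: %d peers\n", cpl, n)
	}
}
```

Each entry of the resulting map could then be exported as a gauge labelled with the bucket index.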

anacrolix commented 5 years ago

Can we close this and create a metrics label? Super issues are too fluffy and conversation will be interleaved across different metrics.

raulk commented 5 years ago

Let’s do both. Keep this one as an epic that serves like a user/passer-by entrypoint for discussion. Also open issues for the specific stuff we’ve decided to implement. I like the label.

anacrolix commented 5 years ago

All the metrics stuff can be addressed by #252, #300, and #297.

anacrolix commented 5 years ago

A list of metrics is tracked in #304.

daviddias commented 4 years ago

What's the overall state of metrics in libp2p? Right now I'm especially interested in two: https://discuss.libp2p.io/t/how-to-know-of-peers-dialed-of-dials-failed-per-each-find-peers-find-providers-query/341/4

daviddias commented 4 years ago

For reference, here is the URL of the docs for the Stats API in js-libp2p that @pgte created a long time ago: https://github.com/libp2p/js-libp2p#switch-stats-api

daviddias commented 4 years ago

Can we get the per-query metrics at https://github.com/libp2p/go-libp2p-kad-dht/blob/master/query.go#L106-L110 exported? It would help me understand the efficiency of our routing.

raulk commented 4 years ago

@daviddias those details would be part of a trace, because they are transactional metrics, i.e. they pertain to a particular transaction in the system. I don't think there's much value in calculating averages, counts and percentile distributions globally (which is what OpenCensus metrics are about -- runtime stats).

daviddias commented 4 years ago

@daviddias those details would be part of a trace, because they are transactional metrics, i.e. they pertain to a particular transaction in the system.

That would work for the use cases I can think of 👍

Update: Ah! When I said export, I wasn't thinking of it in the "export from the Golang package" sense. I was just looking to have access to the information, hence a trace would be perfect!
