Olshansk commented 2 years ago

Objective

Add distributed tracing through the V1 Node.

Origin Document

Distributed tracing is a common pattern for tracking the lifecycle and movement of a request through a codebase and across infrastructure.

Goals / Deliverables

[ ] Pick a tool to use for distributed tracing (Jaeger, Zipkin, monkit, etc...)
[ ] Identify a set of initial entry/exit points where it can be injected or removed
[ ] Design how a tracing ID should be implemented in the V1 codebase
[ ] Implement the design determined above
[ ] Build the necessary logging utilities needed to track, filter and have visibility into it
[ ] Compile a list of follow up telemetry/tracing needs after this is done

General issue checklist

[ ] Update the appropriate CHANGELOG
[ ] Update the README
[ ] Update the source code tree explanation
[ ] Add a new sequence or flowchart diagram using mermaid
[ ] Update any relevant global documentation & references
[ ] Document small issues / TODOs along the way

Non-goals

Exhaustive tracing through the entire codebase and across all the infrastructure we'll need. This ticket is a starting point and it is implicit that gaps will remain.

Testing Methodology

- TODO: Need to define this

All tests: make test_all
LocalNet: verify a LocalNet is still functioning correctly by following the instructions at docs/development/README.md

Creator: @Olshansk
Co-Owners: @okdas @phthan0

bryanchriswhite commented 1 year ago

I recently had a really positive experience with OpenTelemetry for distributed tracing and SigNoz as an "APM" (application performance monitoring) dashboard, and wanted to share. The OpenTelemetry ecosystem contains "automatic instrumentation" libraries for several languages / protocols / frameworks. There's no such official library for Golang as far as I could tell, but there is a community library that supports stdlib net/http, gRPC, and Gorilla mux (Apache 2 license). I don't think there's anything there for us to take advantage of yet, but we could be in a position to contribute auto instrumentation for libp2p, for example.

Olshansk commented 1 year ago

For whoever picks this up in the future, I'd probably lean more on @bryanchriswhite's suggestion, but just wanted to document a tool I came across for future reference.

Odigos - Observability Control Plane claims to "Generate distributed traces instantly for any application without code changes" and they have support for Prometheus & Loki.

Also, in my opinion, the hardest (still undefined) part of this ticket is determining the "entry & exit points + scope" of the distributed tracing ID so that it is useful within a single pocket node, and potentially across different actors/nodes as well.
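
To make the "across different actors/nodes" part concrete, here is a minimal stdlib-only sketch of carrying a tracing ID over a node boundary. It is an illustration under assumptions, not a design for V1: the `TraceContext` type and the `NewTraceContext`, `Inject` and `Extract` names are hypothetical, and the string layout is modeled on the W3C Trace Context `traceparent` header (version, 32-hex trace ID, 16-hex span ID, flags).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceContext is a hypothetical carrier for the IDs that cross a node boundary.
type TraceContext struct {
	TraceID string // 32 hex chars, shared by every hop of one request
	SpanID  string // 16 hex chars, identifies the current hop
}

// randomHex returns n random bytes encoded as 2n hex characters.
func randomHex(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// NewTraceContext starts a trace at an entry point (for example an incoming RPC).
func NewTraceContext() (TraceContext, error) {
	traceID, err := randomHex(16)
	if err != nil {
		return TraceContext{}, err
	}
	spanID, err := randomHex(8)
	if err != nil {
		return TraceContext{}, err
	}
	return TraceContext{TraceID: traceID, SpanID: spanID}, nil
}

// Inject renders the context as a traceparent-style value at an exit point.
func (tc TraceContext) Inject() string {
	return fmt.Sprintf("00-%s-%s-01", tc.TraceID, tc.SpanID)
}

// Extract parses a traceparent-style value received from another node.
func Extract(header string) (TraceContext, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, errors.New("malformed traceparent")
	}
	for _, p := range parts[1:3] {
		if _, err := hex.DecodeString(p); err != nil {
			return TraceContext{}, errors.New("traceparent IDs must be hex")
		}
	}
	return TraceContext{TraceID: parts[1], SpanID: parts[2]}, nil
}

func main() {
	tc, err := NewTraceContext()
	if err != nil {
		panic(err)
	}
	received, err := Extract(tc.Inject()) // simulate sending to a peer
	if err != nil {
		panic(err)
	}
	fmt.Println(received.TraceID == tc.TraceID)
}
```

In this sketch the entry point is wherever `NewTraceContext` or `Extract` is called and the exit point is wherever `Inject` is called, so choosing those call sites is the scoping question raised above.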

h5law commented 1 year ago

@bryanchriswhite @Olshansk Although automatic instrumentation is not supported in Go, from a quick look it seems the Golang OTel library is stable, but only for traces. This seems fair as this PR is focused on distributed tracing, but it may be a problem if we wish to incorporate logging into the same workflow down the line, as they mention in the repo that log support is not planned until metrics support is used in addition to traces. They also support exporting traces to Jaeger, Zipkin and OTLP. The exact tracing platform could be decided upon once we have agreement on whether we will use this library (I do +1 @bryanchriswhite on this choice).

I would be interested in taking this on. I have previously worked with Cribl, Splunk and Dynatrace using the "3 pillars of observability", but I would need to do some more research on implementing it on our end.

As for determining the entry/exit points + scope of the tracing, this is a big thing. My thoughts on this could be to:
1) implement tracing as its own "module" of sorts that could be plugged in when and where it is needed
2) identify the first ideal entry/exit point that is relatively simple
3) in a follow-up PR, expand upon the above by integrating tracing throughout the codebase
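
As a rough illustration of the "pluggable module" idea in step 1, here is a minimal stdlib-only sketch. Everything in it is an assumption for discussion: the `Tracer` interface, `NoopTracer`, `RecordingTracer` and `HandleMessage` are hypothetical names, not existing V1 or OpenTelemetry APIs. A real implementation would likely wrap opentelemetry-go behind the same kind of interface.

```go
package main

import (
	"fmt"
	"sync"
)

// Tracer is a hypothetical interface the rest of the node would depend on.
type Tracer interface {
	// StartSpan marks an entry point and returns a func that marks the exit point.
	StartSpan(name string) (end func())
}

// NoopTracer is the default when tracing is not plugged in; it does nothing.
type NoopTracer struct{}

func (NoopTracer) StartSpan(string) func() { return func() {} }

// RecordingTracer keeps events in memory, standing in for a real exporter.
type RecordingTracer struct {
	mu     sync.Mutex
	Events []string
}

// record appends one event under the lock so concurrent spans are safe.
func (r *RecordingTracer) record(event string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.Events = append(r.Events, event)
}

func (r *RecordingTracer) StartSpan(name string) func() {
	r.record("start " + name)
	return func() { r.record("end " + name) }
}

// HandleMessage is a placeholder for the first, relatively simple entry/exit point.
func HandleMessage(t Tracer, msg string) int {
	end := t.StartSpan("HandleMessage")
	defer end()
	return len(msg)
}

func main() {
	rec := &RecordingTracer{}
	HandleMessage(rec, "hello")
	fmt.Println(rec.Events) // [start HandleMessage end HandleMessage]

	// The same call site works with tracing switched off.
	HandleMessage(NoopTracer{}, "hello")
}
```

Keeping call sites dependent only on the interface means step 3 (wider integration) would not need to change code written for step 2.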

This seems like quite a large PR but could be a lot of fun, and a good opportunity to see where the community would like more observability into what is going on with their nodes etc. That could also be a good way to find the ideal entry/exit points in general.

Odigos - Observability Control Plane claims to "Generate distributed traces instantly for any application without code changes" and they have support for Prometheus & Loki.

This is an interesting project that seems to utilise OTel anyway. I will continue to look into whether it is a better solution. The "no code changes" claim sounds a little too good to be true 😅

Interesting links:
[1] https://opentelemetry.io/docs/instrumentation/go/
[2] https://opentelemetry.io/docs/instrumentation/go/getting-started/
[3] https://github.com/keyval-dev/opentelemetry-go-instrumentation
[4] https://github.com/exaring/otelpgx

Olshansk commented 1 year ago

opentelemetry-go definitely seems like a good approach here, and something we should revisit when picking this up.

I do think that this is a slightly lower priority for now. @deblasis and @bryanchriswhite are adding Context in the p2p & persistence libraries, and looking at the documentation, it seems like opentelemetry-go may rely heavily on it:

_, span := otel.Tracer(name).Start(ctx, "Poll")
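
To show why Context matters here without pulling in the library, a minimal stdlib-only sketch of a trace ID riding along in a `context.Context`. This is not the opentelemetry-go API; `WithTraceID`, `TraceIDFrom`, `handle`, `persist` and `demo` are hypothetical names standing in for the p2p and persistence layers.

```go
package main

import (
	"context"
	"fmt"
)

// traceIDKey is an unexported key type so other packages cannot collide with it.
type traceIDKey struct{}

// WithTraceID returns a child context that carries the given trace ID.
func WithTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

// TraceIDFrom reads the trace ID back, reporting whether one was set.
func TraceIDFrom(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(traceIDKey{}).(string)
	return id, ok
}

// persist stands in for a persistence-layer call that logs with the trace ID.
func persist(ctx context.Context, key string) string {
	id, ok := TraceIDFrom(ctx)
	if !ok {
		id = "untraced"
	}
	return fmt.Sprintf("[trace=%s] persist %s", id, key)
}

// handle stands in for a p2p-layer handler; it only has to pass ctx along.
func handle(ctx context.Context, key string) string {
	return persist(ctx, key)
}

// demo runs one traced and one untraced call through the same code path.
func demo() (traced, untraced string) {
	ctx := WithTraceID(context.Background(), "abc123")
	return handle(ctx, "block-42"), handle(context.Background(), "block-42")
}

func main() {
	traced, untraced := demo()
	fmt.Println(traced)   // [trace=abc123] persist block-42
	fmt.Println(untraced) // [trace=untraced] persist block-42
}
```

Functions that do not accept a Context have no way to receive the ID, which is why the Context work in p2p & persistence is a natural prerequisite.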

I also believe that once @okdas implements our DevNet, we'll have more visibility into what the entry & exit points should be.

Unrelated, I think we can find some bounty-related work surrounding IBC, light clients, etc... 😉

pokt-network / pocket

[Telemetry] Distributed Tracing in a V1 Node #143