Open Olshansk opened 2 years ago
I recently had a really positive experience with OpenTelemetry for distributed tracing and SigNoz as an "APM" (application performance monitoring) dashboard, and wanted to share. The OpenTelemetry ecosystem contains "automatic instrumentation" libraries for several languages / protocols / frameworks. There's no such official library for Golang as far as I could tell, but there is a community library that supports stdlib net/http, gRPC, and Gorilla mux (Apache 2 license). I don't think there's anything there for us to take advantage of yet, but we could be in a position to contribute auto instrumentation for libp2p, for example.
For whoever picks this up in the future, I'd probably lean more on @bryanchriswhite's suggestion, but I just wanted to document a tool I came across for future reference.
Odigos - Observability Control Plane claims to "Generate distributed traces instantly for any application without code changes", and they have support for Prometheus & Loki.
Also, in my opinion, the hardest (still undefined) part of this ticket is determining the "entry & exit points + scope" of the distributed tracing ID so that it is useful across a single Pocket node, and potentially across different actors/nodes as well.
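To make the "across different actors/nodes" part a bit more concrete: one common way to carry a tracing ID between processes is the W3C Trace Context `traceparent` header, which is also what OpenTelemetry propagates by default. Below is a rough, stdlib-only sketch of encoding and decoding that value so it could ride along with a message between nodes. The `SpanContext` type and the `Encode` / `Decode` helpers are hypothetical names for illustration, not part of any library, and the parser is deliberately simplified (it only accepts version `00` and does not reject all-zero IDs).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// SpanContext is a hypothetical, minimal view of what must cross a node boundary.
type SpanContext struct {
	TraceID [16]byte // shared by every span in one distributed trace
	SpanID  [8]byte  // ID of the span that sent the message
	Sampled bool     // whether the trace is being recorded
}

// Encode renders the context as "00-<trace-id>-<span-id>-<flags>".
func (sc SpanContext) Encode() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(sc.TraceID[:]), hex.EncodeToString(sc.SpanID[:]), flags)
}

// Decode parses a version-00 traceparent value (simplified validation).
func Decode(h string) (SpanContext, error) {
	var sc SpanContext
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return SpanContext{}, errors.New("malformed traceparent")
	}
	if _, err := hex.Decode(sc.TraceID[:], []byte(parts[1])); err != nil {
		return SpanContext{}, err
	}
	if _, err := hex.Decode(sc.SpanID[:], []byte(parts[2])); err != nil {
		return SpanContext{}, err
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return SpanContext{}, err
	}
	sc.Sampled = flags[0]&0x01 == 1
	return sc, nil
}

func main() {
	// Example value in the traceparent format.
	in := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	sc, err := Decode(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(sc.Encode() == in, sc.Sampled)
}
```

Where exactly such a value gets created (the entry point) and dropped (the exit point) is still the open question; the sketch only shows that the ID itself is cheap to pass along once that is decided.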
@bryanchriswhite @Olshansk Although automatic instrumentation is not supported in Go, from a quick look it seems the Go OTel library is stable, but only for traces. This seems fair, as this PR is focused on distributed tracing, but it may be a problem if we wish to incorporate logging into the same workflow down the line: they mention in the repo that log support is not planned until metrics support is used in addition to traces. They also support exporting traces to Jaeger, Zipkin and OTLP. The exact tracing platform could be decided once we agree on whether to use this library (I do +1 @bryanchriswhite on this choice).
I would be interested in taking this on. I have previously worked with Cribl, Splunk and Dynatrace using the "3 pillars of observability", but I would need to do some more research on implementing it on our end.
As for determining the entry/exit points + scope of the tracing, this is a big thing. My thoughts on this would be to: 1) implement tracing as its own "module" of sorts that could be plugged in when and where it is needed; 2) identify a first ideal entry/exit point that is relatively simple; 3) in a follow-up PR, expand upon the above by integrating tracing throughout the codebase.
This seems like quite a large PR but could be a lot of fun, and a good opportunity to see where the community would like more observability into what is going on with their nodes etc. That feedback could also be a good way to find the ideal entry/exit points in general.
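As a rough illustration of point 1 (tracing as a pluggable "module"), here is a minimal stdlib-only sketch. Everything in it is hypothetical: `Tracer`, `NoopTracer`, `RecordingTracer` and `processBlock` are made-up names, not existing code in the repo or in OTel. The idea is only that callers depend on a small interface, so tracing can be switched off (no-op) or backed by a real exporter later without touching the call sites.

```go
package main

import "fmt"

// Tracer is a hypothetical minimal interface that node modules could depend on.
type Tracer interface {
	// Start marks the beginning of an operation and returns a func that ends it.
	Start(name string) (end func())
}

// NoopTracer is what gets plugged in when tracing is disabled.
type NoopTracer struct{}

// Start does nothing and returns an end func that does nothing.
func (NoopTracer) Start(string) func() { return func() {} }

// RecordingTracer keeps events in memory; a stand-in for a real exporter.
type RecordingTracer struct{ Events []string }

// Start records a start event and returns a func that records the end event.
func (r *RecordingTracer) Start(name string) func() {
	r.Events = append(r.Events, "start "+name)
	return func() { r.Events = append(r.Events, "end "+name) }
}

// processBlock is a made-up entry point used only to show the call pattern.
func processBlock(t Tracer, height int) {
	end := t.Start(fmt.Sprintf("processBlock/%d", height))
	defer end()
	// ... the real work would happen here ...
}

func main() {
	processBlock(NoopTracer{}, 1) // tracing off: no overhead beyond a call

	rec := &RecordingTracer{}
	processBlock(rec, 2)
	fmt.Println(rec.Events) // prints: [start processBlock/2 end processBlock/2]
}
```

With something like this in place, point 2 becomes picking one call site to wrap, and point 3 is repeating that across the codebase.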
Odigos - Observability Control Plane claims to Generate distributed traces instantly for any application without code changes and they have support for Prometheus & Loki.
This is an interesting project that seems to utilise OTel anyway; I will continue to look into whether this is a better solution. The "no code changes" claim sounds a little too good to be true 😅
Interesting links: [1] https://opentelemetry.io/docs/instrumentation/go/ [2] https://opentelemetry.io/docs/instrumentation/go/getting-started/ [3] https://github.com/keyval-dev/opentelemetry-go-instrumentation [4] https://github.com/exaring/otelpgx
opentelemetry-go definitely seems like a good approach here, and something we should revisit when picking this up.
I do think that this is a slightly lower priority for now. @deblasis and @bryanchriswhite are adding Context in the p2p & persistence libraries and, looking at the documentation, it seems like tracing may rely heavily on it:
_, span := otel.Tracer(name).Start(ctx, "Poll")
I also believe that once @okdas implements our DevNet, we'll have more visibility into what the entry & exit points should be.
Unrelated, I think we can find some bounty-related work surrounding IBC, light clients, etc... 😉
Objective
Add distributed tracing through the V1 Node.
Origin Document
Distributed Tracing is a pretty common pattern used to track the lifecycle and movement of a request throughout/across an infrastructure or codebase.
Goals / Deliverables
General issue checklist
Non-goals
Testing Methodology
- TODO: Need to define this
- make test_all
- LocalNet is still functioning correctly by following the instructions at docs/development/README.md
Creator: @Olshansk Co-Owners: @okdas @phthan0