Trace SDK observability

dashpole commented 2 years ago

Problem Statement

Context: https://github.com/kubernetes/enhancements/pull/3161#discussion_r790716355

For an application instrumented with OpenTelemetry for tracing and using the OTLP trace exporter, it isn't currently possible to monitor (with metrics) whether spans are being successfully collected and exported. For example, if my SDK cannot connect to an OpenTelemetry Collector and isn't able to send traces, I would like to be able to measure how many spans are collected versus how many fail to be sent. I would like to be able to set up SLOs to measure successful trace delivery from my applications.

Proposed Solution

After the metrics API is stable, collect metrics in the trace SDK using the metrics API. Specifics about the metrics deserve their own design, but I should be able to tell the volume of spans my application is generating, and the success rate of exporting them. This would be done via a new TracerProviderOption: WithMeterProvider(MeterProvider).
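As a rough illustration of the proposed option, here is a minimal sketch of the functional option pattern. Everything in it is a local stand-in: `MeterProvider`, `memProvider`, the `spans.started` counter name, and the shape of `TracerProviderOption` are assumptions made for this example, not the SDK's actual types.

```go
package main

// MeterProvider is a hypothetical stand-in for a metrics API handle.
type MeterProvider interface {
	// Counter returns a function that adds n to the named counter.
	Counter(name string) func(n int64)
}

// memProvider keeps counters in memory. It is not safe for concurrent use.
type memProvider struct{ counts map[string]int64 }

func (m *memProvider) Counter(name string) func(int64) {
	return func(n int64) { m.counts[name] += n }
}

type config struct{ mp MeterProvider }

// TracerProviderOption is modeled as a plain function for brevity.
type TracerProviderOption func(*config)

// WithMeterProvider mirrors the option named in the proposal.
func WithMeterProvider(mp MeterProvider) TracerProviderOption {
	return func(c *config) { c.mp = mp }
}

// TracerProvider is a toy provider that only counts started spans.
type TracerProvider struct{ started func(int64) }

func NewTracerProvider(opts ...TracerProviderOption) *TracerProvider {
	var c config
	for _, o := range opts {
		o(&c)
	}
	tp := &TracerProvider{started: func(int64) {}} // no-op when unconfigured
	if c.mp != nil {
		tp.started = c.mp.Counter("spans.started") // hypothetical metric name
	}
	return tp
}

// StartSpan records one started span.
func (tp *TracerProvider) StartSpan() { tp.started(1) }
```

With this shape, a provider built without the option still works and simply records nothing.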

Alternatives

We could add metrics to exporters individually, but most exporter-related metrics should be similar.

MrAlias commented 2 years ago

In the meantime, while we wait for metrics to be stable enough for this, I created this: https://github.com/MrAlias/flow

@dashpole let me know if that helps.

dashpole commented 2 years ago

Very cool. I'll take a look

dashpole commented 2 years ago

It probably isn't quite enough to meet the needs I have, but may be useful for others

thehackercat commented 2 years ago

we also need this.

MrAlias commented 7 months ago

@MadVikingGod is going to look into what metrics should be added and the feasibility of this feature.

MadVikingGod commented 7 months ago

Java does implement some metrics around the BatchSpanProcessor (BSP) and a generic wrapper for some exporters (at least gRPC, maybe more). The lists below note which metrics I found in Java.

How could this be implemented?

Experimental

To experiment without adding any API surface, we can start with an experimental environment variable that indicates whether the SDK should use the global metrics API. Doing this should allow us to explore the performance impact of any of the metrics while still maintaining compatibility.
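A minimal sketch of such a gate could look like the following. The variable name `EXAMPLE_X_TRACE_SDK_METRICS` and the "true" convention are assumptions for illustration only; no name has been agreed in this thread.

```go
package main

import (
	"os"
	"strings"
)

// envKey is a hypothetical variable name used only for illustration.
const envKey = "EXAMPLE_X_TRACE_SDK_METRICS"

// enabled reports whether the experimental metrics are switched on.
// The lookup function is injected so the check can be exercised
// without touching the process environment.
func enabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(envKey)
	return ok && strings.EqualFold(strings.TrimSpace(v), "true")
}

// metricsEnabled checks the real process environment.
func metricsEnabled() bool { return enabled(os.LookupEnv) }
```

Because the gate is read from the environment, it can be removed or renamed later without breaking any exported API.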

Option API

This would add a WithMeterProvider() option to every component that would produce these metrics. The option could act either as an enable signal (metrics are captured only if it is configured) or as an override signal (metrics are always captured, and the configured provider replaces the global one).
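The difference between the two behaviours can be sketched as two resolution functions. The `MeterProvider` interface, the `named` type, and the `global` variable below are stand-ins invented for this example.

```go
package main

// MeterProvider is a hypothetical stand-in for a metrics provider.
type MeterProvider interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

// global stands in for a process-wide default provider.
var global MeterProvider = named("global")

// resolveEnable models the "enable signal": metrics are recorded only
// when a provider was configured; nil means metrics stay off.
func resolveEnable(configured MeterProvider) (MeterProvider, bool) {
	if configured == nil {
		return nil, false
	}
	return configured, true
}

// resolveOverride models the "override signal": metrics are always on,
// and a configured provider replaces the global one.
func resolveOverride(configured MeterProvider) MeterProvider {
	if configured == nil {
		return global
	}
	return configured
}
```

The enable variant keeps the feature opt-in per component, while the override variant makes it on by default wherever a global provider is registered.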

This can realistically only be done for types that already use an option pattern, like the TracerProvider or the BatchSpanProcessor, which would leave some components, like the SimpleSpanProcessor, without an override. We won't need an option for Samplers, because we can measure the outcome of the sampling decision without instrumenting the sampler's internals.

If we were to add an option for both TP and BSP, we would need a new type that is the union of both option types, similar to SpanStartEventOption.
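One way to picture the union type is an interface that embeds both option interfaces, so a single value is accepted by either constructor. This is a sketch with local types only: the config structs, the `MeterOption` name, and the use of a string in place of a real meter provider are all assumptions.

```go
package main

type tpConfig struct{ meter string }
type bspConfig struct{ meter string }

// Each component has its own option interface.
type TracerProviderOption interface{ applyTP(*tpConfig) }
type BatchSpanProcessorOption interface{ applyBSP(*bspConfig) }

// MeterOption is the union: it satisfies both option interfaces.
type MeterOption interface {
	TracerProviderOption
	BatchSpanProcessorOption
}

type meterOption struct{ name string }

func (o meterOption) applyTP(c *tpConfig)   { c.meter = o.name }
func (o meterOption) applyBSP(c *bspConfig) { c.meter = o.name }

// WithMeterProvider returns one option usable in both places.
// A string stands in for the provider to keep the sketch small.
func WithMeterProvider(name string) MeterOption {
	return meterOption{name: name}
}

func newTPConfig(opts ...TracerProviderOption) tpConfig {
	var c tpConfig
	for _, o := range opts {
		o.applyTP(&c)
	}
	return c
}

func newBSPConfig(opts ...BatchSpanProcessorOption) bspConfig {
	var c bspConfig
	for _, o := range opts {
		o.applyBSP(&c)
	}
	return c
}
```

The same `WithMeterProvider("x")` value can then be passed to `newTPConfig` and to `newBSPConfig` without any conversion.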

What should be instrumented

This is a non-exhaustive list of things that could be captured

From the Tracer

Number of Spans Started
- Was it Sampled
Number of Spans Ended

From the BSP

Number of spans exported (This is in Java)
Number of Spans Dropped (This is in Java)
Number of Spans currently in the queue (This is in Java)
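
To make the BSP measurements concrete, here is a small sketch of a bounded queue that tracks exported, dropped, and currently queued spans. The `queue` type and its methods are invented for this example and do not reflect how the SDK's BatchSpanProcessor is implemented.

```go
package main

import "sync/atomic"

// Span is a minimal stand-in for a finished span.
type Span struct{ Name string }

// queue is a bounded buffer that counts what happens to spans.
type queue struct {
	ch       chan Span
	dropped  atomic.Int64 // spans discarded because the queue was full
	exported atomic.Int64 // spans handed to a successful export call
}

func newQueue(size int) *queue { return &queue{ch: make(chan Span, size)} }

// Enqueue never blocks; a full queue drops the span and counts it.
func (q *queue) Enqueue(s Span) bool {
	select {
	case q.ch <- s:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

// Queued reports the current queue length (a gauge-style value).
func (q *queue) Queued() int { return len(q.ch) }

// Drain exports up to limit queued spans in one batch. Spans in a
// failed batch are not counted as exported.
func (q *queue) Drain(limit int, export func([]Span) error) error {
	batch := make([]Span, 0, limit)
loop:
	for len(batch) < limit {
		select {
		case s := <-q.ch:
			batch = append(batch, s)
		default:
			break loop // queue is empty
		}
	}
	if len(batch) == 0 {
		return nil
	}
	if err := export(batch); err != nil {
		return err
	}
	q.exported.Add(int64(len(batch)))
	return nil
}
```

Comparing the dropped count against the queue size is the kind of signal that would help when tuning batch and queue settings.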

From an exporter

Number of Exports
Number of Retries
Duration of Export
Number of Spans Exported
Number of Spans rejected
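
The exporter measurements listed above could be gathered by a wrapper around any exporter, which is roughly the generic-wrapper approach mentioned for Java. The sketch below uses a simplified `Exporter` interface, a `Stats` struct, and a `flaky` test exporter that are all assumptions for illustration; it is not the SDK's exporter interface.

```go
package main

import (
	"errors"
	"time"
)

// Span is a minimal stand-in for a finished span.
type Span struct{ Name string }

// Exporter is a simplified, hypothetical exporter interface. It returns
// how many spans the backend rejected, or an error if the call failed.
type Exporter interface {
	ExportSpans(spans []Span) (rejected int, err error)
}

// Stats holds the measurements. It is not safe for concurrent use.
type Stats struct {
	Exports, Retries, SpansExported, SpansRejected int64
	TotalDuration                                  time.Duration
}

// instrumented wraps another exporter and records Stats.
type instrumented struct {
	next       Exporter
	maxRetries int
	stats      Stats
}

func (e *instrumented) ExportSpans(spans []Span) (int, error) {
	start := time.Now()
	defer func() { e.stats.TotalDuration += time.Since(start) }()
	e.stats.Exports++

	var rejected int
	var err error
	for attempt := 0; ; attempt++ {
		rejected, err = e.next.ExportSpans(spans)
		if err == nil || attempt >= e.maxRetries {
			break
		}
		e.stats.Retries++
	}
	if err != nil {
		// How to classify spans from a failed export is left open here.
		return 0, err
	}
	e.stats.SpansExported += int64(len(spans) - rejected)
	e.stats.SpansRejected += int64(rejected)
	return rejected, nil
}

// flaky fails the first `failures` calls, then rejects `reject` spans.
type flaky struct{ failures, reject int }

func (f *flaky) ExportSpans(spans []Span) (int, error) {
	if f.failures > 0 {
		f.failures--
		return 0, errors.New("transient export failure")
	}
	return f.reject, nil
}
```

Because the wrapper only depends on the exporter interface, the same counters apply to every exporter, which matches the point that most exporter-related metrics should be similar.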

logan-stytch commented 4 months ago

We're very interested in this feature so we can tune our Batcher to ensure it doesn't inadvertently drop spans. I started a WIP PR (#5201), but it definitely needs some guidance. If there are already plans to release metrics in the near-to-mid-term, we can wait, but otherwise, this seemed like a well-scoped area where we could help contribute (especially using the Java implementation as a reference).
