open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Proposal: OTel SDK should expose a metric to inform about sampling decisions made #5756

Open samsp-msft opened 3 months ago

samsp-msft commented 3 months ago

Package

OpenTelemetry

Is your feature request related to a problem?

Imagine that you have added a sampler to your trace configuration, something like:

    builder.Services.AddOpenTelemetry()
        .WithTracing(tracing =>
        {
            tracing.AddAspNetCoreInstrumentation()
                .AddHttpClientInstrumentation()
                .SetSampler(new ParentBasedSampler(new RateLimitingSampler(3)));
        });

This results in sampling decisions that depend on whether the incoming request carries a trace header: if it does, the sampler honors the sampling flag in that header; if not, it samples at most 3 requests per second.

When you look at the output you get a mix of traces, but you don't get a good understanding of which traces were dropped and why.

What is the expected behavior?

I am suggesting that we add a new metric, opentelemetry.trace.sampler.count, implemented by the OTel SDK, that reports the number of Activities the sampler was called for and the sampling result. It should be dimensioned with:

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| sampling.decision | string | The final verdict as to how the activity/span was flagged for sampling | drop, record_only, record_and_sample |
| span.parent.is_remote | bool | Whether the parent is a remote span or local | true |
| span.parent.recorded | bool | Whether the parent trace has the recorded flag set or not | false |
| span.name | string | The name of the span that is being sampled | Microsoft.AspNetCore.Hosting.HttpRequestIn |
| sampler.description | string | The description from the sampler that made the decision | fixedratesampler{0.2} |

The span.name may be too varied to be suitable for use in a metric, in which case we should use the ActivitySource.Name instead. The goal is to give the observer some idea of which spans/activities are being sampled each way.

The expected use of the metric is to provide observability into the trace sampling decisions made by the SDK. By looking at sampling.decision, you get a measure of how many Activities are dropped, how many are only recorded, and how many are recorded and sampled (emitted). The ratios of these numbers should match what you have configured in the sampler, given the incoming request rate.

The additional fields enable better diagnostics as to why a sampling decision was made, while being limited to fields whose values are constrained enough for use as metric dimensions.
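
To make the proposal concrete, here is a rough sketch of how the metric could be approximated today in application code by wrapping an existing sampler. This is not how the SDK would implement it. `CountingSampler` and the meter name `Example.Sampling` are hypothetical names chosen for illustration, and `span.name` is left out because of the cardinality concern noted above.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace;

// Hypothetical wrapper: delegates to an inner sampler and counts each decision.
public sealed class CountingSampler : Sampler
{
    // Meter name is an assumption; the metric name follows the proposal.
    private static readonly Meter SamplingMeter = new("Example.Sampling");
    private static readonly Counter<long> Decisions =
        SamplingMeter.CreateCounter<long>("opentelemetry.trace.sampler.count");

    private readonly Sampler inner;

    public CountingSampler(Sampler inner)
    {
        this.inner = inner;
        this.Description = inner.Description;
    }

    public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
    {
        SamplingResult result = this.inner.ShouldSample(in samplingParameters);
        ActivityContext parent = samplingParameters.ParentContext;

        // Map the SDK enum to the string values proposed for sampling.decision.
        string decision = result.Decision switch
        {
            SamplingDecision.Drop => "drop",
            SamplingDecision.RecordOnly => "record_only",
            _ => "record_and_sample",
        };

        // A root span has a default parent context, so both parent tags are false.
        bool parentRecorded = (parent.TraceFlags & ActivityTraceFlags.Recorded) != 0;

        Decisions.Add(
            1,
            new KeyValuePair<string, object?>("sampling.decision", decision),
            new KeyValuePair<string, object?>("span.parent.is_remote", parent.IsRemote),
            new KeyValuePair<string, object?>("span.parent.recorded", parentRecorded),
            new KeyValuePair<string, object?>("sampler.description", this.inner.Description));

        return result;
    }
}
```

To try it, pass the existing sampler to the wrapper in SetSampler, for example `new CountingSampler(new ParentBasedSampler(new AlwaysOnSampler()))`, and register the meter with `AddMeter("Example.Sampling")` in the metrics configuration so the counter is exported. A built-in metric would avoid every application needing a wrapper like this.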

Which alternative solutions or features have you considered?

While the sampling state is available through EventSource events, those are not easily monitored, so including this information in the metrics the SDK already produces makes more sense to me.

Additional context

This feels like something that would apply to other languages/sdks where head-based sampling occurs. The same metric could also be used for tail sampling in a component like the OTel collector.

samsp-msft commented 3 months ago

Tagging some folks: @reyang @lmolkova @CodeBlanch @cijothomas @noahfalk @kalyanaj

kalyanaj commented 3 months ago

Tagging @jmacd (to evaluate/consider this feedback for the specification).

cijothomas commented 3 months ago

I think this issue is best moved to the spec/semantic conventions repo, as it is independent of the language implementation.