Proposal: OTel SDK should expose a metric to inform about sampling decisions made

samsp-msft commented 3 months ago

Package

OpenTelemetry

Is your feature request related to a problem?

Imagine the scenario that you have added a sampler to your Trace configuration, something like:

    builder.Services.AddOpenTelemetry()
        .WithTracing(tracing =>
        {
            tracing.AddAspNetCoreInstrumentation()
                .AddHttpClientInstrumentation()
                .SetSampler(new ParentBasedSampler(new RateLimitingSampler(3)));
        });

With this configuration, the sampling decision depends on whether the incoming request carries a trace header: if it does, the sampler honors that header; if not, it samples at most 3 requests per second.

The output then contains a mix of traces, but nothing tells you which traces were dropped or why.

What is the expected behavior?

I am suggesting that we have a new metric, `opentelemetry.trace.sampler.count`, implemented by the OTel SDK, that provides details about the number of Activities that the sampler was called for and the sampling result. It should be dimensioned with:

| Name | Type | Description | Examples |
| --- | --- | --- | --- |
| `sampling.decision` | string | The final verdict as to how the activity/span was flagged for sampling | `drop`, `record_only`, `record_and_sample` |
| `span.parent.is_remote` | bool | Whether the parent is a remote span or local | `true` |
| `span.parent.recorded` | bool | Whether the parent trace has the recorded flag set or not | `false` |
| `span.name` | string | The name of the span that is being sampled | `Microsoft.AspNetCore.Hosting.HttpRequestIn` |
| `sampler.description` | string | The description from the sampler that made the decision | `fixedratesampler{0.2}` |

The `span.name` values may be too varied (too high in cardinality) to be suitable as a metric dimension, in which case we should use the `ActivitySource.Name` instead. The goal is to give the observer some idea of which spans/activities are being sampled each way.

The metric is intended to provide observability into the trace sampling decisions made by the SDK. Breaking the count down by `sampling.decision` shows how many Activities are dropped, how many are only recorded, and how many are recorded and sampled. The ratios of these numbers should match what you have configured in the sampler, given the incoming request rate.
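
As a rough worked example of that ratio check, assume (hypothetically) that 50 root requests per second arrive without a trace header and that the limit of 3 per second from the configuration above applies. In steady state, the expected share of `record_and_sample` among those root spans would be roughly:

$$
\frac{\min(R, L)}{R} = \frac{\min(50, 3)}{50} = 0.06
$$

where $R$ is the incoming root request rate and $L$ is the configured limit, both per second. Requests that arrive with a parent context follow the parent's flag and are not covered by this estimate.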

The additional fields are there to enable better diagnostics of why a sampling decision was made, limited to fields whose values are constrained enough to be used as metric dimensions.
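
To make the proposal more concrete, here is a minimal sketch of how the metric could be approximated today in user code by wrapping an existing sampler. It is not how the SDK would implement it. `CountingSampler` and the meter name `Example.Sampling` are hypothetical names chosen for illustration, and the sketch assumes the `OpenTelemetry` NuGet package is referenced. It leaves out `span.name` because of the cardinality concern above, and it can only report the description of the wrapped (outermost) sampler, not the inner sampler that actually made the decision.

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace;

// Hypothetical decorator: delegates to an inner sampler and counts each decision.
public sealed class CountingSampler : Sampler
{
    // The meter name is an assumption; the instrument name is the one proposed above.
    private static readonly Meter SamplerMeter = new("Example.Sampling");
    private static readonly Counter<long> Decisions =
        SamplerMeter.CreateCounter<long>("opentelemetry.trace.sampler.count");

    private readonly Sampler inner;

    public CountingSampler(Sampler inner)
    {
        this.inner = inner;
        this.Description = inner.Description;
    }

    public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
    {
        SamplingResult result = this.inner.ShouldSample(in samplingParameters);

        // With no parent, ParentContext is default: not remote, no flags set.
        ActivityContext parent = samplingParameters.ParentContext;

        var tags = new TagList
        {
            { "sampling.decision", ToTagValue(result.Decision) },
            { "span.parent.is_remote", parent.IsRemote },
            { "span.parent.recorded", (parent.TraceFlags & ActivityTraceFlags.Recorded) != 0 },
            { "sampler.description", this.inner.Description },
        };
        Decisions.Add(1, tags);

        return result;
    }

    // Maps the SDK enum to the tag values listed in the table above.
    private static string ToTagValue(SamplingDecision decision) => decision switch
    {
        SamplingDecision.Drop => "drop",
        SamplingDecision.RecordOnly => "record_only",
        SamplingDecision.RecordAndSample => "record_and_sample",
        _ => "unknown",
    };
}
```

To try it, the sampler from the configuration above would be wrapped as `SetSampler(new CountingSampler(new ParentBasedSampler(new RateLimitingSampler(3))))`, and the meter would be enabled with `AddMeter("Example.Sampling")` in the metrics configuration. A built-in implementation could attribute the decision to the specific inner sampler, which this wrapper cannot do.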

Which alternative solutions or features have you considered?

While the sampling state is available through EventSource events, that is not easily monitored, and so including this in the metrics already produced makes more sense to me.

Additional context

This feels like something that would apply to other languages/SDKs where head-based sampling occurs. The same metric could also be used for tail sampling in a component like the OTel Collector.

samsp-msft commented 3 months ago

Tagging some folks: @reyang @lmolkova @CodeBlanch @cijothomas @noahfalk @kalyanaj

kalyanaj commented 3 months ago

Tagging @jmacd (to evaluate/consider this feedback for the specification).

cijothomas commented 3 months ago

I think this issue is best moved to spec/sem.convention repo, as this is independent of language implementation.
