open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Support making the sampling decision at any point in the span lifecycle #4171

Open hmleo opened 1 month ago

hmleo commented 1 month ago

In the current head sampling mechanism, the samplingResult is determined at Instrumenter.start(), for example by a parent-based or traceIdRatio sampler.
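
To make the timing concrete, here is a minimal sketch of a head sampling decision taken once at span start. It is not the SDK's implementation: `ratio_decision`, `head_decision`, and `TRACE_ID_LIMIT` are hypothetical names, and the ratio check on the low 64 bits of the trace ID is an assumption chosen for illustration.

```python
from typing import Optional

# Assumption: only the low 64 bits of the trace ID feed the ratio check.
TRACE_ID_LIMIT = (1 << 64) - 1


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Hypothetical ratio sampler: the same trace ID always gets the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def head_decision(trace_id: int, parent_sampled: Optional[bool], ratio: float) -> bool:
    """Decide once, at span start. Later events such as exceptions are never seen."""
    if parent_sampled is not None:
        # Parent-based: inherit whatever the parent decided.
        return parent_sampled
    # Root span: fall back to the ratio sampler.
    return ratio_decision(trace_id, ratio)
```

Nothing in `head_decision` can observe an exception thrown later by the instrumented method, which is the limitation described here.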

After the samplingResult has been created, the instrumented target method may still throw an exception, and exceptions are usually the information we care about most. So I think it would be ideal if the sampling decision could be made at any point in the span lifecycle (e.g. create / set attributes / end).

For example, consider two scenarios:

1. Tail sampling: we get 100% of the spans in a trace, but sampling all error spans requires 100% sampling.
2. New head sampling: the agent can make the sampling decision during the span lifecycle. This does not require 100% sampling. It may lose spans from other services in the trace, so we might get only 50% of the spans in one trace, but that 50% is the important part. I think that is acceptable.

So, could sampling decisions be supported throughout the span lifecycle?

danielgblanco commented 1 month ago

Thanks @hmleo. Trace sampling aims to result in complete traces (or at least complete subtraces). If the sampling decision is postponed to any point after span creation (e.g. when the instrumented method raises an exception) then there would be no guarantees for trace completeness. This would not only result in missing spans at the root of a sub-trace, but also within it. For instance, consider this case where we have parent-based samplers configured:

Span_A (not sampled)
|____Span_AA (originally not sampled as parent not sampled, then sampled after exception)
|         |____Span_AAA (not sampled as parent not sampled)
|         |________________Span_AAB (sampled after the parent's sampling decision changed)
|________________Span_AB (not sampled as parent not sampled)

If we allowed the sampling decision to be changed after span creation, this would be represented as:

Span_AA (orphan span)
|________________Span_AAB

As you can see, this would not only leave Span_AA without its parent (which may be acceptable) but also lose information under Span_AA that may be critical for reconstructing the sequence of events, especially if the output of the operation represented by Span_AAA is used within the operation represented by Span_AAB. This scenario (missing leaf spans) would also be hard to detect, as we wouldn't know which spans are missing.
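
The scenario can be reproduced with a small simulation. This is only a sketch under stated assumptions: `Span`, `start_span`, `simulate`, and `exported_view` are hypothetical names, and flipping `sampled` on an already started span stands in for the proposed late decision, which no current sampler offers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional["Span"]
    sampled: bool = False


def start_span(name: str, parent: Optional[Span], root_sampled: bool = False) -> Span:
    # Parent-based rule: a child copies its parent's decision at creation time.
    sampled = parent.sampled if parent is not None else root_sampled
    return Span(name, parent, sampled)


def simulate() -> List[Span]:
    span_a = start_span("Span_A", None)         # root, not sampled
    span_aa = start_span("Span_AA", span_a)     # inherits "not sampled"
    span_aaa = start_span("Span_AAA", span_aa)  # started before the exception
    span_aa.sampled = True                      # hypothetical late decision after an exception
    span_aab = start_span("Span_AAB", span_aa)  # started after the flip
    span_ab = start_span("Span_AB", span_a)     # sibling, parent still not sampled
    return [span_a, span_aa, span_aaa, span_aab, span_ab]


def exported_view(spans: List[Span]) -> Dict[str, Optional[str]]:
    """Map each exported span to its exported parent (None marks an orphan)."""
    exported = {s.name for s in spans if s.sampled}
    return {
        s.name: (s.parent.name if s.parent and s.parent.name in exported else None)
        for s in spans
        if s.sampled
    }
```

Calling `exported_view(simulate())` yields only Span_AA (as an orphan) and Span_AAB. Span_AAA is dropped because it copied its parent's decision before the flip, and nothing in the exported data records that it ever existed.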

Having completeness guarantees is important, and it is something that Consistent probability sampling aims to solve for head-based samplers (for a different use case). For your use case, the recommendation is to use an out-of-process sampling mechanism, such as the tail sampling processor in the Collector. This would allow complete traces to be generated.
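
As a rough illustration of why an out-of-process approach keeps traces complete, here is a sketch of the buffering idea. It is not the Collector's implementation: `FinishedSpan`, `ErrorTailSampler`, and `flush` are hypothetical names, and the "keep the trace if any span is an error" policy is an assumption based on the use case in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FinishedSpan:
    trace_id: str
    name: str
    is_error: bool


class ErrorTailSampler:
    """Hypothetical buffer that decides per trace, not per span."""

    def __init__(self) -> None:
        self._buffer: Dict[str, List[FinishedSpan]] = defaultdict(list)

    def on_span_end(self, span: FinishedSpan) -> None:
        # Every span is buffered, so upstream services must export 100% of spans.
        self._buffer[span.trace_id].append(span)

    def flush(self, trace_id: str) -> List[FinishedSpan]:
        """Called once the trace is assumed complete (e.g. after a wait period)."""
        spans = self._buffer.pop(trace_id, [])
        # Keep the whole trace if any span failed, otherwise drop all of it.
        keep = any(s.is_error for s in spans)
        return spans if keep else []
```

Because the decision is taken per trace after all spans have arrived, a kept trace includes the spans that finished before the error as well as those after it.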

Let me know if this solves your needs. In any case, I'll leave this up for the community to give more feedback to be considered as part of this issue.