open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.12k stars 2.39k forks source link

Add policy to spans in sampled traces #35180

Open kawaiwanyelp opened 2 months ago

kawaiwanyelp commented 2 months ago

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

There is no currently no way to tell by looking at a sampled trace which policy sampled it. If we could, we'd be able to analyze traces by policy, direct traces to different pipelines by policy, and better understand the sampling behavior of this processor.

Describe the solution you'd like

I propose that we add the name of the first top-level policy that returns a positive sampling decision to each span in a trace as an span context attribute. For example, given the following policies:

policies:
  tail_sampling:
    policies:
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      - name: http-error-policy
        type: and
        and:
          and_sub_policy:
            - name: error-code-policy
              type: ottl_condition
              ottl_condition: 
                span:
                  - 'IsMatch(attributes["http.response.status_code"], "^[45][0-9][0-9]$")'
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 100

And given a trace which includes a span with a 404 status code. Say the probabilistic-policy returned a NotSampled decision, but the http-error-policy returned a Sampled decision. A span in this trace would look like:

{
    "name": "client_request",
    "kind": "SpanKind.CLIENT",
    "attributes": {
        "http.response.status_code": 404,
        "sampling.policy": "http-error-policy"  // new attribute added
    },
    ...
}

Describe alternatives you've considered

Additional context

I'd be willing to create a PR for this if maintainers agree that this is a good idea.

github-actions[bot] commented 2 months ago

Pinging code owners:

jpkrohling commented 2 months ago

I'm not keen on adding another attribute to all spans passing through the tail-sampling processor but I'd be willing to look at PR for this if this is placed behind a feature gate.

andy-paine-numan commented 2 months ago

This would be really useful for debugging new sampling configs in particular.

Add the policy to just the root span: This would mean less work for the processor to do, but wouldn't be great if the root span was somehow dropped or lost

I don't think the risk of losing the root span is very high here since we are using the tailsamplingprocessor, we already know that we've ingested the root span into our collector pipeline for the policies to run. Given this would mean only adding a single resource attribute/attribute rather than adding attributes to every span, this feels like a good solution to me.

djluck commented 2 weeks ago

Just to add that we have a use case where such an attribute could be very helpful: understanding latency distributions. We combine the probabilistic policy with the latency policy and when attempting to analyze operation latency for our services, it's hard to get an accurate picture as the latency policy (as by design) biases towards slow traces.

If the policy was recorded as an attribute, we'd be able to calculate latency over only the traces captured by the probabilistic policy.

djluck commented 2 weeks ago

@jpkrohling I've made an attempt at resolving this issue in this PR, I'd be very grateful if you could review this at some point.

mrsimo commented 1 week ago

We'd absolutely love this feature, even better if we can enable it conditionally. Every time I need to troubleshoot our tail sampling configuration I'd kill for it.