open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Tail-sampling composite policy should have a flag to fill unused sub policy budgets. #35971

Open Tarmander opened 1 month ago

Tarmander commented 1 month ago

Component(s)

processor/tailsampling

Is your feature request related to a problem? Please describe.

For our tail-sampling use case, we use the composite policy to set a rate limit and bucket spans by different criteria. One disadvantage of this approach is that a percentage of the rate is "reserved" for each bucket (for example, high-latency spans). If there are no high-latency spans for a period of time, that sub-policy's share of the limit goes unused.

In practice, this means our post-tail-sampling throughput is much lower than the budget we have set.

Describe the solution you'd like

We would like a flag on the composite policy so that, if a given sub-policy does not use its full budget, the unused portion is added to an always-sample policy.

This could look like: fill_remaining_budget: true.
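
As a rough illustration of where such a flag might live, here is a trimmed-down sketch of a composite policy with the proposed option set. Note that fill_remaining_budget is hypothetical: it is only the name suggested in this request, it does not exist in the processor today, and its placement at the composite level is an assumption. The sub-policies and percentages are cut down from our real configuration for brevity.

```yaml
# Hypothetical sketch: fill_remaining_budget is the option proposed in this
# issue. It is NOT an existing tail_sampling setting, and its placement
# under `composite` is an assumption.
tail_sampling/catchall:
  policies:
    - name: composite-policy-catchall
      type: composite
      composite:
        max_total_spans_per_second: 4000
        # Proposed: budget that a sub-policy leaves unused is handed to the
        # always_sample sub-policy instead of being lost.
        fill_remaining_budget: true
        policy_order: [ latency-policy, always-sample-remaining-policy ]
        composite_sub_policy:
          - name: latency-policy
            type: latency
            latency:
              threshold_ms: 400
          - name: always-sample-remaining-policy
            type: always_sample
        rate_allocation:
          - policy: latency-policy
            percent: 20
          - policy: always-sample-remaining-policy
            percent: 80
```

With these numbers, latency-policy is allotted 20% of 4000, or 800 spans per second. If no high-latency spans arrive, the intent is that those 800 spans per second become available to always-sample-remaining-policy rather than going unused.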

Describe alternatives you've considered

One alternative is to artificially raise our max_total_spans_per_second value above our actual SPS budget to reach the throughput we want. The disadvantage is that when all sub-policies are filled to capacity, we will be well over budget.

Additional context

Current tail sampling configuration:

tail_sampling/catchall:
  decision_wait: 120s
  num_traces: 1000000
  policies:
    - name: composite-policy-catchall
      type: composite
      composite:
        max_total_spans_per_second: 4000
        policy_order: [ latency-policy, http-error-policy, exception-policy, probabilistic-policy, always-sample-remaining-policy ]
        composite_sub_policy:
          - name: latency-policy
            type: latency
            latency:
              threshold_ms: 400
          - name: http-error-policy
            type: numeric_attribute
            numeric_attribute:
              key: http.status_code
              min_value: 400
              max_value: 600
          - name: exception-policy
            type: string_attribute
            string_attribute:
              key: exception.message
              values: [ .* ]
              enabled_regex_matching: true
          - name: probabilistic-policy
            type: probabilistic
            probabilistic:
              sampling_percentage: 40
          - name: always-sample-remaining-policy
            type: always_sample
        rate_allocation:
          - policy: latency-policy
            percent: 20
          - policy: http-error-policy
            percent: 10
          - policy: exception-policy
            percent: 10
          - policy: probabilistic-policy
            percent: 20
          - policy: always-sample-remaining-policy
            percent: 40
github-actions[bot] commented 1 month ago

Pinging code owners: