open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

metric otelcol_processor_tail_sampling_count_traces_sampled showing wrong values based on policy, sampled labels #27567

Closed: 0x006EA1E5 closed this 1 year ago

0x006EA1E5 commented 1 year ago

Component(s)

processor/tailsampling

What happened?

Description

For the tail sampling processor's metric otelcol_processor_tail_sampling_count_traces_sampled, we get the same value for every policy. I believe it should count sampled="true" and sampled="false" separately for each policy, indicating whether that particular policy matched a given trace.
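To illustrate, here is a minimal sketch of the per-policy recording I would expect (illustrative only, not the processor's actual code; the names are my own), assuming OpenCensus-style internal telemetry where the Prometheus endpoint adds the otelcol_ prefix:

package sampling

import (
    "context"

    "go.opencensus.io/stats"
    "go.opencensus.io/tag"
)

var (
    tagPolicyKey  = tag.MustNewKey("policy")
    tagSampledKey = tag.MustNewKey("sampled")

    // Exposed on the Prometheus endpoint as
    // otelcol_processor_tail_sampling_count_traces_sampled.
    statCountTracesSampled = stats.Int64(
        "processor_tail_sampling_count_traces_sampled",
        "Count of traces that were sampled or not per sampling policy",
        stats.UnitDimensionless)
)

// recordPolicyDecision increments the counter for one specific policy,
// tagged with that policy's own decision rather than the overall
// "did any policy match" decision.
func recordPolicyDecision(ctx context.Context, policyName string, sampled bool) {
    sampledStr := "false"
    if sampled {
        sampledStr = "true"
    }
    _ = stats.RecordWithTags(ctx,
        []tag.Mutator{
            tag.Upsert(tagPolicyKey, policyName),
            tag.Upsert(tagSampledKey, sampledStr),
        },
        statCountTracesSampled.M(1),
    )
}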

Steps to Reproduce

Create a config file for the collector with the tail_sampling processor and two distinct policies: for example, one for error traces and one for high-latency traces. Also enable the Prometheus scrape endpoint. Something like:

processors:
  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 100
    policies:
      [
        {
          name: sample-all-error-traces,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: sample-all-high-latency,
          type: latency,
          latency: { threshold_ms: 3000 }
        }
      ]

service:
  telemetry:
    metrics:
      level: detailed
      address: "0.0.0.0:8888"
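To run the collector with this config, assuming it is saved as config.yaml and that an otlp receiver is also configured (as in the full configuration further down), something like:

docker run --rm -p 4317:4317 -p 8888:8888 -v "$PWD/config.yaml:/cfg.yaml" otel/opentelemetry-collector:0.86.0 --config /cfg.yaml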

Send a low-latency error trace. Check the metrics endpoint, e.g., curl localhost:8888/metrics | grep otelcol_processor_tail_sampling_count_traces_sampled. You might have to wait a while to see anything (the decision_wait time?). Eventually you should see two lines:

otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="true",service_instance_id="b8487c8c-4c5c-4334-9228-ee8f0389f8c0",service_name="otelcol-ocado",service_version="0.86.0"} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="true",service_instance_id="b8487c8c-4c5c-4334-9228-ee8f0389f8c0",service_name="otelcol-ocado",service_version="0.86.0"} 1

Note that for both policies sampled="true" and the count is 1, even though only one policy should have matched.

Now send a "high latency" trace that should match the other policy. When the metrics update, both policies (with label sampled="true") will read 2.
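For reference, a minimal sketch of how these two test traces could be generated with the OpenTelemetry Go SDK (assumptions: the collector's OTLP gRPC receiver listens on localhost:4317, and the latency policy compares span start/end timestamps, so the second trace is simply backdated by ~4s to exceed the 3000 ms threshold):

package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/trace"
)

func main() {
    ctx := context.Background()
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithInsecure(),
        otlptracegrpc.WithEndpoint("localhost:4317"))
    if err != nil {
        log.Fatal(err)
    }
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    defer func() { _ = tp.Shutdown(ctx) }()
    tracer := tp.Tracer("tail-sampling-repro")

    // Trace 1: low latency, ERROR status -> should match only the status_code policy.
    _, errSpan := tracer.Start(ctx, "low-latency-error")
    errSpan.SetStatus(codes.Error, "boom")
    errSpan.End()

    // Trace 2: ~4s duration, OK status -> should match only the latency policy.
    start := time.Now().Add(-4 * time.Second)
    _, slowSpan := tracer.Start(ctx, "high-latency-ok", trace.WithTimestamp(start))
    slowSpan.End(trace.WithTimestamp(start.Add(4 * time.Second)))
}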

Expected Result

In the above scenario, when we have only sent the single error trace, we should see the metric for policy="sample-all-error-traces",sampled="true" at 1, and for policy="sample-all-high-latency",sampled="false" (note sampled is false) at 1.

Then, when we send a high-latency trace, we should see 4 lines: two for each policy, one with sampled="true" and one with sampled="false".
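Concretely, reusing the label sets from above (remaining labels elided), something like:

otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="true",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-error-traces",sampled="false",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="true",...} 1
otelcol_processor_tail_sampling_count_traces_sampled{policy="sample-all-high-latency",sampled="false",...} 1

The error trace counts as sampled="true" for the error policy and sampled="false" for the latency policy, and vice versa for the high-latency trace.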

Actual Result

For every trace that matches any policy (i.e., the trace is sampled), the sampled="true" counter increments for every policy. Likewise, for every trace that does not match any policy (i.e., the trace is not sampled), the sampled="false" counter increments for every policy.

Collector version

0.86.0

Environment information

Environment

docker image otel/opentelemetry-collector:0.86.0@sha256:b8733b43d9061e3f332817f9f373ba5bf59803e67edfc4e70f280cb0afb49dd5

OpenTelemetry Collector configuration

extensions:
  memory_ballast:
    size_in_percentage: 34
  zpages:
    endpoint: 0.0.0.0:55679
  health_check:

receivers:
  otlp:
    protocols:
      grpc:
      http:
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]

processors:
  batch/metrics:
  batch/traces:
    send_batch_max_size: 1500
    send_batch_size: 1000
  tail_sampling:
    decision_wait: 30s
    expected_new_traces_per_sec: 100
    policies:
      [
        {
          name: sample-10%-all-traces,
          type: probabilistic,
          probabilistic: {sampling_percentage: 10}
        },
        {
          name: sample-all-error-traces,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: sample-all-high-latency,
          type: latency,
          latency: { threshold_ms: 3000 }
        }
      ]
  memory_limiter:
    check_interval: 1s          # 1s is the default value
    limit_percentage: 80        # This defines the hard limit
    spike_limit_percentage: 20  # Soft limit = limit_percentage - spike_limit_percentage

exporters:
  logging:
  otlp/grafana:
    endpoint: http://grafana
    headers:
      authorization: Basic ${TEMPO_BASIC_AUTH}
  prometheusremotewrite:
    endpoint: http://prometheus
    remote_write_queue:
      enabled: true
      queue_size: 5000  # increase the queue size to 5000
    target_info:
      enabled: false
    resource_to_telemetry_conversion:
      enabled: true

service:
  telemetry:
    metrics:
      level: detailed
      address: "0.0.0.0:8888"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch/traces]
      exporters: [otlp/grafana, logging]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
  extensions: [memory_ballast, zpages, health_check]

Log output

No response

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

0x006EA1E5 commented 1 year ago

/label processor/tailsampling

github-actions[bot] commented 1 year ago

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

0x006EA1E5 commented 1 year ago

Oh, I guess this is a duplicate of https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/25882 🤦

crobert-1 commented 1 year ago

@0x006EA1E5 Should we close this issue as a duplicate, or do you want to re-test on v0.87.0 first to make sure the issue has been resolved?

0x006EA1E5 commented 1 year ago

Looks like this is already fixed in 0.87.0, but I will check on Monday if that's okay?

0x006EA1E5 commented 1 year ago

closing as fixed