open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[tailsampling] Error sending late arrived spans. Failed to find dimensions. #18025

Closed edenkoveshi closed 1 year ago

edenkoveshi commented 1 year ago

Component(s)

processor/tailsampling

What happened?

Description

Hi, I am using the OpenTelemetry Collector with the tail sampling processor (along with some other processors), deployed with the OpenTelemetry Kubernetes Operator. Traces are generated mainly using the auto-instrumentation feature. I am receiving the following warnings:

tailsamplingprocessor@v0.66.0/processor.go:379 Error sending late arrived spans to destination {"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "policy": "error" , "error": "value not found in metricsKeyToDimensions cache by key \"**some-service**\x00some-operation\x00\SPAN_KIND_INTERNAL\x00STATUS_CODE_UNSET\". failed to build metrics: failed to find dimensions for \"**another-service**\u0000**another-service-2**\", "errorCauses": [{"error":"value not found in metricsKeyToDimensions cache by key \"**some-service**\x00some-operation\x00\SPAN_KIND_INTERNAL\x00STATUS_CODE_UNSET\"} ,{"error": "failed to build metrics: failed to find dimensions for \"**another-service**\u0000**another-service-2**\""}] }

(The ** are for emphasis; they do not appear in the original log.)

To my understanding, the processor fails while generating its metrics, though I am not sure what dimensions the traces must have for it to work properly.
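
My reading of the key format in the log is that it is simply the span's identifying fields (service, operation, span kind, status code, plus any configured dimensions) joined by a NUL byte. Here is a rough Go sketch of that understanding; the function and cache names below are made up and this is not the actual spanmetrics processor code:

// Rough, illustrative sketch of my reading of the key in the log above;
// buildMetricsKey and the map cache are hypothetical stand-ins, not the
// actual spanmetrics processor implementation.
package main

import (
	"fmt"
	"strings"
)

// The logged key looks like the identifying span fields joined by a NUL byte.
func buildMetricsKey(service, operation, spanKind, statusCode string, dims []string) string {
	parts := append([]string{service, operation, spanKind, statusCode}, dims...)
	return strings.Join(parts, "\u0000")
}

func main() {
	// Hypothetical stand-in for the metricsKeyToDimensions cache.
	cache := map[string]map[string]string{}

	key := buildMetricsKey("some-service", "some-operation", "SPAN_KIND_INTERNAL", "STATUS_CODE_UNSET", nil)

	// The dimensions would normally be cached when the span is first aggregated;
	// if the entry is missing when metrics are built (for example because the
	// span arrived late), the lookup fails with the message seen in the log.
	if _, ok := cache[key]; !ok {
		fmt.Printf("value not found in metricsKeyToDimensions cache by key %q\n", key)
	}
}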

I had similar errors with the spanmetrics and servicegraph processors, but they were fixed when I upgraded to 0.66.0.

The collector is deployed behind another collector that serves as a load balancer, as documented here. However, the error seems to occur even when the load balancer is removed.

Here is the processor configuration:

      policies:
      [
          {
            name: error,
            type: status_code,
            status_code: {status_codes: [ERROR, UNSET]}
          },
          {
            name: latency,
            type: latency,
            latency: {threshold_ms: 200}
          },
          {
            name: db-connection,
            type: string_attribute,
            string_attribute: {key: db.connection_string, values: [.*], enabled_regex_matching: true}
          },
          {
            name: filter-health-checks,
            type: string_attribute,
            string_attribute: {key: http.url, values: [\/health], enabled_regex_matching: true, invert_match: true}
          },
          {
            name: probablistic,
            type: probabilistic,
            probabilistic: {sampling_percentage: 25}
          },
      ]

Collector version

contrib 0.66.0

Environment information

Environment

OpenShift 4.10, deployed with OpenTelemetry Operator 0.66.0, Collector contrib 0.66.0

OpenTelemetry Collector configuration

receivers:
  otlp:
  otlp/dummy:
processors:
  batch:
  servicegraph:
    metrics_exporter: prometheus
    dimensions: []
  spanmetrics:
    metrics_exporter: prometheus
    dimensions: []
  tail_sampling: # described above
exporters:
  jaeger:
    endpoint: my-jaeger-collector
  otlp:
    endpoint: some-other-otlp-collector
  logging:
  prometheus:
    endpoint: "0.0.0.0:8889"
extensions:
  health_check:
service:
  extensions:
   - health_check
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - tail_sampling
        - spanmetrics
        - servicegraph
        - batch
      exporters:
        - logging
        - jaeger
        - otlp
    metrics:
      receivers: [otlp, otlp/dummy]
      exporters: [prometheus]
  telemetry:
    metrics:
    logs:

Log output

No response

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 1 year ago

This seems to be an issue returned by spanmetrics. It shows as a problem in the tail-sampling processor because that's the component logging the error.

Could you please try the latest version of the collector (v0.70.0) and report back? In the meantime, I'll label this as a problem with the spanmetrics component.
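
Roughly speaking, the tail-sampling processor forwards the spans it decides to keep to the next consumer in the pipeline and logs whatever error comes back, so a failure inside spanmetrics surfaces under the tail_sampling logger. A simplified sketch of that propagation is below; the types are made up and do not mirror the real collector interfaces:

// Simplified sketch of how the error propagates between chained processors;
// tracesConsumer, spanMetrics, and tailSampling are made-up stand-ins, not
// the real collector or processor types.
package main

import (
	"errors"
	"fmt"
)

// Stand-in for a pipeline component that consumes trace data.
type tracesConsumer interface {
	ConsumeTraces(spans []string) error
}

// Stand-in for the spanmetrics processor, which fails while building metrics.
type spanMetrics struct{}

func (spanMetrics) ConsumeTraces(spans []string) error {
	return errors.New(`failed to build metrics: failed to find dimensions for "another-service"`)
}

// Stand-in for the tail-sampling processor: it forwards sampled spans to the
// next consumer and logs any error that comes back.
type tailSampling struct {
	next tracesConsumer
}

func (t tailSampling) releaseSampledSpans(spans []string) {
	if err := t.next.ConsumeTraces(spans); err != nil {
		// The log entry carries the tail_sampling component name even though
		// the error originated downstream.
		fmt.Printf("Error sending late arrived spans to destination {\"name\": \"tail_sampling\", \"error\": %q}\n", err)
	}
}

func main() {
	p := tailSampling{next: spanMetrics{}}
	p.releaseSampledSpans([]string{"span-1", "span-2"})
}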

github-actions[bot] commented 1 year ago

Pinging code owners for processor/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.