open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.11k stars 2.39k forks

[connector/servicegraph] Samples have been rejected because of the same timestamp but a different value (err-mimir-sample-duplicate-timestamp) #34169

Open VijayPatil872 opened 4 months ago

VijayPatil872 commented 4 months ago

Component(s)

connector/servicegraph

What happened?

Description

We are currently facing an issue with the OpenTelemetry Collector's servicegraph connector: some samples are rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp) when the metrics are written to Mimir.

We use the servicegraph connector to build a service graph. We have deployed a layer of Collectors running the load-balancing exporter in front of the trace Collectors that do the spanmetrics and servicegraph connector processing. The load-balancing exporter hashes the trace ID consistently to determine which Collector backend should receive the spans for that trace. The servicegraph metrics are exported to Grafana Mimir with the prometheusremotewrite exporter. The Mimir distributor fails to ingest some of the metrics and reports the following error:

ts=2024-07-19T07:26:46.442694833Z caller=push.go:171 level=error user=default-processor-servicegraph msg="push error" err="failed pushing to ingester mimir-ingester-zone-a-2: user=default-processor-servicegraph: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-07-19T07:26:46.23Z and is from series traces_service_graph_request_client_seconds_bucket{client=\"claims-service\", connection_type=\"virtual_node\", failed=\"false\", le=\"0.1\", server=\"xxxxx.redis.cache.windows.net\"}"
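For reference, the routing layer described above can be sketched with the loadbalancing exporter's trace-ID routing. This is a minimal sketch, not the reporter's actual config; the resolver service name is a placeholder:

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID"   # hash by trace ID so all spans of a trace reach the same backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        # placeholder: headless service in front of the trace Collectors
        service: otel-collector-traces.observability
```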

Could someone please help eliminate this error?

Steps to Reproduce

Expected Result

The metrics failure should be zero.

Actual Result

We see metrics failing because of the above error on the OpenTelemetry dashboard, as shown in the attached screenshot.

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:        
  exporters:

    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
  service:

    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp

      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph

Log output

No response

Additional context

No response

github-actions[bot] commented 4 months ago

Pinging code owners:

mapno commented 4 months ago

Hi @VijayPatil872. Yes, this is an unfortunate consequence of horizontally scaling the connector. A workaround is to add a label to the metrics that corresponds to the Collector pod name, or something else unique, so that the series are distinct across instances.
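One way to realize this workaround (a sketch under assumptions, not tested against this setup) is to stamp each Collector's own pod name onto the generated metrics with the resource processor, using an environment variable injected via the Kubernetes Downward API. Since `resource_to_telemetry_conversion` is already enabled in the prometheusremotewrite exporters, the attribute would then surface as a metric label. The label name `collector.instance` and the env var `MY_POD_NAME` are hypothetical:

```yaml
processors:
  resource/collector-instance:
    attributes:
      - key: collector.instance      # hypothetical label name
        value: ${env:MY_POD_NAME}    # injected via the Downward API (fieldRef: metadata.name)
        action: upsert
```

The processor would be added to the `metrics/spanmetrics` and `metrics/servicegraph` pipelines so that each Collector instance emits distinct series.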

VijayPatil872 commented 4 months ago

Hi @mapno, could you please elaborate on how to add a label to the metrics that corresponds to the collector pod name, or something else unique that makes the series unique across instances?

mapno commented 4 months ago

Hi @VijayPatil872. I believe something like the k8sattributesprocessor should work for that. With it, you can add a label like k8s.pod.name to your metrics and make the series unique between instances.

VijayPatil872 commented 4 months ago

Hi @mapno, I tried the workaround with the k8sattributesprocessor. The labels mentioned in the configuration appear in the otel logs, but the issue still persists; it has not worked for me.

mapno commented 4 months ago

Do the metrics now have k8s.pod.name as a label, and do you still get the same errors?

VijayPatil872 commented 4 months ago

Hi @mapno, the k8sattributesprocessor was added with the following configuration:

k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  extract:
    metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
  pod_association:
    - sources:
        - from: resource_attribute
          name: k8s.namespace.name
        - from: resource_attribute
          name: k8s.pod.name
The OpenTelemetry Collector logs show that the labels are being added wherever available, but the issue still persists.

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

MayurCXone commented 1 month ago

not stale.