open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[Connector/Servicegraph] Servicegraph connector is not emitting correct metrics for spans #34170

Open VijayPatil872 opened 3 months ago

VijayPatil872 commented 3 months ago

Component(s)

connector/servicegraph

What happened?

Description

I am using the servicegraph connector to generate a service graph and metrics from spans, but the metrics emitted by the connector fluctuate up and down. We have deployed a layer of Collectors running the load-balancing exporter in front of the trace Collectors that do the spanmetrics and servicegraph connector processing. The load-balancing exporter hashes the trace ID consistently to determine which Collector backend should receive the spans for that trace. The servicegraph metrics are exported to Grafana Mimir with the prometheusremotewrite exporter.
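For reference, the load-balancing layer in front of these trace Collectors routes by trace ID. The following is a minimal sketch of that loadbalancing exporter configuration, not our exact setup; the resolver hostname is a placeholder:

exporters:
  loadbalancing:
    # route all spans of a trace to the same backend Collector
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # placeholder: headless service of the trace Collector deployment
        hostname: otel-gateway-headless.observability.svc.cluster.local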

Steps to Reproduce

Expected Result

The metrics emitted by the connector should be correct (counters should only increase).

Actual Result

(Screenshot from the original issue showing the servicegraph metrics fluctuating up and down.)

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:        
  exporters:

    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: ********
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: **********
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:*****
        grpc:
          endpoint: ${env:MY_POD_IP}:*****
  service:

    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp

      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph

Log output

No response

Additional context

No response

github-actions[bot] commented 3 months ago

Pinging code owners:

VijayPatil872 commented 1 month ago

Any update on the issue?

mapno commented 1 month ago

Can you provide more information on why the metrics are incorrect? A test or test data that reproduces the behaviour would be very helpful.

VijayPatil872 commented 3 weeks ago

@mapno If we consider the traces_service_graph_request_total or traces_service_graph_request_failed_total metrics, these should be counters, but they are seen fluctuating up and down. Similarly, the calls_total metric from spanmetrics should be a counter, but its graph also goes up and down at times. Also, can you explain what kind of test or test data you need, given the configuration applied above? Let me know if additional details are required.
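To capture that kind of test data, one option we could try is tapping the emitted servicegraph metrics before remote write, for example by adding the contrib file exporter alongside the existing exporters. This is only a sketch; the exporter name and output path are placeholders:

exporters:
  file/servicegraph-capture:
    # placeholder path; writes emitted metrics as JSON for inspection
    path: /tmp/servicegraph-metrics.json

service:
  pipelines:
    metrics/servicegraph:
      receivers:
        - servicegraph
      processors:
        - batch
        - memory_limiter
      exporters:
        - debug
        - file/servicegraph-capture
        - prometheusremotewrite/mimir-default-servicegraph

The debug exporter's verbosity could also be raised to detailed to log each emitted data point, if that is more convenient.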