[connector/spanmetricsconnector] Generated counter drops then disappears

duc12597 commented 3 months ago

Component(s)

connector/spanmetrics

What happened?

Description

Our collector receives OTLP traces from Kafka, convert them into metrics and export to a TSDB. After a certain period of collector uptime (24-48 hours), the generated calls_total counter suffers a significant drop in value. Eventually no more metrics are exported. spanmetrics

Steps to Reproduce

Follow the below collector configuration.

Expected Result

The calls_total counter is ever increasing.

Actual Result

The calls_total counter drops then disappears.

Collector version

v0.101.0

Environment information

Environment

AWS EKS 1.24

OpenTelemetry Collector configuration

extensions:
  sigv4auth:
    region: ap-southeast-1
    service: "aps"
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 03-sink-metric-prometheus
          scrape_interval: 10s
          static_configs:
            - targets: ['127.0.0.1:8888']
  kafka/traces:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_spans
    group_id: 03-sink-metric-prometheus
  kafka/metrics:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_metrics
    group_id: 03-sink-metric-prometheus
processors:
  filter:
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
  transform:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # reduce the cardinality of metrics with params
          - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.target
      - name: http.status_code
      - name: host.name
      - name: myCustomLabel
    exclude_dimensions:
      - span.kind
      - span.name
      - status.code
    exemplars:
      enabled: true
    metrics_flush_interval: 15s
exporters:
  debug:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.ap-southeast-1.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write
    auth:
      authenticator: sigv4auth
    external_labels:
      cluster_name: my-cluster
      collector: 03-sink-metric-prometheus
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s
    send_metadata: true
    max_batch_size_bytes: 3000000
service:
  telemetry:
    metrics:
      address: 127.0.0.1:8888
      level: detailed
  extensions:
    - sigv4auth
  pipelines:
    traces:
      receivers:
        - kafka/traces
      processors: []
      exporters:
        - spanmetrics
    metrics:
      receivers:
        - kafka/metrics
        - prometheus
        - spanmetrics
      processors:
        - filter
        - transform
      exporters:
        - debug
        - prometheusremotewrite

Log output

2024-06-07T01:38:51.776Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", 
"error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 58510}   
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/consumers.go:43

Additional context

Our application uses the HyperTrace Java agent to send telemetry data to Kafka in OTLP format
The problem persists across different TSDBs (AWS Prometheus, self-hosted Prometheus, Mimir) and different number of collector replicas (1, 3)

github-actions[bot] commented 3 months ago

Pinging code owners:

connector/spanmetrics: @portertech @Frapschen

See Adding Labels via Comments if you do not have permissions to add labels yourself.

ankitpatel96 commented 3 months ago

I have a few questions that might help us track down this issue: Is there any chance your collector is restarting at these points? Are you running just one collector or many in a gateway mode?

duc12597 commented 3 months ago

I'm running the collector as a deployment, and have tried both 1 and 3 replicas. The collector did not restart, I had to terminate the pods to keep exporting the metrics

ankitpatel96 commented 3 months ago

I see... honestly at this point I don't quite know what would cause it to eventually stop emitting metrics at all - that's the symptom that is really throwing me for a loop.

Are you still having these problems? Can you try increasing resource_metrics_cache_size? The thought is that this might prevent evictions which might prevent the resets.

Other things that might help us track down this problem - what is the count of the unique series within count_total over time? Are there resets happening for a series that the TSDB has already gotten or are there entirely new series?

duc12597 commented 3 months ago

This is the count(calls_total) at approximately the time the counter decreases

duc12597 commented 3 months ago

Further observation shows that out of 3 metrics receivers in my collector configuration, kafka/metrics & prometheus worked fine:

Only metrics from spanmetrics failed:

ankitpatel96 commented 3 months ago

thanks for your update. Did you try changing the cache size? I'm honestly a little stumped - any ideas @portertech @Frapschen ?

swar8080 commented 3 months ago

With the current config the connector will permanently cache every series it sees and send them all during each flush, even the ones where nothing's changed

So eventually the payload flushed to prometheusremotewrite gets so large that the remote write request times out (i.e. context deadline exceeded is a timeout) and likely the request gets rejected by the remote write target because of the size

Permanent error: context deadline exceeded"}], "dropped_items": 58510}

Possible things that could help are:

Setting metrics_expiration on the connector so that infrequently updated span metrics are removed. Then you have to deal with prometheus counter resets
Breaking up the remote write requests into smaller batches, possibly using batch processor and/or prometheusremote's built-in config
Switching to scraping the span metrics using prometheus since it's optimized for a large number of series

duc12597 commented 2 months ago

I set metrics_expiration: 30m, the metrics still disappeared altogther. It returned after ~6 hours, but somehow the collectors did not restart.

Frapschen commented 2 months ago

@duc12597 Have your try to switch push model to pull?. replace your prometheusremotewrite to prometheusexporter.

duc12597 commented 2 months ago

@duc12597 Have your try to switch push model to pull?. replace your prometheusremotewrite to prometheusexporter.

We will consider this option. As of now the collector has been running for 2 weeks without any errors, although there are still counter fluctuations. I'm not sure if it's thanks to any changes on our side. I will close this issue for now and will re-open in the future if this problem resurface.

This is my complete collector manifest:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: 03-sink-metric-prometheus
spec:
  image: mirror.gcr.io/otel/opentelemetry-collector-contrib:0.102.0
  replicas: 5
  nodeSelector:
    mycompany.com/service: observability
    kubernetes.io/arch: amd64
  tolerations:
    - effect: NoSchedule
      key: mycompany.com/service
      value: observability
      operator: Equal
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 03-sink-metric-prometheus
              scrape_interval: 10s
              static_configs:
                - targets: ['127.0.0.1:8888']
      kafka/traces:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_spans
        group_id: 03-sink-metric-prometheus
      kafka/metrics:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_metrics
        group_id: 03-sink-metric-prometheus
    processors:
      filter:
        error_mode: ignore
        metrics:
          datapoint:
            - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
      transform:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              # reduce the cardinality of metrics with params
              - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
    connectors:
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.target
          - name: http.status_code
          - name: host.name
          - name: myCustomLabel
        exclude_dimensions:
          - span.kind
          - span.name
          - status.code
        exemplars:
          enabled: true
        metrics_flush_interval: 15s
        metrics_expiration: 1h
        resource_metrics_key_attributes:
          - service.name
          - telemetry.sdk.language
          - telemetry.sdk.name
        resource_metrics_cache_size: 10000
    exporters:
      debug:
      prometheusremotewrite:
        endpoint: http://mimir-nginx/api/v1/push
        send_metadata: true
    service:
      telemetry:
        metrics:
          address: 127.0.0.1:8888
          level: detailed
      extensions:
        - sigv4auth
      pipelines:
        traces:
          receivers:
            - kafka/traces
          processors: []
          exporters:
            - spanmetrics
        metrics:
          receivers:
            - kafka/metrics
            - prometheus
            - spanmetrics
          processors:
            - filter
            - transform
          exporters:
            - debug
            - prometheusremotewrite
  env:
    - name: GOMEMLIMIT
      value: 1640MiB # 80% of resources.limits.memory
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 2Gi

Frapschen commented 1 month ago

@duc12597 sorry for pinging you, there is a related issue for counter fluctuation, please see https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34126#issuecomment-2270690510 to fix it.

duc12597 commented 1 month ago

@duc12597 sorry for pinging you, there is a related issue for counter fluctuation, please see #34126 (comment) to fix it.

If I understand correctly, this will add a UUID as a label for every metric generated by each collector pod. Will this explode the cardinality? Why does a UUID solve the fluctuation? Can you give an example config?

Thanks a ton.

open-telemetry / opentelemetry-collector-contrib