open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

telemetry receiver prometheus down after some minutes #30835

Closed carlosmt86-hub closed 8 months ago

carlosmt86-hub commented 8 months ago

Component(s)

No response

What happened?

Description

After some minutes (30-45), telemetry metrics stop working. Running curl -v 127.0.0.1:8888/metrics shows this error: collected metric "otelcol_exporter_queue_size" { label:{name:"exporter" value:"datadog"} label:{name:"service_instance_id" value:"a2cdae11-0e8f-4b05-a31a-9de31429d8a7"} label:{name:"service_name" value:"otelcol-contrib"} label:{name:"service_version" value:"0.93.0"} gauge:{value:0}} was collected before with the same name and label values

In the logs I see: warn internal/transaction.go:123 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}
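
A quick way to confirm the symptom from the shell (a minimal sketch, assuming the 127.0.0.1:8888 telemetry endpoint from the config below): once the duplicate otelcol_exporter_queue_size series is registered, the /metrics handler serves an error body instead of metrics, so the error text can simply be grepped for.

$ curl -s http://127.0.0.1:8888/metrics \
    | grep -c 'was collected before with the same name and label values'
# prints 0 while the endpoint is healthy; 1 or more once the duplicate series appears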

Steps to Reproduce

Enable the Collector's internal telemetry and scrape it with the Prometheus receiver; after some minutes the scrape stops working (see the sketch below for one way to watch for the failure).
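
One rough way to watch for the failure (not part of the original report; it assumes the 127.0.0.1:8888 telemetry address used above and that the handler returns an HTTP error status once the duplicate metric is registered, as the error output above suggests) is to poll the endpoint once a minute:

$ while true; do
    if curl -sf http://127.0.0.1:8888/metrics > /dev/null; then
      echo "$(date) metrics endpoint OK"
    else
      echo "$(date) metrics endpoint FAILED:"
      curl -s http://127.0.0.1:8888/metrics | head -n 3   # print the error body
    fi
    sleep 60
  done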

Expected Result

Telemetry metrics continue working.

Actual Result

Telemetry metrics stop working after some minutes.

Collector version

v0.93.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 60s
          static_configs:
            - targets: ['127.0.0.1:8888']

processors:
  batch:
    timeout: 1s
  memory_limiter:
    check_interval: 1s
    limit_mib: 200

service:
  pipelines:
    metrics/prometheus:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [datadog]

  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

warn    internal/transaction.go:123     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1706557569579, "target_labels": "{__name__=\"up\", instance=\"127.0.0.1:8888\", job=\"otel-collector\"}"}

Additional context

No response

neeej commented 8 months ago

I can confirm I also see this issue on version 0.93.0. I don't scrape the metrics the same way in otelc though, as I use telegraf for that. My metrics stop after just a couple of minutes, tops.

This worked fine on version 0.91.0. (0.92.0 doesn't work with AWS ALB due to a bug in grpc-go, so I'm not sure whether that version was affected or not.)

I get the same error from the metrics endpoint:

$ curl http://localhost:8888/metrics
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter"  value:"loadbalancing"}  label:{name:"service_instance_id"  value:"5fa99331-b105-4dc7-a62d-b850fa048393"}  label:{name:"service_name"  value:"otelcol-contrib"}  label:{name:"service_version"  value:"0.93.0"}  gauge:{value:0}} was collected before with the same name and label values

I run this on AL2023 (an AWS EC2 arm instance) with this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]

I do, however, have other collectors running that the collectors above send to (for testing out sampling). They run in a similar way, also on version 0.93.0, and their metrics still work fine. They use this config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  file/otlpdebug:
    path: /tmp/otlpdebug
    rotation:
      max_megabytes: 10
      max_days: 2
      max_backups: 4
      localtime: true
  otlp/apmserver:
    endpoint: "https://a.apm.etc:8200"
    retry_on_failure:
      max_elapsed_time: 1000s
    sending_queue:
      queue_size: 5000
    timeout: 10s
    headers:
      Authorization: "Bearer NA"

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_size: 5120
    send_batch_max_size: 5120
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
    expected_new_traces_per_sec: 1000
    policies: [
      {
        name: errors-policy,
        type: status_code,
        status_code: { status_codes: [ERROR] }
      },
      {
        name: latency-policy,
        type: latency,
        latency: { threshold_ms: 500 }
      },
      {
        name: randomized-policy,
        type: probabilistic,
        probabilistic: { sampling_percentage: 10 }
      },
      {
        # Always sample if the force_sample attribute is set to true
        name: force-sample-policy,
        type: boolean_attribute,
        boolean_attribute: { key: force_sample, value: true }
      },
    ]

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [file/otlpdebug, otlp/apmserver]
      processors: [memory_limiter, tail_sampling, batch]
    metrics:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [otlp/apmserver]
      processors: [memory_limiter, batch]

juissi-t commented 8 months ago

@carlosmt86-hub Which PR was this issue fixed with? I'm seeing a similar problem with version 0.92.