open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/prometheus] honor_labels set to true and scraping a prometheus pushgateway not working #33742

Closed: paebersold-tyro closed this issue 1 day ago

paebersold-tyro commented 4 days ago

Component(s)

receiver/prometheus

What happened?

Description

Scraping a Prometheus pushgateway with honor_labels: true results in a scrape endpoint failure. I suspect this is because the scraped metrics carry both instance and job labels (see https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/15239), but I would like confirmation that this is the problem. Is there any workaround other than setting honor_labels: false? I attempted dropping the labels with metric_relabel_configs, but that did not work.
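For reference, a minimal sketch of the kind of metric_relabel_configs label drop that was attempted (the exact rules are assumed here, not the original config):

            metric_relabel_configs:
            # assumed sketch: drop the (empty) instance and job labels exposed by
            # the pushgateway so only the receiver's own target labels remain
            - regex: instance
              action: labeldrop
            - regex: job
              action: labeldrop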

Steps to Reproduce

Prometheus receiver config

          - job_name: test-pushgateway
            scrape_interval: 30s
            scrape_timeout: 10s
            honor_labels: true
            scheme: http
            kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                - app-platform-monitoring
            relabel_configs:
            # and pod is running
            - source_labels: [__meta_kubernetes_pod_phase]
              regex: Running
              action: keep
            # and pod is ready
            - source_labels: [__meta_kubernetes_pod_ready]
              regex: true
              action: keep
            # and only metrics endpoints
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics

Expected Result

Endpoint is scraped; the job and instance labels from the pushgateway are used.

Actual Result

Endpoint scrape failure (see log message below)

Collector version

0.102.0

Environment information

Environment

OS: Kubernetes 1.29

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: test-pushgateway
          scrape_interval: 30s
          scrape_timeout: 10s
          honor_labels: true
          scheme: http
          kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
              - app-platform-monitoring
          relabel_configs:
          # and pod is running
          - source_labels: [__meta_kubernetes_pod_phase]
            regex: Running
            action: keep
          # and pod is ready
          - source_labels: [__meta_kubernetes_pod_ready]
            regex: true
            action: keep
          # and only metrics endpoints
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
exporters:
  debug: {}
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      exporters: [debug]

Log output

2024-06-24T06:20:36.193Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1719210036190, "target_labels": "{__name__=\"up\", instance=\"10.18.67.171:9091\", job=\"test-pushgateway\"}"}

Additional context

Sample of the metrics returned from the pushgateway:

app_platform_attestation{feature="coredns",instance="",job="cluster",team="bob",test="TestCoreDNSNameResolution"} 1
app_platform_attestation{feature="coredns",instance="",job="cluster",team="bob",test="TestIsCoreDNSDeployed"} 1
app_platform_attestation{feature="coredns",instance="",job="cluster",team="bob",test="TestIsCoreDNSServiceAvailable"} 1
push_failure_time_seconds{feature="coredns",instance="",job="cluster"} 0
push_time_seconds{feature="coredns",instance="",job="cluster"} 1.7192055849949868e+09
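For comparison, the only known workaround mentioned in the description, honor_labels: false (the Prometheus default), keeps the receiver's own target labels and renames the conflicting pushed labels to exported_job and exported_instance instead of failing the scrape. A minimal sketch (the static target is hypothetical, for illustration only):

          - job_name: test-pushgateway
            scrape_interval: 30s
            honor_labels: false  # default; pushed job/instance become exported_job/exported_instance
            scheme: http
            static_configs:
            - targets: ["pushgateway.app-platform-monitoring.svc:9091"]  # hypothetical target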
github-actions[bot] commented 4 days ago

Pinging code owners:

dashpole commented 3 days ago

Can you set the log level of the collector to debug to see the detailed error message for why the scrape failed?

dashpole commented 3 days ago

I think it should be:

service:
    telemetry:
        logs:
            level: DEBUG
paebersold-tyro commented 2 days ago

Hello, here is the debug log output (it seems the empty instance label may be the issue, as suspected):

2024-06-27T01:40:49.045Z    debug   scrape/scrape.go:1650   Unexpected error    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "test-pushgateway", "target": "http://10.18.67.95:9091/metrics", "series": "app_platform_attestation{feature=\"coredns\",instance=\"\",job=\"cluster\",team=\"bob\",test=\"TestCoreDNSNameResolution\"}", "error": "job or instance cannot be found from labels"}
2024-06-27T01:40:49.045Z    debug   scrape/scrape.go:1346   Append failed   {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "test-pushgateway", "target": "http://10.18.67.95:9091/metrics", "error": "job or instance cannot be found from labels"}
2024-06-27T01:40:49.045Z    warn    internal/transaction.go:125 Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1719452449041, "target_labels": "{__name__=\"up\", instance=\"10.18.67.95:9091\", job=\"test-pushgateway\"}"}
dashpole commented 1 day ago

This should've been fixed by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33565. Can you try upgrading to v0.103.0?

paebersold-tyro commented 1 day ago

Thank you for that, 0.103.0 fixed the issue.