open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Add more detailed error message when `initPrometheusComponent` failed #33828

Open chenlujjj opened 2 days ago

chenlujjj commented 2 days ago

Component(s)

receiver/prometheus

Describe the issue you're reporting

We encountered the following problem when using the Prometheus receiver to scrape metrics from pods: [screenshot of the error output]

The error message doesn't give enough information about why it failed. The failure actually comes from the Prometheus discovery library, which doesn't expose the low-level error explaining why `Register` fails. Are there any ways to improve the error message to make debugging easier?

github-actions[bot] commented 2 days ago

Pinging code owners:

crobert-1 commented 2 days ago

It looks like the error itself is the same one reported in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32123. @chenlujjj, can you share which versions of the collector and Prometheus you're using?

It would be good to confirm whether you're running a version from before the underlying issue was fixed, in addition to adding more detailed error messaging.

chenlujjj commented 2 days ago

Hi @crobert-1, we are using splunk-otel-collector v0.97.0, and the Prometheus library it depends on is github.com/prometheus/prometheus v0.50.1.

The receiver configuration is:

receiver_creator/application:
    receivers:
      prometheus_simple/app_pods:
        rule: type == "port" && pod.annotations["prometheus.io/scrape"] == "true" && ( string(port) == pod.annotations["prometheus.io/port"] || name == pod.annotations["prometheus.io/port"] )
        config:
          endpoint: "`endpoint`"
          metrics_path: '`"prometheus.io/path" in pod.annotations ? pod.annotations["prometheus.io/path"] : "/metrics"`'
          collection_interval: '`"prometheus.io/collection_interval" in pod.annotations ? pod.annotations["prometheus.io/collection_interval"] : "30s"`'
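
To make the discovery logic in this configuration explicit, here is a plain-Go restatement of what the `rule` and the backtick expressions select. It is a sketch for reading the config only. receiver_creator evaluates these as expressions at runtime rather than with Go code like this, and `podPort`, `matchesRule` and `annotationOr` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
)

// podPort is a hypothetical, simplified view of a discovered endpoint whose
// type is "port" (the `type == "port"` part of the rule is implied by the type).
type podPort struct {
	Port        uint16
	Name        string
	Annotations map[string]string
}

// matchesRule mirrors the rule: scrape must be "true" and the port annotation
// must equal either the port number or the port name. A missing port
// annotation never matches.
func matchesRule(p podPort) bool {
	if p.Annotations["prometheus.io/scrape"] != "true" {
		return false
	}
	want, ok := p.Annotations["prometheus.io/port"]
	if !ok {
		return false
	}
	return strconv.Itoa(int(p.Port)) == want || p.Name == want
}

// annotationOr mirrors `"key" in pod.annotations ? pod.annotations["key"] : def`.
func annotationOr(ann map[string]string, key, def string) string {
	if v, ok := ann[key]; ok {
		return v
	}
	return def
}

func main() {
	p := podPort{Port: 8080, Name: "http-metrics", Annotations: map[string]string{
		"prometheus.io/scrape": "true",
		"prometheus.io/port":   "http-metrics",
	}}
	fmt.Println(matchesRule(p))
	fmt.Println(annotationOr(p.Annotations, "prometheus.io/path", "/metrics"))
	fmt.Println(annotationOr(p.Annotations, "prometheus.io/collection_interval", "30s"))
}
```

For the example pod, the port is matched by name, and the path and interval fall back to `/metrics` and `30s` because those annotations are absent.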

chenlujjj commented 2 days ago

After restarting the collector process, the issue went away temporarily.