
[prometheusexporter] setting resource_to_telemetry_conversion enabled config to true causes duplicate label names error on scraping attempt #10374

Closed tim-mwangi closed 1 year ago

tim-mwangi commented 2 years ago

Describe the bug
Since https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/9115, the prometheus exporter adds job and instance labels derived from the service name (and namespace, if present) and the service instance id resource attributes. Setting the resource_to_telemetry_conversion enabled config to true also copies any already existing job and instance resource attributes into the metric data point attributes, so when the code from that PR runs the exporter ends up with duplicate label names.

# prometheus exporter config
prometheus:
  endpoint: "0.0.0.0:8889"
  resource_to_telemetry_conversion:
    enabled: true

When I stepped through with a debugger, the duplicates were on the job label in my test scenario. When I set the resource_to_telemetry_conversion enabled config to false there are no errors, but of course the resource attributes are no longer copied over to the metric data point attributes. Here's an example of the metrics coming into the collector:

2022-05-27T07:24:28.308-0700    DEBUG   loggingexporter/logging_exporter.go:64  ResourceMetrics #0
Resource SchemaURL: 
Resource labels:
     -> service.name: STRING(my-collector)
     -> job: STRING(my-collector)
     -> instance: STRING(0.0.0.0:8888)
     -> port: STRING(8888)
     -> scheme: STRING(http)
InstrumentationLibraryMetrics #0
InstrumentationLibraryMetrics SchemaURL: 
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: otelcol_exporter_enqueue_failed_spans
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
NumberDataPoints #0
Data point attributes:
     -> exporter: STRING(logging)
     -> service_instance_id: STRING(355de792-ba24-4e05-98ac-699455b25ac4)
     -> service_version: STRING(latest)
StartTimestamp: 2022-05-27 14:24:28.154 +0000 UTC
Timestamp: 2022-05-27 14:24:28.154 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> exporter: STRING(otlp)
     -> service_instance_id: STRING(355de792-ba24-4e05-98ac-699455b25ac4)
     -> service_version: STRING(latest)
StartTimestamp: 2022-05-27 14:24:28.154 +0000 UTC
Timestamp: 2022-05-27 14:24:28.154 +0000 UTC
Value: 0.000000
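
For illustration only (a sketch, not actual exporter output): once resource_to_telemetry_conversion copies the resource attributes above onto the data point, the exporter also derives a job label from service.name, so the label set it tries to register looks roughly like

otelcol_exporter_enqueue_failed_spans{exporter="logging", job="my-collector", instance="0.0.0.0:8888", port="8888", scheme="http", service_name="my-collector", job="my-collector", ...} 0

with job appearing twice, which the Prometheus client library rejects as duplicate label names.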

Steps to reproduce
Our setup chains two OpenTelemetry Collectors, call them A and B, where A sends both traces and metrics to B via the otlp exporter. Collector A has the prometheus receiver enabled, uses it to scrape the Collector's own internal metrics, puts them on its pipeline, and eventually exports them to B via otlp. Collector B has the prometheus exporter enabled with the resource_to_telemetry_conversion enabled config set to true.

# prometheus receiver config in collector A to collect own metrics
prometheus:
  config:
    scrape_configs:
      - job_name: "my-collector"
        scrape_interval: 10s
        static_configs:
          - targets: ["0.0.0.0:8888"]
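
For completeness, collector A then forwards these scraped metrics to collector B over otlp. A minimal sketch of that side (the collector B endpoint and tls settings below are placeholders):

# otlp exporter sketch for collector A forwarding to collector B
# (endpoint and tls settings are placeholders)
exporters:
  otlp:
    endpoint: "collector-b:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]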

These metrics arrive with job and instance already set as resource attributes. This is our setup, but it is easy to replicate by sending the exporter any metrics that carry these resource attributes and setting the resource_to_telemetry_conversion enabled config to true.
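
One way to avoid the collision in a setup like this (a possible workaround sketch, not verified against v0.49.0) is to drop the conflicting resource attributes in collector B before they reach the prometheus exporter, for example with the resource processor:

# possible workaround sketch for collector B: delete the conflicting
# resource attributes before the prometheus exporter sees them
processors:
  resource:
    attributes:
      - key: job
        action: delete
      - key: instance
        action: delete

The resource processor would then need to be added to the metrics pipeline, e.g. processors: [resource, batch].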

What did you expect to see?
Metrics to be successfully scraped by the Prometheus server and no error logs.

What did you see instead?
Metrics were not scraped, and these errors were logged continuously:

2022-05-25T21:05:46.779Z  error prometheusexporter@v0.49.0/collector.go:243 failed to convert metric otelcol_process_memory_rss: duplicate label names  {"kind": "exporter", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*collector).Collect
  /go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.49.0/collector.go:243
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1
  /go/pkg/mod/github.com/prometheus/client_golang@v1.12.1/prometheus/registry.go:448
2022-05-25T21:05:46.779Z  error prometheusexporter@v0.49.0/collector.go:243 failed to convert metric otelcol_process_runtime_heap_alloc_bytes: duplicate label names  {"kind": "exporter", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*collector).Collect
  /go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.49.0/collector.go:243
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1
  /go/pkg/mod/github.com/prometheus/client_golang@v1.12.1/prometheus/registry.go:448

What version did you use?
Version: v0.49.0
OpenTelemetry Collector version: v0.49.0
Prometheus server version: v2.36.0

What config did you use?
OpenTelemetry Collector config:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:55681"
  opencensus:
    endpoint: "0.0.0.0:55678"
  zipkin:
    endpoint: "0.0.0.0:9411"
  jaeger:
    protocols:
      grpc:
        endpoint: "0.0.0.0:14250"
      thrift_http:
        endpoint: "0.0.0.0:14268"
processors:
  batch: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # For converting resource attributes to metric labels
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: error
    sampling_initial: 20
    sampling_thereafter: 1

service:
  telemetry:
    logs:
      level: "INFO"
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp, opencensus, jaeger, zipkin]
      processors: [batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Prometheus server config:

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:8889"]

Environment
OS: Mac OS X 12.3.1 (Intel), and a gcr.io/distroless/base based image as well.
Compiler (if manually compiled): go 1.17 was used to compile our OpenTelemetry Collector.


dmitryax commented 2 years ago

cc @Aneurysm9 as code owner

github-actions[bot] commented 2 years ago

Pinging code owners: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Aneurysm9 commented 1 year ago

See also #14900.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.
