open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/dockerstats] not generating per container metrics #33303

Open schewara opened 4 months ago

schewara commented 4 months ago

Component(s)

exporter/prometheus, receiver/dockerstats, receiver/prometheus

What happened?

Description

We have a collector running (in Docker) which is supposed to collect per-container metrics from the docker_stats receiver, alongside Prometheus metrics scraped from other containers on the same host.

A similar issue was already reported but was closed without any real solution -> #21247

Steps to Reproduce

  1. Have a container running that exposes metrics, such as prometheus/node-exporter.
  2. Start an otel/opentelemetry-collector-contrib container (a hedged docker-compose sketch is shown after this list).
  3. Observe the /metrics endpoint of the prometheus exporter.
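
For illustration only, a minimal docker-compose sketch of step 2; the service name and config file path on the host are assumptions, not taken from this issue:

services:
  otelcol:
    image: otel/opentelemetry-collector-contrib:0.101.0
    volumes:
      # Docker socket so docker_stats and docker_sd_configs can reach the daemon
      - /var/run/docker.sock:/var/run/docker.sock
      # collector configuration, mounted over the default path used by the contrib image
      - ./config.yaml:/etc/otelcol-contrib/config.yaml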

Expected Result

Individual metrics for each container running on the same host.
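
For example (illustrative only; the exact label names depend on the configuration), something along the lines of:

container_network_io_usage_rx_errors_total{container_name="nodeexporter",interface="eth1"} 0
container_network_io_usage_rx_errors_total{container_name="my-test-container",interface="eth1"} 0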

Actual Result

Only metrics that have Data point attributes are shown, like the following, plus the metrics coming from the prometheus receiver.

container_network_io_usage_rx_errors_total{interface="eth1"} 0

Test scenarios and observations

exporter/prometheus - resource_to_telemetry_conversion - enabled

When enabling the config option, the following was observed:

I don't really know how the prometheus receiver converts the scraped metrics into an OTel object, but it looks like it creates individual metrics plus a target_info metric, containing only Data point attributes and no Resource attributes.

This would explain why the metrics disappear, since it seems that all existing metric labels are wiped and replaced with nothing.
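
For reference, the exporter option I toggled (a minimal sketch; the listen endpoint is illustrative and not taken from this issue):

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # illustrative listen address
    resource_to_telemetry_conversion:
      enabled: true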

manually setting attribute labels

Trying to set manual static attributes through the attributes processor only added a new label to the single metric, but did not produce individual container metrics.

After going through all the logs and searching through all the documentation, I discovered the Setting resource attributes as metric labels section of the prometheus exporter. When implemented (see the commented-out sections of the config), metrics from the dockerstats receiver showed up on the exporter's /metrics endpoint, but they are still missing some crucial labels, which might need to be added manually as well.

Findings

Based on all the observations from testing and trying things out, these are my takeaways on the current shortcomings of the three selected components and how they are not very well integrated with each other.

receiver/dockerstats

receiver/prometheus

exporter/prometheusexporter


Right now I am torn between manually transforming all the labels of the dockerstats receiver and creating duplicate pipelines with a duplicated exporter (sketched below), but either way there is some room for improvement to have everything working together smoothly.
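
A minimal sketch of the duplicate-pipeline variant; the pipeline and exporter names are hypothetical, and a second prometheus exporter instance would need its own listen endpoint:

exporters:
  prometheus/dockerstats:   # hypothetical second instance, e.g. with resource_to_telemetry_conversion enabled
    endpoint: "0.0.0.0:8890"
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics/dockerstats:
      receivers: [docker_stats]
      processors: [resourcedetection/docker, batch]
      exporters: [prometheus/dockerstats]
    metrics/scrape:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheus]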

Collector version

otel/opentelemetry-collector-contrib:0.101.0

Environment information

Environment

Docker

OpenTelemetry Collector configuration

receivers:
  docker_stats:
    api_version: '1.45'
    collection_interval: 10s
    container_labels_to_metric_labels:
      com.docker.compose.project: compose.project
      com.docker.compose.service: compose.service
    endpoint: "unix:///var/run/docker.sock"
    initial_delay: 1s
    metrics:
      container.restarts:
        enabled: true
      container.uptime:
        enabled: true
    timeout: 5s
  otlp:
    protocols:
      grpc: null
      http: null
  prometheus:
    config:
      global:
        scrape_interval: 30s
      scrape_configs:
      - job_name: otel-collector
        relabel_configs:
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        scrape_interval: 30s
        static_configs:
        - targets:
          - localhost:8888
      - docker_sd_configs:
        - filters:
          - name: label
            values:
            - prometheus.scrape=true
          host: unix:///var/run/docker.sock
        job_name: docker-containers
        relabel_configs:
        - action: replace
          source_labels:
          - __meta_docker_container_label_prometheus_path
          target_label: __metrics_path__
        - action: replace
          regex: /(.*)
          source_labels:
          - __meta_docker_container_name
          target_label: container_name
        - action: replace
          separator: ':'
          source_labels:
          - container_name
          - __meta_docker_container_label_prometheus_port
          target_label: __address__
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: container_id
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: service_instance_id
        - action: replace
          source_labels:
          - __meta_docker_container_label_service_namespace
          target_label: service_namespace
        - action: replace
          source_labels:
          - container_name
          target_label: service_name
        - action: replace
          source_labels:
          - __meta_docker_container_label_deployment_environment
          target_label: deployment_environment
        - action: replace
          regex: (.+/)?/?(.+)
          replacement: $${1}$${2}
          separator: /
          source_labels:
          - service_namespace
          - service_name
          target_label: job
        scrape_interval: 30s

processors:
  batch: null
  resourcedetection/docker:
    detectors:
    - env
    - docker
    override: true
    timeout: 2s   
#  transform/dockerstats:
#    metric_statements:
#      - context: datapoint
#        statements:
#          - set(attributes["container.id"], resource.attributes["container.id"])
#          - set(attributes["container.name"], resource.attributes["container.name"])
#          - set(attributes["container.hostname"], resource.attributes["container.hostname"])
#          - set(attributes["host.name"], resource.attributes["host.name"])
#          - set(attributes["compose.project"], resource.attributes["compose.project"])
#          - set(attributes["compose.service"], resource.attributes["compose.service"])
#          - set(attributes["deployment.environment"], resource.attributes["deployment.environment"])
#          - set(attributes["service.namespace"], resource.attributes["service.namespace"])

service:
  pipelines:
    metrics:
      exporters:
      - prometheus
      - logging
      processors:
      # - transform/dockerstats
      - resourcedetection/docker
      - batch
      receivers:
      - otlp
      - docker_stats
      - prometheus

Log output

Some snippets from individual metric log entries:

`nodeexporter` metric through `receiver/prometheus`, which contains Data point attributes but no Resource attributes:

Metric #9
Descriptor:
     -> Name: node_disk_flush_requests_total
     -> Description: The total number of flush requests completed successfully
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(sr0)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(vda)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 6907671.000000

receiver/dockerstats metric with a Data point attribute, but no Resource attributes:

Descriptor:
     -> Name: container.network.io.usage.rx_bytes
     -> Description: Bytes received by the container.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> interface: Str(eth0)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 425806
NumberDataPoints #1
Data point attributes:
     -> interface: Str(eth1)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 1631176

receiver/dockerstats metric with no Data point attribute, but Resource attributes:

Metric #18
Descriptor:
     -> Name: container.uptime
     -> Description: Time elapsed since container start time.
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 5070.807996
ResourceMetrics #5
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> container.runtime: Str(docker)
     -> container.hostname: Str(my-hostname)
     -> container.id: Str(cacdf88cadd7d8691efefbd0f0c49d256718830b89a3e47f6b65e8e7378534e6f)
     -> container.image.name: Str(my/test-container:0.0.9)
     -> container.name: Str(my-test-container)
     -> host.name: Str(my-hostname)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/dockerstatsreceiver 0.101.0


Additional context

_No response_
github-actions[bot] commented 4 months ago

Pinging code owners:

jamesmoessis commented 4 months ago

I'm not sure I fully understand your issue, but it seemingly has nothing to do with the docker stats receiver. It just seems like the prometheus exporter isn't exporting what you expect, or there is some misunderstanding about what it makes available.

I think this would be more helpful if you identified one component that wasn't operating as expected. The docker stats receiver and the prometheus exporter have nothing to do with each other.

If your problem is that the docker stats receiver isn't reporting a metric that it should be, then it's a problem with the docker stats receiver. If the prom exporter isn't doing what you think it should, that's a problem with the prom exporter (or a misconfiguration).

From what I can see, it seems that the docker stats receiver is producing all of the information it should, and then the prom exporter is stripping some of the information that you expect. You can verify this by replacing the prom exporter with the debug exporter and looking at the output straight in stdout. If it's what you expect, then you can narrow down the issue to the prom exporter.
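
A minimal sketch of that swap, assuming the rest of your pipeline stays the same:

exporters:
  debug:
    verbosity: detailed   # prints full resource, scope, and data point attributes to stdout

service:
  pipelines:
    metrics:
      receivers: [docker_stats]
      processors: [batch]
      exporters: [debug]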

Mendes11 commented 3 months ago

I'm experiencing the same issue, but my exporter is awsemf.

Using the debug exporter, I can see the labels in the Resource attributes, but it seems awsemf is only using the Data point attributes when sending the metrics and ignores what's in the Resource attributes?

 ResourceMetrics #4
 Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
 Resource attributes:
      -> container.runtime: Str(docker)
      -> container.hostname: Str(c3e61d730cb6)
      -> container.id: Str(c3e61d730cb6c5936b5862844d6e4acf60a880821610a7af9f9a689cffb966db)
      -> container.image.name: Str(couchdb:2.3.1@sha256:5c83dab4f1994ee4bb9529e9b1d282406054a1f4ad957d80df9e1624bdfb35d7)
      -> container.name: Str(swarmpit_db.1.usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_service: Str(swarmpit_db)
      -> swarm_container_id: Str(usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_namespace: Str(swarmpit)
 ScopeMetrics #0
 ScopeMetrics SchemaURL:
 InstrumentationScope otelcol/dockerstatsreceiver 1.0.0
 Metric #0
 Descriptor:
      -> Name: container.blockio.io_service_bytes_recursive
      -> Description: Number of bytes transferred to/from the disk by the group and descendant groups.
      -> Unit: By
      -> DataType: Sum
      -> IsMonotonic: true
      -> AggregationTemporality: Cumulative
 NumberDataPoints #0
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(read)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4366336
 NumberDataPoints #1
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(write)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4096
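
Looking at the awsemf exporter docs, there appears to be a resource_to_telemetry_conversion option that is supposed to copy resource attributes onto the data points before export; a minimal sketch (the namespace value is illustrative, not from my setup):

exporters:
  awsemf:
    namespace: ContainerMetrics   # illustrative
    resource_to_telemetry_conversion:
      enabled: true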
github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.