[receiver/dockerstats] not generating per container metrics

schewara commented 5 months ago

Component(s)

exporter/prometheus, receiver/dockerstats, receiver/prometheus

What happened?

Description

We have a collector running (in docker), which is supposed to collect

docker container stats through receiver/dockerstats
app metrics, through receiver/prometheus and docker_sd_config
export the collected metrics trough exporter/prometheusexporter

A similar issue was already reported but was closed without any real solution -> #21247

Steps to Reproduce

have a container running which exposes metrics, like prometheus/node-exporter
start a otel/opentelemetry-collector-contrib container
observe the /metrics endpoint of the prometheus exporter

Expected Result

Individual metrics for each container running on the same host.

Actual Result

Only metrics which have Data point attributes are shown like the following, plus the metrics coming from the prometheus receiver.

container_network_io_usage_rx_errors_total{interface="eth1"} 0

Test scenarios and observations

`exporter/prometheus` - `resource_to_telemetry_conversion` - enabled

When enabling the config options, the following was observed

receiver/dockerstats metrics are available as expected
receiver/prometheus metrics are gone

I don't really know how the prometheus receiver converts the scraped metrics into an otel object, but it looks like that it creates individual metrics + a target_info metric only containing Data point attributes but no Resource attributes.

This would explain, that the metrics disappear, as from what it seems, all existing metric labels are wiped and replaced with nothing.

manually setting attribute labels

Trying to set manual static attributes through the attributes processor only added a new label, to the single metrics, but did not produce individual container metrics

After going through all the logs and searching through all the documentation I discovered the Setting resource attributes as metric labels section from the prometheus exporter, when implemented (see the commented out sections of the config), metrics from the dockerstats receiver showed up on the exporters /metrics endpoint, but are still missing some crucial labels, which might need to be added manually as well.

Findings

Based on all the observations during testing and trying things out, these are my takeaways for the current shortcomings of the 3 selected components and how they are not very good integrated with each other.

receiver/dockerstats

The received data from docker should be properly set up as either resource or datapoint attribute
The config settings or maybe just the documentation for the container_labels_to_metric_labels and env_vars_to_metric_labels settings is incorrect, as they are not added as a datapoint attribute and therefore never show up in any metric labels
For the metrics to work with prometheus, they should include a job and an instance label, by using the service.namespace,service.name,service.instance.id resource attributes, which then hopefully get picked up correctly by the exporter to convert it into the right label.

receiver/prometheus

I was under the impression, that the labels from the docker_sd_configs are added as resource attributes to the scraped metrics. But as I can't find the link to the source right now I am either mistaken or it just is not the case, looking at the log outputs and the target_info metrics.

exporter/prometheusexporter

Looking at the documentation and the target_info metric, I am missing the resource attributes from the dockerstats metrics. Maybe this is due to the missing service attributes or some other reason, but I was unable to see any errors or warnings in the standard log
The resource_to_telemetry_conversion functionality left me a bit speechless, that it wipes all datapoint attributes, especially when there are no resource attributes available. Also activating it would mean, that I would loose (as an example) the interface information from the container.network.io.usage.rx_bytes metric, without any idea from where the actual value is taken or calculated from. A warning in the documentation would be really helpful, or a flag to adjust the behavior based on individual needs.

Right now I am torn between manually transform all the labels of the dockerstats receiver, or create duplicate pipelines with a duplicated exporter, but either way there is some room for improvement to have everything working together smoothly.

Collector version

otel/opentelemetry-collector-contrib:0.101.0

Environment information

Environment

Docker

OpenTelemetry Collector configuration

receivers:
  docker_stats:
    api_version: '1.45'
    collection_interval: 10s
    container_labels_to_metric_labels:
      com.docker.compose.project: compose.project
      com.docker.compose.service: compose.service
    endpoint: "unix:///var/run/docker.sock"
    initial_delay: 1s
    metrics:
      container.restarts:
        enabled: true
      container.uptime:
        enabled: true
    timeout: 5s
  otlp:
    protocols:
      grpc: null
      http: null
  prometheus:
    config:
      global:
        scrape_interval: 30s
      scrape_configs:
      - job_name: otel-collector
        relabel_configs:
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        scrape_interval: 30s
        static_configs:
        - targets:
          - localhost:8888
      - docker_sd_configs:
        - filters:
          - name: label
            values:
            - prometheus.scrape=true
          host: unix:///var/run/docker.sock
        job_name: docker-containers
        relabel_configs:
        - action: replace
          source_labels:
          - __meta_docker_container_label_prometheus_path
          target_label: __metrics_path__
        - action: replace
          regex: /(.*)
          source_labels:
          - __meta_docker_container_name
          target_label: container_name
        - action: replace
          separator: ':'
          source_labels:
          - container_name
          - __meta_docker_container_label_prometheus_port
          target_label: __address__
        - replacement: static.instance.name
          source_labels:
          - __address__
          target_label: instance
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: container_id
        - action: replace
          source_labels:
          - __meta_docker_container_id
          target_label: service_instance_id
        - action: replace
          source_labels:
          - __meta_docker_container_label_service_namespace
          target_label: service_namespace
        - action: replace
          source_labels:
          - container_name
          target_label: service_name
        - action: replace
          source_labels:
          - __meta_docker_container_label_deployment_environment
          target_label: deployment_environment
        - action: replace
          regex: (.+/)?/?(.+)
          replacement: $${1}$${2}
          separator: /
          source_labels:
          - service_namespace
          - service_name
          target_label: job
        scrape_interval: 30s

processors:
  batch: null
  resourcedetection/docker:
    detectors:
    - env
    - docker
    override: true
    timeout: 2s   
#  transform/dockerstats:
#    metric_statements:
#      - context: datapoint
#        statements:
#          - set(attributes["container.id"], resource.attributes["container.id"])
#          - set(attributes["container.name"], resource.attributes["container.name"])
#          - set(attributes["container.hostname"], resource.attributes["container.hostname"])
#          - set(attributes["host.name"], resource.attributes["host.name"])
#          - set(attributes["compose.project"], resource.attributes["compose.project"])
#          - set(attributes["compose.service"], resource.attributes["compose.service"])
#          - set(attributes["deployment.environment"], resource.attributes["deployment.environment"])
#          - set(attributes["service.namespace"], resource.attributes["service.namespace"])

service:
  pipelines:
    metrics:
      exporters:
      - prometheus
      - logging
      processors:
      # - transform/dockerstats
      - resourcedetection/docker
      - batch
      receivers:
      - otlp
      - docker_stats
      - prometheus

Log output

some snippets from individual metric log entries 

`nodexporter` metric through `receiver/prometheus` which contains Data point attributes but no Resource attributes

Metric #9
Descriptor:
     -> Name: node_disk_flush_requests_total
     -> Description: The total number of flush requests completed successfully
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(sr0)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> container_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> container_name: Str(nodeexporter)
     -> device: Str(vda)
     -> service_instance_id: Str(47626dfb6da051ba7858bfa763297be5a20283b510a3faab46dd8a1f2f25210d)
     -> service_name: Str(nodeexporter)
StartTimestamp: 2024-05-29 19:03:04.033 +0000 UTC
Timestamp: 2024-05-29 19:03:04.033 +0000 UTC
Value: 6907671.000000

receiver/dockerstats metric with a Data point attribute, but no Resource attribute

Descriptor:
     -> Name: container.network.io.usage.rx_bytes
     -> Description: Bytes received by the container.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> interface: Str(eth0)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 425806
NumberDataPoints #1
Data point attributes:
     -> interface: Str(eth1)
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 1631176

receiver/dockerstats metric with no Data point attribute, but Resource attributes

Metric #18
Descriptor:
     -> Name: container.uptime
     -> Description: Time elapsed since container start time.
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 2024-05-29 19:02:58.560036953 +0000 UTC
Timestamp: 2024-05-29 19:03:01.664890776 +0000 UTC
Value: 5070.807996
ResourceMetrics #5
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> container.runtime: Str(docker)
     -> container.hostname: Str(my-hostname)
     -> container.id: Str(cacdf88cadd7d8691efefbd0f0c49d256718830b89a3e47f6b65e8e7378534e6f)
     -> container.image.name: Str(my/test-container:0.0.9)
     -> container.name: Str(my-test-container)
     -> host.name: Str(my-hostname)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/dockerstatsreceiver 0.101.0



### Additional context

_No response_

github-actions[bot] commented 5 months ago

Pinging code owners:

receiver/dockerstats: @rmfitzpatrick @jamesmoessis
exporter/prometheus: @Aneurysm9
receiver/prometheus: @Aneurysm9 @dashpole

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jamesmoessis commented 5 months ago

I'm not sure I fully understand your issue, but it's seemingly not to do with the docker stats receiver. It just seems like the prometheus exporter isn't exporting what you expect, or there is some misunderstanding about what it makes available.

I think this would be more helpful if you identified one component that wasn't operating as expected. The docker stats receiver and the prometheus exporter have nothing to do with each other.

If your problem is that the docker stats receiver isn't reporting a metric that it should be, then it's a problem with the docker stats. If the prom exporter isn't doing what you think it should, that's a problem with the prom exporter (or a misconfiguration).

From what I can see it seems that the docker stats receiver is producing all of the information it should, and then the prom exporter is stripping some of the information that you expect. You can verify this by replacing the prom exporter with the debugexporter and see the output straight into stdout. If it's what you expect then you can narrow down the issue to the prom exporter.

Mendes11 commented 4 months ago

I'm experiencing the same issue, but my exporter is awsemf.

Using the debugger, I can see the labels in Resource Attributes, but it seems awsemf is just using the Data Point Attributes when sending the metrics, and it ignores what's in the Resource attributes?

 ResourceMetrics #4
 Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
 Resource attributes:
      -> container.runtime: Str(docker)
      -> container.hostname: Str(c3e61d730cb6)
      -> container.id: Str(c3e61d730cb6c5936b5862844d6e4acf60a880821610a7af9f9a689cffb966db)
      -> container.image.name: Str(couchdb:2.3.1@sha256:5c83dab4f1994ee4bb9529e9b1d282406054a1f4ad957d80df9e1624bdfb35d7)
      -> container.name: Str(swarmpit_db.1.usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_service: Str(swarmpit_db)
      -> swarm_container_id: Str(usj3zlnoxmwjhjc27tc3g5he0)
      -> swarm_namespace: Str(swarmpit)
 ScopeMetrics #0
 ScopeMetrics SchemaURL:
 InstrumentationScope otelcol/dockerstatsreceiver 1.0.0
 Metric #0
 Descriptor:
      -> Name: container.blockio.io_service_bytes_recursive
      -> Description: Number of bytes transferred to/from the disk by the group and descendant groups.
      -> Unit: By
      -> DataType: Sum
      -> IsMonotonic: true
      -> AggregationTemporality: Cumulative
 NumberDataPoints #0
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(read)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4366336
 NumberDataPoints #1
 Data point attributes:
      -> device_major: Str(259)
      -> device_minor: Str(0)
      -> operation: Str(write)
 StartTimestamp: 2024-06-20 19:10:25.725911895 +0000 UTC
 Timestamp: 2024-06-20 19:19:28.761889055 +0000 UTC
 Value: 4096

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/dockerstats: @rmfitzpatrick @jamesmoessis
receiver/prometheus: @Aneurysm9 @dashpole
exporter/prometheus: @Aneurysm9

See Adding Labels via Comments if you do not have permissions to add labels yourself.

open-telemetry / opentelemetry-collector-contrib