open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

docker_stats: no per-container stats? #21247

Closed toxic0berliner closed 1 year ago

toxic0berliner commented 1 year ago

Component(s)

receiver/dockerstats

Describe the issue you're reporting

I'm not receiving any stats for individual containers.

I started otel-collector using the following compose settings:

docker-compose.yml
```
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    hostname: otel-collector
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    user: "0" # needed to access docker.sock
    labels:
      traefik.http.services.otel-collector.loadbalancer.server.port: 8889
      traefik.http.routers.otel-collector.service: otel-collector
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
      - /var/run/docker.sock:/var/run/docker.sock:rw
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    ports:
      # - 1888:1888   # pprof extension
      # - 8888:8888   # Prometheus metrics exposed by the collector
      - 8889:8889     # Prometheus exporter metrics
      # - 13133:13133 # health_check extension
      # - 4317:4317   # OTLP gRPC receiver
      # - 4318:4318   # OTLP http receiver
      # - 55679:55679 # zpages extension
```

And this is my otel-collector-config.yml:

otel-collector-config.yml
```
receivers:
  otlp:
    protocols:
      grpc:
  docker_stats:
    endpoint: "unix:///var/run/docker.sock"
    api_version: 1.41
    collection_interval: 2s
    timeout: 20s
    env_vars_to_metric_labels:
      - com.docker.compose.project: container_stack_name
      - com.docker.compose.service: container_service_name

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:

processors:
  batch:

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  telemetry:
    logs:
      level: "debug"
  extensions: [pprof, zpages, health_check]
  pipelines:
    metrics:
      receivers: [docker_stats]
      processors: []
      exporters: [logging, prometheus]
```

All I'm getting under http://myhost:8080/metrics is this:

/metrics
```
# HELP container_cpu_percent Percent of CPU used by the container.
# TYPE container_cpu_percent gauge
container_cpu_percent 0.4280678815489749
# HELP container_cpu_usage_kernelmode Time spent by tasks of the cgroup in kernel mode (Linux). Time spent by all container processes in kernel mode (Windows).
# TYPE container_cpu_usage_kernelmode counter
container_cpu_usage_kernelmode 2.4254e+11
# HELP container_cpu_usage_total Total CPU time consumed.
# TYPE container_cpu_usage_total counter
container_cpu_usage_total 6.95768776915e+11
# HELP container_cpu_usage_usermode Time spent by tasks of the cgroup in user mode (Linux). Time spent by all container processes in user mode (Windows).
# TYPE container_cpu_usage_usermode counter
container_cpu_usage_usermode 4.2274e+11
# HELP container_memory_percent Percentage of memory used.
# TYPE container_memory_percent gauge
container_memory_percent 0.2483825322600396
# HELP container_memory_total_cache Total amount of memory used by the processes of this cgroup (and descendants) that can be associated with a block on a block device. Also accounts for memory used by tmpfs.
# TYPE container_memory_total_cache gauge
container_memory_total_cache 5.5816192e+07
# HELP container_memory_usage_limit Memory limit of the container.
# TYPE container_memory_usage_limit gauge
container_memory_usage_limit 1.6624267264e+10
# HELP container_memory_usage_total Memory usage of the container. This excludes the total cache.
# TYPE container_memory_usage_total gauge
container_memory_usage_total 4.1291776e+07
# HELP container_network_io_usage_rx_bytes Bytes received by the container.
# TYPE container_network_io_usage_rx_bytes counter
container_network_io_usage_rx_bytes{interface="eth0"} 3.728707676e+09
container_network_io_usage_rx_bytes{interface="eth1"} 8.354612e+06
# HELP container_network_io_usage_rx_dropped Incoming packets dropped.
# TYPE container_network_io_usage_rx_dropped counter
container_network_io_usage_rx_dropped{interface="eth0"} 0
container_network_io_usage_rx_dropped{interface="eth1"} 0
# HELP container_network_io_usage_tx_bytes Bytes sent.
# TYPE container_network_io_usage_tx_bytes counter
container_network_io_usage_tx_bytes{interface="eth0"} 3.734452545e+09
container_network_io_usage_tx_bytes{interface="eth1"} 1.1080344e+07
# HELP container_network_io_usage_tx_dropped Outgoing packets dropped.
# TYPE container_network_io_usage_tx_dropped counter
container_network_io_usage_tx_dropped{interface="eth0"} 0
container_network_io_usage_tx_dropped{interface="eth1"} 0
```

As such, I'm not seeing each container's CPU usage and so on...

I do see in the otel-collector container logs that it's supposed to be fetching them:

otel-collector logs
```
otel-collector | 2023-04-28T13:55:41.186Z debug prometheusexporter@v0.75.0/accumulator.go:90 accumulating metric: container.memory.usage.limit {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
otel-collector | 2023-04-28T13:55:41.186Z debug prometheusexporter@v0.75.0/accumulator.go:90 accumulating metric: container.memory.usage.total {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
otel-collector | 2023-04-28T13:55:41.186Z debug docker@v0.75.0/docker.go:162 Fetching container stats. {"kind": "receiver", "name": "docker_stats", "data_type": "metrics", "id": "8dbb3870e58153a523c2a83f48561b3a8cf11d55a07c93520b4bea5f5a96a3ca"}
otel-collector | 2023-04-28T13:55:41.186Z debug docker@v0.75.0/docker.go:162 Fetching container stats. {"kind": "receiver", "name": "docker_stats", "data_type": "metrics", "id": "b94f3e8a1e3acaf725f6bdd875b622e2433249aa15e86c909e55fb26f02177bc"}
```

I think I've solved the access issues with user: "0", since I'm not seeing any Permission denied errors anymore...

Can someone help? Or am I wrong to expect stats for each container, and this receiver only reports overall system stats?

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jamesmoessis commented 1 year ago

Hi @toxic0berliner, the container information is part of the resource attributes, which sit at a "higher level". I'm not sure how the prom exporter converts those or where it displays that information.

I'd be interested to see the output with a detailed log level in the logging exporter. That would show the exact metrics being exported, before they are manipulated by the prom exporter.

Can you post the output with the following config?

```
exporters:
  logging:
    verbosity: detailed
```

toxic0berliner commented 1 year ago

Wow, got some new logs with this!!!

here they are:

```
2023-05-01T21:54:09.403Z debug docker@v0.76.3/docker.go:162 Fetching container stats. {"kind": "receiver", "name": "docker_stats", "data_type": "metrics", "id": "3d3c60710dc3c6bc743ae4b9be40c5de1764093ed65b3b815dbfcdf8228ddc78"}
2023-05-01T21:54:09.542Z info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 27, "metrics": 320, "data points": 324}
2023-05-01T21:54:09.546Z info ResourceMetrics #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> container.image.name: Str(hotio/bazarr)
     -> container.name: Str(bazarr)
ScopeMetrics #0
InstrumentationScope otelcol/dockerstatsreceiver 0.76.3
Metric #0
Descriptor:
     -> Unit: 1
     -> DataType: Gauge
NumberDataPoints #0
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 0.192984
Metric #1
Descriptor:
     -> Name: container.cpu.usage.kernelmode
     -> Unit: ns
     -> DataType: Sum
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Descriptor:
     -> Name: container.cpu.usage.total
     -> Description: Total CPU time consumed.
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
     -> Description: Time spent by tasks of the cgroup in user mode (Linux). Time spent by all container processes in user mode (Windows).
     -> Unit: ns
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
     -> Description: Percentage of memory used.
     -> Unit: 1
     -> DataType: Gauge
     -> Description: Total amount of memory used by the processes of this cgroup (and descendants) that can be associated with a block on a block device. Also accounts for memory used by tmpfs.
     -> Unit: By
     -> DataType: Sum
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 32083968
Metric #6
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Value: 16624267264
Metric #7
Descriptor:
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Metric #8
Descriptor:
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 37894553
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
     -> Name: container.memory.total_cache
     -> Description: Total amount of memory used by the processes of this cgroup (and descendants) that can be associated with a block on a block device. Also accounts for memory used by tmpfs.
     -> Unit: By
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 339968
     -> Name: container.memory.usage.limit
     -> Description: Memory limit of the container.
     -> DataType: Sum
     -> IsMonotonic: false
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 16529591
Metric #9
     -> Name: container.network.io.usage.tx_bytes
     -> Description: Bytes sent.
     -> DataType: Sum
     -> IsMonotonic: true
NumberDataPoints #0
Data point attributes:
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 0
Metric #11
     -> Name: container.network.io.usage.tx_dropped
     -> Description: Outgoing packets dropped.
     -> Unit: {packets}
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> interface: Str(eth0)
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 0
ResourceMetrics #2
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> container.runtime: Str(docker)
     -> container.name: Str(certdumper)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope otelcol/dockerstatsreceiver 0.76.3
Metric #0
     -> Name: container.cpu.percent
     -> Description: Percent of CPU used by the container.
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 0.000000
Metric #1
Descriptor:
     -> Name: container.cpu.usage.kernelmode
     -> Unit: ns
     -> DataType: Sum
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 211384068886
     -> Description: Time spent by tasks of the cgroup in user mode (Linux). Time spent by all container processes in user mode (Windows).
     -> Unit: ns
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 82730000000
Metric #4
Descriptor:
     -> Name: container.memory.percent
Value: 0.002932
Metric #5
Descriptor:
     -> Name: container.memory.total_cache
     -> Description: Total amount of memory used by the processes of this cgroup (and descendants) that can be associated with a block on a block device. Also accounts for memory used by tmpfs.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2023-05-01 21:53:55.417241272 +0000 UTC
Timestamp: 2023-05-01 21:54:09.39371476 +0000 UTC
Value: 10264576
Metric #6
Descriptor:
     -> Name: container.memory.usage.limit
Descriptor:
     -> Name: container.memory.usage.total
     -> Description: Memory usage of the container. This excludes the total cache.
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
     -> container.image.name: Str(authelia/authelia:master)
     -> container.name: Str(authelia)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope otelcol/dockerstatsreceiver 0.76.3
Metric #0
Descriptor:
     -> Name: container.cpu.percent
```

This goes on for quite a while, with pretty much everything I'd like to see in prom indeed ;)

But still nothing more to see in /metrics...

Just to be fully transparent, here is the real original config file with all the comments and messiness. I only removed the batch processor just in case; turns out it doesn't change anything...

otel-collector config file
```
receivers:
  otlp:
    protocols:
      grpc:
  docker_stats:
    endpoint: "unix:///var/run/docker.sock"
    api_version: 1.41
    collection_interval: 2s
    timeout: 20s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    #const_labels:
    #  label1: value1
  logging:
    verbosity: detailed
  #zipkin:
  #  endpoint: "http://zipkin-all-in-one:9411/api/v2/spans"
  #  format: proto
  #jaeger:
  #  endpoint: jaeger-all-in-one:14250
  #  tls:
  #    insecure: true

processors:
  batch:

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

service:
  telemetry:
    logs:
      level: "debug"
  extensions: [pprof, zpages, health_check]
  pipelines:
    #traces:
    #  receivers: [otlp]
    #  processors: [batch]
    #  #exporters: [logging, zipkin, jaeger]
    #  exporters: [logging]
    #metrics:
    #  receivers: [otlp]
    #  processors: [batch]
    #  #exporters: [logging, prometheus]
    #  exporters: [prometheus]
    metrics/docker-orion:
      receivers: [docker_stats]
      processors: []
      exporters: [logging, prometheus]
```

Seeing all these stats, I believe it answers one of my questions: docker_stats does indeed export metrics for each container, so otel-collector with docker_stats is a good candidate to replace cAdvisor, which is eating up the CPU on my poor Synology NAS.

Since the data shows up in the logs now, I'm guessing I messed up some config. I'm still very new to otel, so sorry if I flagged this as a docker_stats issue when it might very well be a mistake on my part; if you can help me find it, that would be great!

Note that I tried the same setup on an Ubuntu VM running Docker, just to ensure there is no trickery specific to Synology, and no, the same setup also does not give the per-container metrics in the /metrics endpoint of otel-collector :(

jamesmoessis commented 1 year ago

No problem at all @toxic0berliner - it does look like it's something to do with the prom exporter. My advice would be to look at how the prom exporter handles resource attributes. Unfortunately, off the top of my head I don't have the answer to why those attributes aren't showing up on /metrics.

toxic0berliner commented 1 year ago

Would you maybe know how to enable some more logs on the prom exporter?

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

toxic0berliner commented 1 year ago

Are we sure it's not an issue with this receiver, but rather with the Prometheus exporter instead?

toxic0berliner commented 1 year ago

Can someone maybe relabel this as exporter/prometheus and help me have a look? I can't seem to find anything to solve it by myself :(

Artyomlad commented 1 year ago

Same here. To verify, we used this receiver with the file exporter (instead of Prometheus), and indeed the file content contained all the Docker container metrics.

In the Prometheus exporter we used the "resource_to_telemetry_conversion" setting, and we got a different output that does include the individual Docker containers.

toxic0berliner commented 1 year ago

Amazing, I will test that out. Thanks for the hint on this hidden setting!

toxic0berliner commented 1 year ago

Got it working, thanks a lot. Now I need to submit an enhancement request to expose the labels, as I am missing container_label_com_docker_compose_project at least, which is very much needed to group metrics by compose project ;)

toxic0berliner commented 1 year ago

This can be closed: it works like a charm, and the config can be adjusted to expose Docker labels, so everything is fine and working. Thanks again @Artyomlad for the kind help!