open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Pod and container level io stats via cgroups #35218

Open RainofTerra opened 2 months ago

RainofTerra commented 2 months ago

Component(s)

receiver/hostmetrics, receiver/kubeletstats

Is your feature request related to a problem? Please describe.

In the past we have used something like Telegraf with an iostats plugin to monitor system-wide I/O statistics (IOPS, throughput, etc.) on servers running high-I/O services (like our internal datastore, or Kafka). In Kubernetes (we're using EKS), that data is available at each cgroup level in the io.stat file. Pod level:

[root@ip-1-2-3-4 kubepods-burstable-podcfff9e92_9e21_41e7_b59e_59dfaeca3c2b.slice]# pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcfff9e92_9e21_41e7_b59e_59dfaeca3c2b.slice
[root@ip-1-2-3-4 kubepods-burstable-podcfff9e92_9e21_41e7_b59e_59dfaeca3c2b.slice]# cat io.stat
259:0 rbytes=158363648 wbytes=0 rios=8977 wios=0 dbytes=0 dios=0

Container level:

[root@ip-1-2-3-4 cri-containerd-a796269837fbc314e36e5d0b1997e558548c68bfc3c6819fec00d49abb9b4d90.scope]# pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcfff9e92_9e21_41e7_b59e_59dfaeca3c2b.slice/cri-containerd-a796269837fbc314e36e5d0b1997e558548c68bfc3c6819fec00d49abb9b4d90.scope
[root@ip-1-2-3-4 cri-containerd-a796269837fbc314e36e5d0b1997e558548c68bfc3c6819fec00d49abb9b4d90.scope]# cat io.stat
259:0 rbytes=135905280 wbytes=0 rios=7120 wios=0 dbytes=0 dios=0
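
For illustration, each io.stat line shown above could be parsed along these lines. This is only a sketch, not code from the collector: the `IOStat` type and `parseIOStatLine` function are hypothetical names, and it assumes every value after the device number is an unsigned integer counter, as in the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IOStat holds the counters from one line of a cgroup io.stat file.
// Hypothetical type, introduced only for this sketch.
type IOStat struct {
	Device string            // "major:minor", e.g. "259:0"
	Values map[string]uint64 // e.g. "rbytes" -> 135905280
}

// parseIOStatLine parses a line such as
// "259:0 rbytes=135905280 wbytes=0 rios=7120 wios=0 dbytes=0 dios=0".
// Assumption: every key=value pair carries an unsigned integer.
func parseIOStatLine(line string) (IOStat, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return IOStat{}, fmt.Errorf("empty io.stat line")
	}
	st := IOStat{
		Device: fields[0],
		Values: make(map[string]uint64, len(fields)-1),
	}
	for _, kv := range fields[1:] {
		key, val, ok := strings.Cut(kv, "=")
		if !ok {
			return IOStat{}, fmt.Errorf("malformed field %q", kv)
		}
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			return IOStat{}, fmt.Errorf("field %q: %w", kv, err)
		}
		st.Values[key] = n
	}
	return st, nil
}

func main() {
	st, err := parseIOStatLine("259:0 rbytes=135905280 wbytes=0 rios=7120 wios=0 dbytes=0 dios=0")
	if err != nil {
		panic(err)
	}
	// Prints: 259:0 135905280 7120
	fmt.Println(st.Device, st.Values["rbytes"], st.Values["rios"])
}
```

The counters are cumulative, so IOPS and throughput would come from the difference between two samples divided by the sampling interval.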

Describe the solution you'd like

It would be useful to be able to take something like system.disk.operations and group it by pod name and container name. Currently we can only get it for the overall node. This would let us do things like monitor the I/O of individual containers (we have both a reader and a writer container, and we'd like to be able to see their I/O separately).
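
As a rough sketch of the grouping step, the pod UID and container ID can be recovered from cgroup paths like the ones shown earlier. The function name `cgroupIDs` is hypothetical, and the sketch assumes the systemd cgroup driver layout with containerd seen in this issue (`...-pod<uid>.slice`, `cri-containerd-<id>.scope`, with `-` in the UID written as `_`); other runtimes or cgroup drivers would need different patterns.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// cgroupIDs extracts the pod UID and container ID from a cgroup path.
// Hypothetical helper; assumes the systemd driver + containerd naming
// shown in this issue. Missing parts are returned as empty strings.
func cgroupIDs(cgroupPath string) (podUID, containerID string) {
	for _, part := range strings.Split(path.Clean(cgroupPath), "/") {
		switch {
		case strings.HasSuffix(part, ".slice"):
			name := strings.TrimSuffix(part, ".slice")
			if i := strings.LastIndex(name, "-pod"); i >= 0 {
				// The UID's dashes appear as underscores in the slice name.
				podUID = strings.ReplaceAll(name[i+len("-pod"):], "_", "-")
			}
		case strings.HasPrefix(part, "cri-containerd-") && strings.HasSuffix(part, ".scope"):
			containerID = strings.TrimSuffix(strings.TrimPrefix(part, "cri-containerd-"), ".scope")
		}
	}
	return podUID, containerID
}

func main() {
	p := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-podcfff9e92_9e21_41e7_b59e_59dfaeca3c2b.slice/" +
		"cri-containerd-a796269837fbc314e36e5d0b1997e558548c68bfc3c6819fec00d49abb9b4d90.scope"
	uid, cid := cgroupIDs(p)
	fmt.Println("pod UID:", uid)
	fmt.Println("container ID:", cid)
}
```

Mapping the UID and container ID to a pod name and container name would still require Kubernetes metadata, which this sketch does not cover.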

Describe alternatives you've considered

No response

Additional context

No response

ChrsMark commented 2 months ago

It would be useful to be able to take something like system.disk.operations and group it by pod name and container name. Currently we can only get it for the overall node. This would let us do things like monitor the I/O of individual containers (we have both a reader and a writer container, and we'd like to be able to see their I/O separately).

If I understand this correctly, the proposal is to emit a metric called system.disk.operations with the proper container and k8s metadata as attributes?

My concern here is that we should first come up with a valid data model. At the moment, the system.* namespace is supposed to be used for metrics that relate to a system/host/VM as a whole. Then we have the process.* namespace for per-process metrics. So in this case, I assume we should emit separate per-container/per-pod metrics rather than reuse system.*, right?

On another note, I wonder whether this metric could be obtained directly by scraping cAdvisor's Prometheus endpoint: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics. In that case, wouldn't this already be possible using the prometheus receiver?

github-actions[bot] commented 1 week ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.