google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

container_fs metrics {device} has no label or info-metric to associate it with a volumeattachment or persistentvolumeclaim #3588

Open ringerc opened 2 months ago

ringerc commented 2 months ago

Problem

There appears to be no label or info-metric to associate cadvisor's container_fs_* metrics with a PersistentVolume attachment or PersistentVolumeClaim, or with the mount-point of the fs within the container. There is only a device label, for the device-node path within the OS.

This makes it seemingly impossible to determine which meaningful volume a container's I/O is associated with; for example, if a database container has two PVs mounted, one for the main DB and one for WAL, and it also has an ephemeral volume for tempfiles and sorts, there seems to be no way to tell which container_fs_writes_bytes_total metric is for which of the volumes.
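For illustration (the container name and device paths here are made up), those three volumes all surface as series that differ only in the opaque device label, with nothing saying which is the main DB, WAL or ephemeral volume:

container_fs_writes_bytes_total{container="postgres", device="/dev/dm-0", pod="...", ...}
container_fs_writes_bytes_total{container="postgres", device="/dev/dm-1", pod="...", ...}
container_fs_writes_bytes_total{container="postgres", device="/dev/nvme1n1", pod="...", ...}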

Proposed feature

It would be enormously helpful if cadvisor added a label, or exposed an info-metric, associating the (device, container) label-pairs from the container_fs_* metrics with the k8s VolumeAttachment name or PersistentVolumeClaim.

It'd also be great to have an info-metric mapping each (container, device) pair to the volume mount path(s) within the container. This can't be done as an extra label on the container_fs_* metrics themselves, because one device node can be mounted multiple times within one container (bind mounts, subvolume mounts, btrfs submounts, etc.). Such a metric would make it possible to see, in monitoring, the container path a volume is mounted on. It would also then be possible to associate the persistent volume, if kube-state-metrics exposed the volumeMount paths for a Pod.

Exposing the filesystem uuid would also be helpful.
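As a rough sketch of what this could look like, here's a hypothetical info-metric (the metric name and the volumeattachment, persistentvolumeclaim and fsuuid labels are all made up; nothing like this exists in cadvisor today) emitting one series with value 1 per (container, device) pair:

container_fs_volume_info{container="...", pod="...", namespace="...", device="/dev/dm-0", volumeattachment="...", persistentvolumeclaim="...", fsuuid="..."} 1

Queries could then multiply the container_fs_* rates by this metric to pick up the extra labels; see the Benefits section below for a sketch.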

Alternatives considered

kube-state-metrics cannot provide this because it has no insight into the device node a container's volumeMount path is associated with; there's nothing usable in a PersistentVolume's or PersistentVolumeClaim's .spec or .status. cadvisor doesn't appear to expose the CSI info or PVC UID that could be used to associate these. There is a VolumeAttachment CR with .status.metadata.devicePath (for some CSIs), which k-s-m exposes as kube_volumeattachment_status_attachment_metadata, but this only seems to be provided by the AWS EKS CSI, and the device paths differ from those seen within the container, e.g. an in-container /dev/dm-0 is exposed as /dev/xvdaa in the attachment metadata. So this is not usable for volume association either.

node-exporter recently gained filesystem_mount_info (https://github.com/prometheus/node_exporter/pull/2970), which maps device to mountpoint, but it isn't container-scoped and doesn't expose the volume attachment, so it's not usable for associating the device with a PV. Its older node_filesystem_avail_bytes{device,mountpoint} similarly exposes host-path mount points under /run/containerd/io.containerd.grpc.v1.cri/ and has no info that could be used to associate with a PV, VolumeAttachment or volumeMount. (Due to mount-scoping rules it cannot see some mounts anyway.)

kubelet metrics don't appear to expose the needed info, and there's nothing apparent in the main k8s metrics docs either. Querying kubectl get --raw "/api/v1/nodes/NODENAME/proxy/metrics" and kubectl get --raw "/api/v1/nodes/NODENAME/proxy/metrics/resource" didn't reveal anything promising.

Kubelet's /stats/summary is (a) deprecated and (b) exposes only the volume's name as listed in Pod.spec.volumes and any pvcRef, not the device-node or mount-path, so it cannot be used to associate metrics. It doesn't have I/O stats either, so it's not an alternative data source.

So I didn't find any way to associate the cadvisor metrics with the VolumeAttachment, PVC or PV: not by filesystem UUID, PV UID, PVC UID, data exposed by the kube-apiserver, or any other existing metrics API server.

Benefits

If a volumeattachment label were available directly or via an info-metric, it could be joined on kube_volumeattachment_spec_source_persistentvolume from kube-state-metrics to find kube_persistentvolumeclaim_info, kube_persistentvolume_info, kube_persistentvolumeclaim_labels, etc.

If a mapping of volumeMount paths to volumes and devices were available, I/O could be associated with a specific container path in reporting and dashboards, e.g. "100 MiB/s on /postgres/data, 200 MiB/s on /postgres/pg_wal, 500 MiB/s on /postgres/ephemeral_store_tablespace".
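As a sketch of the join described above, assuming the hypothetical container_fs_volume_info metric from the proposal and kube-state-metrics' volumeattachment / volumename / persistentvolumeclaim label names (which may differ by k-s-m version), write throughput per PVC could be something like:

# bytes/sec written per PVC, per container
sum by (namespace, pod, container, persistentvolumeclaim) (
    rate(container_fs_writes_bytes_total{job="kubelet", metrics_path="/metrics/cadvisor"}[5m])
  * on (namespace, pod, container, device) group_left (volumeattachment)
    container_fs_volume_info
  * on (volumeattachment) group_left (volumename)
    kube_volumeattachment_spec_source_persistentvolume
  * on (volumename) group_left (persistentvolumeclaim)
    kube_persistentvolumeclaim_info
)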

Details

cadvisor exposes some useful container-level filesystem I/O metrics (container_fs_reads_bytes_total, container_fs_writes_bytes_total, etc.), which carry labels including device (the device-node path the filesystem is mounted from) and name (the container-id without the containerd:// prefix), e.g.

container_fs_reads_bytes_total{container="...", device="/dev/dm-0", job="kubelet", metrics_path="/metrics/cadvisor", name="...", pod="...", ...}

There is nothing here, or in any of the other cadvisor metrics I found, that would allow this to be associated with a persistent volume claim. kube-state-metrics cannot expose this information because it does not have access to the device-node paths from which volumes are mounted within containers. See https://github.com/kubernetes/kube-state-metrics/issues/1701

Looking at the cadvisor source:

There's container_blkio_device_usage_total with major, minor and operation labels, but that doesn't provide any association; the rest only have device as a label.

It looks like cadvisor could expose an info-metric mapping device -> mount points using Mountpoint from https://pkg.go.dev/github.com/moby/sys/mountinfo#Info, and expose the filesystem UUID too. This doesn't provide a way to associate with a PV or PVC directly, but might be usable indirectly via pod metadata from kube-state-metrics etc., since a Pod's volumes and volumeMounts are exposed in the API.
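For example (again with hypothetical metric and label names), an info-metric derived from mountinfo could emit one series per mount, which also copes with a single device node being mounted at more than one path in a container:

container_fs_mount_info{container="...", pod="...", device="/dev/dm-0", mountpoint="/postgres/data", fsuuid="..."} 1
container_fs_mount_info{container="...", pod="...", device="/dev/dm-0", mountpoint="/postgres/data-bind-mount", fsuuid="..."} 1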

Ideally kubelet could expose this mapping instead, perhaps via https://kubernetes.io/docs/reference/instrumentation/cri-pod-container-metrics/, but there's no sign that it does so.


Related: https://github.com/google/cadvisor/issues/1702