google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Missing/inconsistent labels for Prometheus container_* metrics? #3216

Open kustodian opened 1 year ago

kustodian commented 1 year ago

We are running GKE with Kubernetes 1.22 and we are seeing some strange labels on the Prometheus container_* metrics. Let me give you a few examples.

I printed all the container_* metrics for a single pod with a PromQL like this:

{pod="node-exporter-24l92", __name__=~"container_.*", service="cadvisor"}

I'm not going to paste the whole output, but here are a few strange metrics.

Some metrics don't have container and image labels

For example container_memory_* looks like this:

container_memory_usage_bytes{container="node-exporter", image="docker.io/bitnami/node-exporter:1.0.1-debian-10-r107", instance="gke-prod-1-te-world-n2-custom-18-2252-ce95d402-tpdp:9338", job="prod-1-k8s-cadvisor", namespace="custom-system", pod="node-exporter-24l92"} 19812352
container_memory_usage_bytes{image="k8s.gcr.io/pause:3.5", instance="gke-prod-1-te-world-n2-custom-18-2252-ce95d402-tpdp:9338", job="prod-1-k8s-cadvisor", namespace="custom-system", pod="node-exporter-24l92"} 184320
container_memory_usage_bytes{instance="gke-prod-1-te-world-n2-custom-18-2252-ce95d402-tpdp:9338", job="prod-1-k8s-cadvisor", namespace="custom-system", pod="node-exporter-24l92"} 20099072

As you can see, the first series has both the container and image labels, the second one is missing container, and the third one is missing both container and image. If a metric is named container_*, I would expect every one of these series to have the container label. I also don't understand what these values represent. I would expect the sum 19812352 + 184320 to equal the last series (the one without any container/image labels), but they don't match:

19812352 + 184320 != 20099072

There are other metrics like this, e.g. container_cpu_*, container_file_*, container_last_seen, and others...

Also, it's very strange that we can see the pause container at all, even though it isn't shown as a container (there is no container label on its series).
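In case it helps anyone reproduce this, selectors along these lines separate the two cases: the first keeps only the series that actually have a container label, the second matches the series with neither container nor image (presumably some pod-level aggregate):

# sum of series that have a non-empty container label (only node-exporter in this pod)
sum(container_memory_usage_bytes{pod="node-exporter-24l92", container!=""})

# the remaining series with neither a container nor an image label
container_memory_usage_bytes{pod="node-exporter-24l92", container="", image=""}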

Some metrics don't have container labels at all

For example container_network_* looks like this:

container_network_receive_bytes_total{image="k8s.gcr.io/pause:3.5", instance="gke-prod-1-te-world-n2-custom-18-2252-ce95d402-tpdp:9338", interface="eth0", job="prod-1-k8s-cadvisor", namespace="custom-system", pod="node-exporter-24l92"} 1857765794377
container_network_receive_bytes_total{image="k8s.gcr.io/pause:3.5", instance="gke-prod-1-te-world-n2-custom-18-2252-ce95d402-tpdp:9338", interface="gkeb0a170be88f", job="prod-1-k8s-cadvisor", namespace="custom-system", pod="node-exporter-24l92"} 851323237953

It's strange that these network metrics don't have the container label at all. Also, what do the values mean? It's very odd to see the pause container apparently generating traffic, and, as above, odd to see the pause container show up at all.
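For what it's worth, if these numbers are per pod rather than per container, the only aggregation that seems meaningful is per pod across all interfaces, roughly (using the same job label as in my output above):

# approximate per-pod receive rate, summed over all interfaces
sum by (namespace, pod) (
  rate(container_network_receive_bytes_total{job="prod-1-k8s-cadvisor"}[5m])
)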

My expectation is that all container_* metrics should have both the container and image labels; otherwise they don't make much sense to me.

ebracho commented 3 days ago

This may be because all containers in a pod share the same network namespace. Since they all communicate over the same interfaces, there's probably no simple way to attribute traffic metrics to individual containers.

The pause image is from a hidden container that runs in all pods, whose purpose is to preserve the network namespace config even when the other container(s) restart or die. See: https://stackoverflow.com/questions/48651269/what-are-the-pause-containers
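If you want to see every series that comes from one of these sandbox containers, you can select on the pause image directly (using the image tag from your output; the exact tag may differ per node pool):

# all series reported for the pod sandbox ("pause") container of this pod
{__name__=~"container_.*", image="k8s.gcr.io/pause:3.5", pod="node-exporter-24l92"}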