google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

empty container field gives 2x sum in container_cpu_usage_seconds_total #2688

Open minrk opened 4 years ago

minrk commented 4 years ago

We have some Grafana charts from Prometheus that look like:

sum(irate(container_cpu_usage_seconds_total{pod=~"name-.*"}[2m])) by (pod)

However, after a recent upgrade to GKE v1.17.9-gke.1504 (from 1.16), we noticed that resource usage seemed to spike. It turns out that, in addition to an entry for each container, there is a matching entry with container undefined, which appears to be the sum of the actual containers, so our charts started always reporting ~2x the 'real' usage. Is there a recommended fix for this, or do we need to manually add ,container!="" to all of our queries to get accurate sums? Or am I misinterpreting what the undefined-container entry is?

The missing-container results are always missing container, image, name keys and are otherwise identical to the 'real' metrics.
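To make the arithmetic concrete, here is a minimal Python sketch of the double counting described above. It is not taken from cAdvisor or Prometheus. The pod name, container names, and rate values are made up for illustration, and `sum_by_pod` is a hypothetical helper that only mimics what `sum(...) by (pod)` does with and without a `container!=""` matcher.

```python
# Hypothetical per-second CPU rates for one pod (illustrative values only).
series = [
    {"pod": "name-abc", "container": "app", "rate": 0.5},
    {"pod": "name-abc", "container": "sidecar", "rate": 0.25},
    # The extra series described in this issue: no container label,
    # and a value equal to the sum of the real containers.
    {"pod": "name-abc", "container": "", "rate": 0.75},
]


def sum_by_pod(rows, skip_empty_container=False):
    """Mimic sum(...) by (pod); optionally drop rows with an empty container label."""
    totals = {}
    for row in rows:
        # In PromQL, a series without a label behaves as if the label were "",
        # so container!="" excludes the series with no container label.
        if skip_empty_container and row["container"] == "":
            continue
        totals[row["pod"]] = totals.get(row["pod"], 0.0) + row["rate"]
    return totals


naive = sum_by_pod(series)                                # counts the aggregate row too
filtered = sum_by_pod(series, skip_empty_container=True)  # container!="" equivalent

print("naive:", naive)
print("filtered:", filtered)
```

With these sample values the unfiltered sum comes out at exactly twice the filtered one, which matches the ~2x reported in the charts.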

Our prometheus is public, so you can see the results here.

I didn't find this exact issue searching for it, but happy for a close & link if this turns out to be a duplicate.

minrk commented 4 years ago

FWIW, container_memory_rss and presumably all container metrics are affected.

paulfantom commented 4 years ago

You need to add container!="POD" to your query.

You can look at kubernetes-mixin project to get ready to use dashboards and alerts which are tested by others. Generated YAMLs can be obtained from https://monitoring.mixins.dev/kubernetes/

inetkiller commented 2 years ago

I also have the same problem

fleetwoodstack commented 10 months ago

We have the exact same issue and I'm so surprised this isn't more recognised as a problem. It's caused double counting on any of our container metrics.

Oddly, for us it only shows up about 12 hours in the past (current metrics aren't affected, but metrics older than 12 hours are).

sidewinder12s commented 7 months ago

This also ends up blowing up cardinality and metric series at scale, which is a major problem with almost all metrics providers.