mrliptontea opened this issue 3 years ago (status: Open)
What's the issue going on here? I ran into the same problem. See the picture above: the overlapping parts mean the pod is restarting, and during that time there are two series that differ only in the "id" field (the cgroup path). If I use a PromQL query like:
sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
to sum them up, the value is doubled. Please tell me if you have already fixed this, thanks.
Did you ever solve this? We're experiencing the exact same thing.
I have solved this by changing
sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
to
max(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
Now that I think about it, it actually kind of makes sense...
When a container within a Pod terminates due to OOM, k8s will automatically create a new container based on the restartPolicy.
It seems that cAdvisor caches the metrics of the old container for about 4 minutes, so sum(container_memory_working_set_bytes{container!~"POD|", pod="xx"})
adds together the memory of the new container and the old container, making it look as if the Pod is using twice as much memory as usual.
To avoid the above problem, I use this query expression to monitor the memory usage of each Pod:
container_memory_working_set_bytes{container='', pod!=''}
The container='' selector matches the root cgroup node of each Pod, whose resource usage equals the sum of its individual containers.
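For example, a per-Pod memory panel built this way could be something like the query below (just a sketch; add whatever grouping labels fit your setup):
sum(container_memory_working_set_bytes{container='', pod!=''}) by (namespace, pod)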
I never actually managed to find a solution; I'm just wary of this problem whenever dealing with OOM kills, so I never trust peak memory usage. But I suppose something like sum(max by (container) (...))
would work for both single- and multi-container (sidecar) pods.
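Spelled out, that might look something like this (only a sketch; label names assumed from the samples in this thread, and I'd keep pod in the grouping so different Pods don't collapse into one series):
sum by (pod) (max by (pod, container) (container_memory_working_set_bytes{container!="", pod!=""}))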
I think I'm seeing something similar. When using Karpenter (AWS EKS) for auto-scaling, Karpenter adds the label node.kubernetes.io/exclude-from-external-load-balancers as the node is about to go away. This causes duplicate metrics to show up in these cadvisor/kubelet container metrics, which is a nightmare for Prometheus queries, especially when using query operators (e.g. /, *, etc.). Your query might be fine, then you'll hit a window where there are duplicates and get errors about needing group_left/group_right to deal with one-to-many or many-to-one matching. The hack solution is to always apply some aggregation, e.g. max, sum, or something of the like.
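For example, when dividing usage by requests, wrapping both sides in an aggregation sidesteps the duplicate-label errors (a rough sketch; the kube_pod_container_resource_requests series here assumes you also run kube-state-metrics):
# both sides aggregated to (namespace, pod) so duplicate cadvisor series can't break the division
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})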
It seems that cAdvisor will cache the monitoring metrics of the old container for 4 minutes
@LeoHsiao1 Just curious where you found this? I'm trying to understand the situation better and whether this is expected behavior or a bug. I tried to reproduce it locally by running a cadvisor container, keeping /metrics open, then starting and stopping an nginx container. When it stopped, the metrics disappeared right away, so I'm not sure this has the same behavior as the cadvisor built into kubelet.
@jtnz Hi, the 4 minutes is my speculation based on Prometheus charts; it has no theoretical basis.
How should I interpret a report for the same time and the same container but with DIFFERENT names?
Is the used memory the sum of these two values (915144704 + 799879168) or the maximum of the two (max(915144704, 799879168))?
My guess is the maximum, and it's just a duplicate report with the same timestamp but different microseconds. But another possibility is that the container restarted and the two values belong to "different" containers during the overlap, in which case a sum would be correct, since both containers actually use the reported memory bytes...
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/e5c6aec2cc6cf7abbd2163cd195b181954df28fc134478e2cfd27283c2a7838f", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_7", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"} 915144704
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/fcce62cda4d8300c82f4552ddd069b59e8de30c31ece187a6349fe786f182e7a", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_8", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"} 799879168
I appreciate your feedback.
Same here, any news on this?
I believe I have encountered a bug where multiple values are exported for the same Pod at the same point in time when that Pod has been restarted.
I have been running some load tests against an app in K8s and noticed something. The Pod had a memory limit of 1Gi, and while I was attacking the app with requests the Pods restarted a few times.
When I looked at the graphs in Grafana, it seemed like the Pods were using way over 2.6GiB of memory. That didn't make much sense, so I investigated the query, which led me to this issue.
Querying the container_memory_working_set_bytes metric in Prometheus I got the following:

Notice how individual starts of the Pods are recorded for the same point in time - see the name labels ending in _7, _8, _9, _10, for example. These are four instances of the same exact Pod, but in reality they never ran at the same time (they're restarts). If I add these values together it gives 2.6GiB, which is the number I saw in Grafana. I can confirm from other graphs that memory usage on the nodes never registered a 2.6GiB increase; they saw 1GiB, which is my limit.

I use Amazon EKS with Kubernetes version v1.19.6-eks-49a6c0, which I believe uses cAdvisor v0.37.3.
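For reference, the Grafana panel was driven by a sum over this metric, roughly like the query below (the selectors are approximate), which is why the restarts got added together:
# selectors approximate, shown for illustration
sum(container_memory_working_set_bytes{namespace="my-app", pod=~"web-.*", container!=""}) by (pod)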