mrliptontea opened this issue 3 years ago (status: Open)
What's the issue going on here? I ran into the same problem. See the picture above: the overlapping parts mean the pod is restarting, and during that time there are two series that differ only in the "id" field (the cgroup path). If I use a PromQL query like:
sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
to sum them up, the value is doubled. Please tell me if you have already fixed this, thanks.
Did you ever solve this? We're experiencing the exact same thing.
I have solved this by changing
sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
to
max(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
Now that I think about it, it actually kind of makes sense...
When a container within a Pod terminates due to OOM, k8s will automatically create a new container based on the restartPolicy.
It seems that cAdvisor caches the metrics of the old container for about 4 minutes, so sum(container_memory_working_set_bytes{container!~"POD|", pod="xx"})
adds together the memory of the new container and the old container, making it look as if the Pod is using twice as much memory as usual.
To avoid the above problem, I use this query expression to monitor the memory usage of each Pod:
container_memory_working_set_bytes{container='', pod!=''}
The container='' selector matches the root cgroup node of each Pod, whose resource usage equals the sum of its individual containers.
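For example, a per-Pod memory panel built this way could be something like the query below (just a sketch; add whatever grouping labels fit your setup):
sum(container_memory_working_set_bytes{container='', pod!=''}) by (namespace, pod)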
I never actually managed to find a solution; I'm just wary of this problem whenever dealing with OOM kills, so I never trust peak memory usage. But I suppose something like sum(max by (container) (...))
would work for both single- and multi-container (sidecar) pods.
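Spelled out, that might look something like this (only a sketch; label names assumed from the samples in this thread, and I'd keep pod in the grouping so different Pods don't collapse into one series):
sum by (pod) (max by (pod, container) (container_memory_working_set_bytes{container!="", pod!=""}))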
I think I'm seeing something similar. When using Karpenter (AWS EKS) for auto-scaling, Karpenter adds the label node.kubernetes.io/exclude-from-external-load-balancers as the node is about to go away. This causes duplicate metrics to show up in these cadvisor/kubelet container metrics, which is a nightmare for Prometheus queries, especially when using query operators (e.g. /, *, etc.). Your query might be fine, then you'll hit a window where there are duplicates and get errors about needing group_left/group_right to deal with one-to-many or many-to-one matching. The hack solution is to always apply some aggregation, e.g. max, sum, or something of the like.
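For example, when dividing usage by requests, wrapping both sides in an aggregation sidesteps the duplicate-label errors (a rough sketch; the kube_pod_container_resource_requests series here assumes you also run kube-state-metrics):
# both sides aggregated to (namespace, pod) so duplicate cadvisor series can't break the division
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})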
It seems that cAdvisor will cache the monitoring metrics of the old container for 4 minutes
@LeoHsiao1 Just curious where you found this? I'm trying to understand the situation better and whether this is expected behavior or a bug. I tried to reproduce it locally by running a cadvisor container, keeping /metrics open, then starting and stopping an nginx container. When it stopped, the metrics disappeared right away, so I'm not sure this has the same behavior as the cadvisor built into kubelet.
@jtnz Hi, the 4 minutes is my speculation based on Prometheus charts; it has no theoretical basis.
How should I interpret a report for the same time and the same container but with DIFFERENT names?
Is the used memory the sum of these two values (915144704 + 799879168) or the maximum of the two (max(915144704, 799879168))?
My guess is the maximum, and it's just a duplicate report with the same timestamp but different microseconds. But another possibility is that the container restarted and the two values belong to "different" containers during the overlap, in which case a sum would be correct, since both containers actually use the reported memory bytes...
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/e5c6aec2cc6cf7abbd2163cd195b181954df28fc134478e2cfd27283c2a7838f", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_7", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"} 915144704
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/fcce62cda4d8300c82f4552ddd069b59e8de30c31ece187a6349fe786f182e7a", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_8", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"} 799879168
I appreciate your feedback.
Same here, any news on this?
I believe I have encountered a bug where multiple values are exported for the same Pod at the same point in time when that Pod has been restarted.
I have been running some load tests against an app in K8s and noticed something. The Pod had a memory limit of 1Gi, and while I was attacking the app with requests the Pods restarted a few times.
When I looked at the graphs in Grafana, it seemed like the Pods were using way over 2.6GiB of memory. That didn't make much sense, so I investigated the query, which led me to this issue.
Querying the container_memory_working_set_bytes metric in Prometheus I got the following:

Notice how individual starts of the Pods are recorded for the same point in time - see the name labels ending in _7, _8, _9, _10, for example. These are four instances of the same exact Pod, but in reality they never ran at the same time (they're restarts). If I add these values together it gives 2.6GiB, which is the number I saw in Grafana. I can confirm from other graphs that memory usage on the nodes never registered a 2.6GiB increase; they saw 1GiB, which is my limit.

I use Amazon EKS with Kubernetes version v1.19.6-eks-49a6c0, which I believe uses cAdvisor v0.37.3.
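For reference, the Grafana panel was driven by a sum over this metric, roughly like the query below (the selectors are approximate), which is why the restarts got added together:
# selectors approximate, shown for illustration
sum(container_memory_working_set_bytes{namespace="my-app", pod=~"web-.*", container!=""}) by (pod)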