kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

Reporting memory usage with restarted Pods #585

Open mrliptontea opened 3 years ago

mrliptontea commented 3 years ago

I have been doing some load tests against an app in K8s and noticed something odd. The Pod had a memory limit of 1Gi, and while I was attacking the app with requests the Pods restarted a few times.

Then, when I looked at the graphs, it seemed like the Pods were using well over 2.6GiB of memory. That didn't make much sense, so I checked the query behind the graph:

https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/bcf159e29f9abb4713a36db277390d5771ab02e4/dashboards/resources/workload.libsonnet#L145-L151

Querying just the container_memory_working_set_bytes metric I got the following:

container_memory_working_set_bytes{cluster="", namespace="my-app", container!="", image!=""}
# result:
container_memory_working_set_bytes{container="POD", endpoint="https-metrics", id="/kubepods/burstable/pod80ea171a-411f-48cf-a603-04bd71a03c1e/f44210e4ab1b315ecb5eb3b5f1be4c802f3cb41009f71325f8e0a98ce4ceb921", image="602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1", instance="172.22.218.162:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_web-5dfd4896b4-zc7ll_my-app_80ea171a-411f-48cf-a603-04bd71a03c1e_0", namespace="my-app", node="node1", pod="web-5dfd4896b4-zc7ll"}  598016
container_memory_working_set_bytes{container="POD", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/17f4fd5bb501e2a529114cfcaf6ce9bbfbf4d38276352c396343fffe0ab03d77", image="602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_0", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"}   663552
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/pod80ea171a-411f-48cf-a603-04bd71a03c1e/372804cd3c321ca3f908265296407d3e9f6bd06568aab5f7be7f7047081dd7dc", image="my-image", instance="172.22.218.162:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-zc7ll_my-app_80ea171a-411f-48cf-a603-04bd71a03c1e_7", namespace="my-app", node="node1", pod="web-5dfd4896b4-zc7ll"} 1038004224
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/pod80ea171a-411f-48cf-a603-04bd71a03c1e/8d37738784860ae70513378dba0df77acc15e52001cae5d73ab6799533d06a4d", image="my-image", instance="172.22.218.162:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-zc7ll_my-app_80ea171a-411f-48cf-a603-04bd71a03c1e_8", namespace="my-app", node="node1", pod="web-5dfd4896b4-zc7ll"} 815841280
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/pod80ea171a-411f-48cf-a603-04bd71a03c1e/89d3021410ad8d22f36b53ac714ad962c34a5debb4f6de60c7056620d0721156", image="my-image", instance="172.22.218.162:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-zc7ll_my-app_80ea171a-411f-48cf-a603-04bd71a03c1e_9", namespace="my-app", node="node1", pod="web-5dfd4896b4-zc7ll"} 948023296
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/pod80ea171a-411f-48cf-a603-04bd71a03c1e/4300c0a162a494e32ab02c5b3741d91cc6f4b22cc9049a8cdc3d008b57e2dd8b", image="my-image", instance="172.22.218.162:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-zc7ll_my-app_80ea171a-411f-48cf-a603-04bd71a03c1e_10", namespace="my-app", node="node1", pod="web-5dfd4896b4-zc7ll"}    85966848
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/e384a937574526eebeee4438132a0e882c0517bce4ca84db14497d598cd93a09", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_6", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"}  980393984
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/e5c6aec2cc6cf7abbd2163cd195b181954df28fc134478e2cfd27283c2a7838f", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_7", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"}  915144704
container_memory_working_set_bytes{container="my-app", endpoint="https-metrics", id="/kubepods/burstable/podb7ecf824-6797-4183-9566-434142df3757/fcce62cda4d8300c82f4552ddd069b59e8de30c31ece187a6349fe786f182e7a", image="my-image", instance="172.22.106.15:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_my-app_web-5dfd4896b4-x4nsr_my-app_b7ecf824-6797-4183-9566-434142df3757_8", namespace="my-app", node="node2", pod="web-5dfd4896b4-x4nsr"}  799879168

A few things can be noticed here:

- The query picks up the pause containers (container="POD", the eks/pause image), even though they use barely any memory - should those be counted at all?
- These are four instances of the same exact Pod but in reality they never ran at the same time (they're restarts), yet they are all exported at the same time, so the dashboard's sum stacks the stale instances on top of the live one.

I don't know whether this is the best place to discuss this, because the issue could be fixed either here or, I suppose, in cAdvisor? Because why are restarts exported at the same time in the first place?

snoord commented 3 years ago

Just chiming in re: pause - I wasn't aware of what they were either, but according to this link the pause container:

> serves as the basis of Linux namespace sharing in the pod. And second, with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.

I am not completely sure about including them in the dashboard expressions, but I think it makes sense to do so by default since they are technically containers within the pod. Plus, it is relatively simple to change those expressions via Jsonnet anyway.

Regarding your second point, that is strange - but I can't see anything like it from pods that have restarted on my EKS cluster. Maybe you've come across a bug in cAdvisor?

brancz commented 3 years ago

Yeah, as @snoord said, I think the pause container should be included; it does represent the memory usage of the entire pod.

On the second point, I agree this might be a cAdvisor bug, but either way a max by (cluster, namespace, pod, container) probably does make sense.
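Something like this, as a sketch only (the dashboard's outer sum and workload join are omitted):

```promql
# Collapse the duplicate per-restart series for the same container before any
# further aggregation, so stale cgroup entries aren't added on top of the live one.
max by (cluster, namespace, pod, container) (
  container_memory_working_set_bytes{cluster="", namespace="my-app", container!="", image!=""}
)
```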

cc @paulfantom @csmarchbanks what do you think about the cAdvisor thing?

mrliptontea commented 3 years ago

I opened a ticket https://github.com/google/cadvisor/issues/2844 so maybe we'll find the root cause.

paulfantom commented 3 years ago

These are four instances of the same exact Pod but in reality they never ran at the same time (they're restarts)

It seems that cAdvisor is showing stale data, or that the kubelet hasn't cleaned up the cgroup slices yet. In either case the suggested max expression should help.

brancz commented 3 years ago

Yeah, from experience a bug like this will stick with us for some time: even if it is a cAdvisor bug and gets fixed eventually, cAdvisor is compiled into Kubernetes and people don't tend to update very often. So as @paulfantom said, I think applying the max sounds reasonable in any case (I would just make sure we add a comment explaining why it is there).

m1o1 commented 3 years ago

Wouldn't it be best to choose the one with the largest "_#" value in the name? I'm not really sure how to do that with PromQL, though.

In my experience, our container gets OOMKilled and this metric shows up with _1 and _2 as described, but _1 was the one that was OOMKilled and it shows a huge amount of memory, which doesn't reflect the state of the restarted container (_2). Though I guess if it's only temporary until the kubelet removes the old cgroup, then max probably works ok.
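Selecting the series with the largest _N suffix isn't really expressible in plain PromQL (there is no way to turn the suffix in the name label into a number to compare), so the closest simple option keeps whichever series reports the largest value, which is effectively the max discussed above; a sketch:

```promql
# topk(1) per (namespace, pod, container) keeps the full label set of the winning
# series, but it still picks by value, not by the _N restart suffix in the name label.
topk by (namespace, pod, container) (
  1,
  container_memory_working_set_bytes{container!="", image!=""}
)
```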

csmarchbanks commented 3 years ago

I also think a max is reasonable. I would rather over-report for a short while if it highlights that a container was using too much memory. How long do the old _1 metrics last? If it's more than a few minutes, that could be a problem, though.
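One rough way to measure that: count how many series are being exported for each container at once; anything above 1 means a stale instance is still around, and graphing it over time shows how long the overlap lasts.

```promql
# Number of series currently reported per container; a value > 1 means an old
# (restarted or OOMKilled) instance of the container is still being exported.
count by (namespace, pod, container) (
  container_memory_working_set_bytes{container!="", container!="POD", image!=""}
)
```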

brancz commented 3 years ago

I agree with @csmarchbanks

m1o1 commented 3 years ago

Just eyeballing it, on my AKS cluster (k8s 1.18) the metric for an old container that got OOMKilled stuck around for about 4.5 minutes. Not sure if that's typical or not though.

brancz commented 3 years ago

Maybe @dashpole can shed some light on what I must assume is some caching/staleness happening here? I don't think my opinion has changed, though: using the max value at each point in time is probably best.

dashpole commented 3 years ago

From my recollection, cAdvisor shouldn't be returning stats for multiple iterations of the same container in a single scrape. It either means there is a bug with cAdvisor not correctly detecting cgroup deletions, or the kubelet/container runtime not cleaning up cgroups for containers that have exited. You would have to check the kubelet/container runtime logs to investigate either of those.

Even without this bug, the sum of container usage isn't guaranteed to be the same as the pod's usage. cAdvisor collects metrics for each container at a different interval to spread out the load it generates. The reason why we added metrics for the POD cgroup in the first place is because they are more accurate if you want the usage of the pod as a whole.
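For the pod-level number itself, a hedged sketch, assuming the kubelet's cAdvisor endpoint exposes the pod cgroup as the series with an empty container label (this can vary with kubelet and runtime versions):

```promql
# Working set of the pod cgroup itself, rather than a sum over per-container
# series; this sidesteps both the duplicate restart series and the staggered
# per-container collection intervals mentioned above.
container_memory_working_set_bytes{namespace="my-app", pod!="", container=""}
```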

JoeAshworth commented 1 year ago

> From my recollection, cAdvisor shouldn't be returning stats for multiple iterations of the same container in a single scrape. It either means there is a bug with cAdvisor not correctly detecting cgroup deletions, or the kubelet/container runtime not cleaning up cgroups for containers that have exited. You would have to check the kubelet/container runtime logs to investigate either of those.
>
> Even without this bug, the sum of container usage isn't guaranteed to be the same as the pod's usage. cAdvisor collects metrics for each container at a different interval to spread out the load it generates. The reason why we added metrics for the POD cgroup in the first place is because they are more accurate if you want the usage of the pod as a whole.

I came to the same conclusion and raised an issue here with Kubernetes to investigate the latter suggestion.

github-actions[bot] commented 1 day ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

Thank you for your contributions!