mrliptontea opened this issue 3 years ago
Just chiming in re: `pause` - I wasn't aware of what they were either, but according to this link the pause container:

> serves as the basis of Linux namespace sharing in the pod. And second, with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.
I am not completely sure about including them in the dashboard expressions, but I think it makes sense to do so by default since they are technically containers within the pod. Plus, it is relatively simple to change those expressions via Jsonnet anyway.
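For anyone who would rather exclude them, it should just be a matter of adding a matcher to the selector - a rough sketch, assuming the pause container shows up with `container="POD"` on your cluster (this can vary by container runtime and kubelet version):

```promql
# Sketch: drop the pause container (and the pod-level cgroup series, which usually
# carries an empty container label) from the per-container memory query.
container_memory_working_set_bytes{container!="POD", container!=""}
```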
Regarding your second point, that is strange - but I can't see anything like it from pods that have restarted on my EKS cluster. Maybe you've come across a bug in cAdvisor?
Yeah, as @snoord said, I think the pause container should be included - it does represent the memory usage of the entire pod.
On the second thing, I agree this might be a cAdvisor bug, but either way a `max` by cluster, namespace, pod and container probably does make sense.
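Something along these lines is what I have in mind - just a sketch, the actual dashboard expression has more to it (template variables, workload joins), but the idea is to collapse duplicate series before summing:

```promql
# Collapse any duplicate cgroup series for the same container (e.g. stale "_1"/"_2"
# entries left over from restarts) before summing across the pod's containers.
sum by (cluster, namespace, pod) (
  max by (cluster, namespace, pod, container) (
    container_memory_working_set_bytes{container!="", container!="POD"}
  )
)
```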
cc @paulfantom @csmarchbanks what do you think about the cAdvisor thing?
I opened a ticket https://github.com/google/cadvisor/issues/2844 so maybe we'll find the root cause.
> These are four instances of the same exact Pod but in reality they never ran at the same time (they're restarts)
It seems that cAdvisor is showing stale data or that kubelet didn't clean up cgroup slices yet. In both cases the suggested `max` expression should help.
Yeah, from experience a bug like this will stick with us for some time - even if it is a cAdvisor bug and gets fixed eventually, cAdvisor is compiled into Kubernetes and people don't tend to update too often. So as @paulfantom said, I think applying the `max` sounds reasonable in any case (I would just make sure we add a comment explaining why it is there).
Wouldn't it be best to choose the one with the largest `_#` value in the name? Not really sure how to do that with PromQL, though.
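As far as I know, PromQL can't order series by a number embedded in a label - `topk` ranks by sample value, not by label - so the closest thing I could come up with is picking the highest-valued duplicate per container, which in practice ends up being the same as the `max`:

```promql
# Sketch: keep a single series per container, choosing the duplicate with the
# highest value (note this ranks by value, not by the "_#" suffix in the name label).
topk by (cluster, namespace, pod, container) (1,
  container_memory_working_set_bytes{container!="", container!="POD"}
)
```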
In my experience, our container gets OOMKilled, and this metric shows up with `_1` and `_2` as described, but `_1` was the one that was OOMKilled and shows a huge amount of memory, which doesn't reflect the state of the restarted container (`_2`). Though I guess if it's just temporary until the kubelet removes the old cgroup stuff, then `max` probably works ok.
I also think a `max` is reasonable. I would rather over-report for a small amount of time, since that highlights that a container was using too much memory. How long do the old `_1` metrics last for? If more than a few minutes, that could be a problem though.
I agree with @csmarchbanks
Just eyeballing it, on my AKS cluster (k8s 1.18) the metric for an old container that got OOMKilled stuck around for about 4.5 minutes. Not sure if that's typical or not though.
Maybe @dashpole can shed some light on what I must assume is some caching/staleness that's happening here? I don't think my opinion changed though; I think using the `max` value at each point in time is probably best.
From my recollection, cAdvisor shouldn't be returning stats for multiple iterations of the same container in a single scrape. It either means there is a bug with cAdvisor not correctly detecting cgroup deletions, or the kubelet/container runtime not cleaning up cgroups for containers that have exited. You would have to check the kubelet/container runtime logs to investigate either of those.
Even without this bug, the sum of container usage isn't guaranteed to be the same as the pod's usage. cAdvisor collects metrics for each container at a different interval to spread out the load it generates. The reason why we added metrics for the POD cgroup in the first place is because they are more accurate if you want the usage of the pod as a whole.
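If the pod-level figure is what the panel is really after, those cgroup series can be queried directly - a sketch, assuming (as on the clusters I've looked at) the pod cgroup is the series with an empty container label:

```promql
# Sketch: pod-level working set straight from the pod cgroup, rather than summing
# per-container series. Assumption: the pod cgroup series carry an empty container
# label on this kubelet/cAdvisor version - worth verifying before relying on it.
# The max just collapses any duplicate series for the same pod.
max by (cluster, namespace, pod) (
  container_memory_working_set_bytes{container="", pod!=""}
)
```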
I came to the same conclusion and raised an issue here with Kubernetes to investigate the latter suggestion.
This issue has not had any activity in the past 30 days, so the `stale` label has been added to it.
- The `stale` label will be removed if there is new activity
- Apply the `keepalive` label to exempt this issue from the stale check action
Thank you for your contributions!
I have been doing some load tests against an app in K8s and I noticed something. The Pod had a limit set to 1Gi, and while I was attacking the app with requests, the Pods restarted a few times.
Then, when I looked at the graphs, it seemed like the Pods were using way over 2.6GiB of memory. That didn't make much sense, so I checked the query behind the graph:
https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/bcf159e29f9abb4713a36db277390d5771ab02e4/dashboards/resources/workload.libsonnet#L145-L151
Querying just the `container_memory_working_set_bytes` metric I got the following (screenshot omitted). A few things can be noticed here:
- `pause` (?) Pods are being included in my application's memory usage - I think this can be fixed by adding `container!="POD"` to the selector, but I don't really understand what role these Pods have. Note that my cluster runs on Amazon EKS, so they may not be commonly present on other clusters.
- There are `name` labels ending in `_7`, `_8`, `_9`, `_10`, for example. These are four instances of the same exact Pod, but in reality they never ran at the same time (they're restarts). I can confirm this from the node graphs: memory usage on the nodes never registered a 2.6GiB increase, only the 1GiB that is my limit. So here I was kind of wondering if `max` could be used instead of `sum`?

I don't know whether this is the best place to discuss this, because my issue could be fixed either here or, I suppose, in cAdvisor? Because why are restarts exported at the same time?
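For reference, the node-level cross-check was nothing fancy - roughly the usual node_exporter memory usage graph (a sketch, assuming node_exporter metrics are being scraped; the exact panel I looked at may have differed slightly):

```promql
# Sketch of the node-side sanity check: overall memory used per node, which never
# showed the ~2.6GiB jump the workload dashboard implied.
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
```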