Why working set over RSS? I have found that working set can under-report pretty significantly, and I personally would prefer to over-report using RSS. For example:
I am ok with RSS as well (although unfortunately not as representative as working set bytes for go programs anymore), as long as we're consistent I'm happy :)
FWIW we probably should differentiate further between the different types in our Pod dashboard.
I agree the focus should be on consistency.
RSS is definitely not as useful as I would like for go >= 1.12. I would be happy to use working set if someone could explain to me how my above graphs show such different values. But otherwise, I think it would be safer to overestimate memory usage by using RSS than have a pod OOM and our reported memory not be close to the limit.
Yeah, I need to dig into the OOMkiller again, and I feel like whatever it uses should be the default that we use for display, and then show all the breakdown(s) in the Pod dashboard.
:+1: that sounds ideal. If you get to digging into the OOMKiller before me I would love to hear what you learn!
Reading this, it sounds like container_memory_working_set_bytes is the right metric to default to.
Disclaimer: I am not a virtual memory subsystem expert ;-) Just working on consolidating those metrics.
I agree with @brancz on using container_memory_working_set_bytes. It originates from the actual cgroup memory controller. Looking at the cAdvisor code, it is calculated as

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

which has RSS-ish semantics (as in "accounted resident memory" minus "unused file caches"), although it might include some fuzziness as per https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
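If anyone wants to sanity-check that relationship on their own cluster, a minimal PromQL sketch (assuming the usual cAdvisor label set; the pod selector is just the example pod from the graphs above): the difference between the two metrics should correspond to the inactive file cache that cAdvisor subtracts.

container_memory_usage_bytes{pod="prometheus-k8s-0"} - container_memory_working_set_bytes{pod="prometheus-k8s-0"}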
@csmarchbanks I rechecked your graph and noted that your stack query doesn't apply the {pod="prometheus-k8s-0"} filter.
On my cluster

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}

is less than

container_memory_working_set_bytes{pod="prometheus-k8s-0"}

which is expected. The latter also accounts for active (i.e. non-evictable) filesystem cache memory, which is not present in the Go heap/stack metrics.
ugh nevermind :man_facepalming: the subsequent stack query inherits the label selector from the heap query.
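For completeness, a single expression with the selectors spelled out on every term avoids that ambiguity entirely (a sketch; max by (pod) is used on the right side because cAdvisor exposes the working set at several levels of the cgroup hierarchy, and label names may differ across kubelet versions):

sum by (pod) (go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"}) + sum by (pod) (go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}) < max by (pod) (container_memory_working_set_bytes{pod="prometheus-k8s-0"})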
Did we kind of have an agreement on container_memory_working_set_bytes? This is what's used in #238. We could thus go ahead and merge that PR?
container_memory_working_set_bytes is the way to go for now, and I agree with going ahead and merging #238 :+1:
Also, for another documentation reference about the semantics of that metric: http://www.brendangregg.com/wss.html (courtesy of @paulfantom)
I am ok with moving forward with container_memory_working_set_bytes. I would like to dig into the behavior (possible bug?) I posted above, but most of the time working set is good for me.
Also, @s-urbaniak, I do not think the reference you posted by Brendan Gregg describes the same working set as reported by cAdvisor. As you said above, container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file, whereas the article you provided tries to calculate recently touched memory.
container_memory_usage_bytes - total_inactive_file is a naive way to get "hot" memory (recently touched memory), otherwise known as WSS (Working Set Size).
I am going to echo what @s-urbaniak said and say I am also not a virtual memory subsystem expert.
Is it possible that the reason I am seeing such a low working set size is that Prometheus caches things in memory but does not touch them for so long that they are removed from container_memory_working_set_bytes? If so, I am back to being against using WSS, because that memory cannot be reclaimed by the kernel, and an OOM could happen while WSS is very low.
Another datapoint: today I have a prom server with go_memstats_heap_inuse_bytes + go_memstats_stack_inuse_bytes: 40 GB

I spent some more time digging into the inner workings of the kernel and the Kubernetes memory management system. From that I would say we have 3 main concerns when choosing the right metric: 1) UX 2) OOMKiller 3) Pod eviction
The first one is, I hope, self-explanatory, so let's look at the second one: the OOMKiller.
This beast takes into account only things that can be reliably measured by the kernel and kills the process with the highest oom_score. The score is proportional to RSS + swap divided by total available memory [1][2] and also takes into consideration an adjuster in the form of oom_score_adj (important for k8s [3]). Since everything in Linux runs in a cgroup, this score can be computed for any container by using the "total available memory" of said cgroup (or of a parent cgroup if the current one doesn't have limits). So if we only wanted to go this route, it seems like choosing RSS (+ swap) would be the best way. However, let's look at the third option: pod eviction.
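A rough PromQL approximation of that score for containers with a memory limit might look like the sketch below (my assumptions: the cAdvisor metrics container_memory_rss, container_memory_swap and container_spec_memory_limit_bytes are scraped with matching label sets, the limit series reports 0 when no limit is set so it gets filtered out, and oom_score_adj is ignored):

(container_memory_rss + container_memory_swap) / (container_spec_memory_limit_bytes > 0)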
According to the Kubernetes documentation there are 5 signals which might cause pod eviction [4], and only one of them relates to memory. The memory-based eviction signal is derived from cgroups and is known as memory.available, which is calculated as TOTAL_RAM - WSS (Working Set Size [5]). In this calculation the kubelet excludes the number of bytes of file-backed memory on the inactive LRU list, known as inactive_file, as this memory is reclaimable under pressure. It is worth noting that the kubelet doesn't look at RSS, but makes its decisions based on WSS. So in this scenario it would be better to use WSS as it is more Kubernetes-specific. Now we just need to find out what happens first, an OOMKill or a pod eviction, to provide better UX.
Under normal conditions pod eviction should happen before an OOMKill, due to how node eviction thresholds [6] are set relative to all available memory. When thresholds are met, the kubelet should induce memory pressure and processes should avoid the OOMKill. However, due to how the kubelet obtains its data [7], there might be cases where it doesn't see the condition before the OOMKiller kicks in.
Considering all those findings, I would say that our reference metric for "used" memory should be WSS. However, we should keep in mind that this makes sense ONLY for Kubernetes, due to the additional memory tweaking done by the kubelet on every pod.
[1]: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L547-L557
[2]: https://github.com/torvalds/linux/blob/master//mm/oom_kill.c#L198-L240
[3]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
[4]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy
[5]: http://brendangregg.com/wss.html
[6]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-thresholds
[7]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#kubelet-may-not-observe-memory-pressure-right-away
Thank you for the in-depth description @paulfantom!
One point, I would say I experience far more OOMKills from container limits than pod evictions, but I am sure that depends on your deployment.
I am happy to use WSS for now, and see how it goes. Closing this ticket since #238 has already been merged.
The resource dashboards use inconsistent metrics for displaying memory usage. Currently the cluster dashboard uses RSS and the namespace one uses total usage. I would propose that both use working-set-bytes, and the pod dashboard continues to show the distinct types as a stacked graph.
@gouthamve @metalmatze @csmarchbanks @tomwilkie @paulfantom @kakkoyun
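For reference, the kind of query both the cluster and namespace dashboards could then share might look like the sketch below (an assumption on my part: the container!="" selector drops the pod-level aggregate series; on older kubelets the label is container_name instead, so adjust accordingly):

sum by (namespace) (container_memory_working_set_bytes{container!=""})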