kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

consistent metric use for memory #227

Closed · brancz closed this 5 years ago

brancz commented 5 years ago

The resource dashboards use inconsistent metrics for displaying memory usage. Currently the cluster dashboard uses RSS and the namespace one uses total usage. I would propose that both use working-set-bytes, and the pod dashboard continues to show the distinct types as a stacked graph.
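For illustration, a minimal sketch of what such a consistent panel could look like, assuming the usual cAdvisor labels namespace and container are present (on older kubelet/cAdvisor versions the label is container_name instead); this is just an example query, not necessarily the exact expression we would ship:

# Working set memory per namespace, excluding the pause container
# and the cgroup aggregate series (empty container label).
sum by (namespace) (
  container_memory_working_set_bytes{container!="", container!="POD"}
)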

@gouthamve @metalmatze @csmarchbanks @tomwilkie @paulfantom @kakkoyun

csmarchbanks commented 5 years ago

Why working set over RSS? I have found that working set can under-report pretty significantly, and I would personally prefer to over-report by using RSS. For example: [image: working_set_vs_heap_stack]

brancz commented 5 years ago

I am ok with RSS as well (although it is unfortunately no longer as representative as working set bytes for Go programs); as long as we're consistent, I'm happy :)

FWIW we probably should differentiate further between the different types in our Pod dashboard.

csmarchbanks commented 5 years ago

I agree the focus should be on consistency.

RSS is definitely not as useful as I would like for Go >= 1.12. I would be happy to use working set if someone could explain to me how my graphs above show such different values. Otherwise, I think it would be safer to overestimate memory usage by using RSS than to have a pod OOM while our reported memory is not close to the limit.
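To quantify the gap, something like this (the pod name is just an example) would plot the difference directly:

# Difference between RSS and working set for a single pod.
# Positive: RSS exceeds the working set; negative: the working set
# (which includes active file cache) exceeds RSS.
container_memory_rss{pod="prometheus-k8s-0"}
  - container_memory_working_set_bytes{pod="prometheus-k8s-0"}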

brancz commented 5 years ago

Yeah, I need to dig into the OOMkiller again, and I feel like whatever it uses should be the default that we use for display, and then show all the breakdown(s) in the Pod dashboard.

csmarchbanks commented 5 years ago

:+1: that sounds ideal. If you get to digging into the OOMKiller before me I would love to hear what you learn!

brancz commented 5 years ago

Reading this, it sounds like container_memory_working_set_bytes is the right metric to default to.

s-urbaniak commented 5 years ago

Disclaimer: I am not a virtual memory subsystem expert ;-) Just working on consolidating those metrics.

I agree with @brancz on using container_memory_working_set_bytes. It originates from the actual cgroup memory controller. Looking at the cAdvisor code, it is calculated as

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file

This has RSS-ish semantics ("accounted resident memory" minus "unused file caches"), although it might include some fuzziness as per https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
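As a rough illustration of that relation (to my knowledge, total_inactive_file itself is not exported as a separate series by default), a container's reclaimable file cache can be approximated as:

# usage minus working set ≈ inactive file cache,
# i.e. memory the kernel can reclaim under pressure.
container_memory_usage_bytes{pod="prometheus-k8s-0"}
  - container_memory_working_set_bytes{pod="prometheus-k8s-0"}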

@csmarchbanks I rechecked your graph and noted that your stack query doesn't apply the {pod="prometheus-k8s-0"} filter.

On my cluster

go_memstats_heap_inuse_bytes{pod="prometheus-k8s-0"} + go_memstats_stack_inuse_bytes{pod="prometheus-k8s-0"}

is less than container_memory_working_set_bytes{pod="prometheus-k8s-0"}, which is expected.

The latter also accounts for active (i.e. non-evictable) filesystem cache memory, which is not present in the Go heap/stack metrics.

[image attached]

s-urbaniak commented 5 years ago

Ugh, never mind :man_facepalming: the subsequent stack query inherits the label selector from the heap query.

metalmatze commented 5 years ago

Did we more or less reach an agreement on container_memory_working_set_bytes? This is what's used in #238. Could we thus go ahead and merge that PR?

s-urbaniak commented 5 years ago

container_memory_working_set_bytes is the way to go for now, and I agree with going ahead and merging #238 :+1:

Also for another documentation reference about the semantics of that metric: http://www.brendangregg.com/wss.html

(courtesy of @paulfantom)

csmarchbanks commented 5 years ago

I am ok with moving forward with container_memory_working_set_bytes. I would like to dig into the behavior (possible bug?) I posted above, but most of the time working set is good for me.

Also, @s-urbaniak, I do not think the reference you posted by Brendan Gregg describes the same working set as the one reported by cAdvisor. As you said above, container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file, whereas the article you provided tries to calculate recently touched memory.

paulfantom commented 5 years ago

container_memory_usage_bytes - total_inactive_file is a naive way to get "hot" memory (recently touched memory), otherwise known as WSS (Working Set Size).

csmarchbanks commented 5 years ago

I am going to echo what @s-urbaniak said and say I am also not a virtual memory subsystem expert.

Is it possible that the reason I am seeing such a low working set size is that Prometheus caches things in memory but does not touch them for long enough that they get dropped from container_memory_working_set_bytes? If so, I am back to being against using WSS, because that memory cannot be reclaimed by the kernel, and an OOM could happen even when WSS is very low.
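If we do go with WSS, one way to guard against that scenario would be a query along these lines (assuming cAdvisor's container_spec_memory_limit_bytes is available and a limit is actually set; the thresholds are arbitrary):

# Containers whose RSS is above 90% of the memory limit
# while the working set is below 50% of it.
(
    container_memory_rss{container!=""}
  / (container_spec_memory_limit_bytes{container!=""} > 0)
) > 0.9
and
(
    container_memory_working_set_bytes{container!=""}
  / (container_spec_memory_limit_bytes{container!=""} > 0)
) < 0.5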

Another datapoint, today I have a prom server with:

paulfantom commented 5 years ago

I spent some more time digging into the inner workings of the kernel and the Kubernetes memory management system. Based on that, I would say we have 3 main concerns when choosing the right metric: 1) UX, 2) the OOMKiller, 3) pod eviction.

The first one is hopefully self-explanatory, so let's look at the second one, the OOMKiller.

OOMKiller

This beast only takes into account things that can be reliably measured by the kernel and kills the process with the highest oom_score. The score is proportional to RSS + swap divided by the total available memory [1][2], plus an adjuster in the form of oom_score_adj (important for k8s [3]). Since everything in Linux runs in a cgroup, this score can be computed for any container by using the "total available memory" of said cgroup (or of a parent cgroup if the current one doesn't have limits). So if we only considered this, choosing RSS (+ swap) would seem to be the best fit; a rough PromQL proxy is sketched below. However, let's also look at the third concern: pod eviction.
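For illustration, a per-container proxy for that score input could look like the query below; it ignores oom_score_adj and is only an approximation of what the kernel actually computes:

# (RSS + swap) as a fraction of the container's memory limit.
# Containers without a memory limit are filtered out.
(
    container_memory_rss{container!=""}
  + container_memory_swap{container!=""}
)
  / (container_spec_memory_limit_bytes{container!=""} > 0)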

Pod eviction

According to the Kubernetes documentation there are 5 signals which might cause pod eviction [4], and only one of them relates to memory. The memory-based eviction signal is derived from cgroups and is known as memory.available, which is calculated as TOTAL_RAM - WSS (Working Set Size [5]). In this calculation the kubelet excludes the bytes of file-backed memory on the inactive LRU list (known as inactive_file), as this memory is reclaimable under pressure. It is worth noting that the kubelet doesn't look at RSS at all, but makes its decisions based on WSS. So in this scenario it would be better to use WSS, as it is more Kubernetes-specific. Now we just need to find out what happens first, an OOMKill or pod eviction, to provide better UX.
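A rough PromQL sketch of that node-level signal, assuming machine_memory_bytes and the root cgroup series (id="/") are scraped from cAdvisor; exact label handling depends on your scrape configuration:

# memory.available ≈ node capacity minus the root cgroup's working set.
sum by (instance) (machine_memory_bytes)
  - sum by (instance) (container_memory_working_set_bytes{id="/"})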

What's first?

Under normal conditions pod eviction should happen before an OOMKill, due to how node eviction thresholds [6] are set relative to the total available memory. When the thresholds are crossed, the kubelet should report memory pressure and start evicting, so processes avoid being OOMKilled. However, due to how the kubelet obtains its data [7], there can be cases where it doesn't observe the condition before the OOMKiller kicks in.
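As a side note, if kube-state-metrics is running, the following shows whether a node is currently reporting the MemoryPressure condition, which can help correlate OOMKills with observed pressure:

# Nodes currently reporting the MemoryPressure condition.
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1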

Summary

Considering all of these findings, I would say that our reference metric for "used" memory should be WSS. However, we should keep in mind that this makes sense ONLY for Kubernetes, because of the additional memory tweaking the kubelet does for every pod.

[1]: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L547-L557
[2]: https://github.com/torvalds/linux/blob/master//mm/oom_kill.c#L198-L240
[3]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
[4]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy
[5]: http://brendangregg.com/wss.html
[6]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-thresholds
[7]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#kubelet-may-not-observe-memory-pressure-right-away

csmarchbanks commented 5 years ago

Thank you for the in depth description @paulfantom!

One point: I would say I experience far more OOMKills from container limits than pod evictions, but I am sure that depends on your deployment.

I am happy to use WSS for now, and see how it goes. Closing this ticket since #238 has already been merged.