google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Add detailed memory metrics #3197

Open smcgivern opened 1 year ago

smcgivern commented 1 year ago

Currently, cAdvisor reports the following memory stats: https://github.com/google/cadvisor/blob/ce07bb28eadc18183df15ca5346293af6b020b33/container/libcontainer/handler.go#L802-L844

  1. Usage
  2. MaxUsage
  3. FailCnt
  4. Cache
  5. RSS
  6. Swap
  7. MappedFile
  8. WorkingSet - this is Usage minus inactive_file (see the sketch after this list)
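For reference, the working-set derivation is roughly the following. This is a minimal sketch, not the exact handler.go code; the clamp-at-zero behavior and the hierarchical "total_inactive_file" key are how cgroup v1 stats are typically consumed:

```go
package main

import "fmt"

// workingSet sketches the derivation: usage minus inactive file
// pages, clamped at zero so the result never underflows.
func workingSet(usage uint64, stats map[string]uint64) uint64 {
	inactive, ok := stats["total_inactive_file"]
	if !ok {
		return usage
	}
	if usage < inactive {
		return 0
	}
	return usage - inactive
}

func main() {
	stats := map[string]uint64{"total_inactive_file": 64 << 20}
	fmt.Println(workingSet(256<<20, stats)) // 201326592 (192 MiB)
}
```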

In our case, we'd like more detailed metrics. We find that the working set is often higher than what we want to account for (https://github.com/google/cadvisor/issues/3081 is one example), because, to quote a colleague:

container_memory_working_set_bytes is not what the OOM killer uses, but it is a better leading indicator of OOM risk than just the plain container_memory_usage_bytes. As long as the container's cgroup still has evictable filesystem cache pages, it will try hard to avoid killing processes, and container_memory_working_set_bytes subtracts some (but not all) of those pages.

A bit more about "evictable":

File pages in the "active" list are not evictable... until they get demoted back down to the "inactive" list. When the cgroup is starving for memory and needs to free a page (e.g. to satisfy a process requesting anonymous memory), it can shrink the total number of filesystem cache pages, and then the normal mechanism of demoting pages from the "active" list to the "inactive" list allows those previously unevictable pages to become eviction candidates the next time. There are only a few special cases where file-backed pages tend to not be evictable, which is why when we see an OOM kill event, the kernel's verbose logs for that kill typically show that most of the memory was anonymous, not file-backed.

So from the perspective of the container_memory_working_set_bytes metric, as memory pressure causes the container to shrink its number of file-backed pages to make room for more anonymous memory, both the "active" and "inactive" lists of file-backed pages will tend to shrink. So before reaching OOMK, the metric should be dominated by anonymous memory, and the lead-up to that point should be more or less gradual depending on the relative sizes of the active vs. inactive lists of filesystem cache pages.

I lean towards treating just the anonymous memory by itself as a saturation metric, since on swapless hosts it is guaranteed to be unevictable.

memory.stat includes active_anon, inactive_anon, active_file, and inactive_file (among others), but cAdvisor does not currently expose them: https://docs.kernel.org/admin-guide/cgroup-v1/memory.html#stat-file
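For illustration, here's a sketch of reading those four counters from a cgroup v1 memory.stat file, plus the anon total suggested above as a saturation signal on swapless hosts. The path and the non-hierarchical key names are assumptions about the target environment:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMemoryStat parses a cgroup v1 memory.stat file ("name value"
// per line) into a map of counter name to byte value. Sketch only;
// malformed lines are skipped rather than reported.
func readMemoryStat(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string]uint64)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats, scanner.Err()
}

func main() {
	// Path is an assumption; adjust for your cgroup hierarchy.
	stats, err := readMemoryStat("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, k := range []string{"active_anon", "inactive_anon", "active_file", "inactive_file"} {
		fmt.Printf("%s: %d\n", k, stats[k])
	}
	// Anonymous memory is unevictable on swapless hosts, hence the
	// proposal to treat it as the saturation metric.
	fmt.Println("anon total:", stats["active_anon"]+stats["inactive_anon"])
}
```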

Would you accept a patch to add those?

(Side note: we'd prefer to use RSS instead of WSS, but then we run into issues with programs that use MADV_FREE. Go dropped it in https://github.com/golang/go/issues/42330, but programs in other languages may still use it, which inflates RSS above what might be expected.)

smcgivern commented 1 year ago

This seems similar to https://github.com/google/cadvisor/issues/2634; in our case, if we had LazyFree exposed, we could also take RSS - LazyFree to get the value we're interested in. It looks like https://github.com/google/cadvisor/pull/2767 went stale, though.
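To illustrate that arithmetic: LazyFree is reported per process in smaps/smaps_rollup rather than per cgroup, so this sketch only shows the per-process version of RSS - LazyFree, assuming a kernel recent enough to have smaps_rollup. It is not something cAdvisor exposes today:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// rssMinusLazyFree reads the Rss and LazyFree fields (both in kB)
// from a process's smaps_rollup and returns rss - lazyfree in bytes.
func rssMinusLazyFree(pid int) (uint64, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/smaps_rollup", pid))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var rss, lazy uint64
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Lines look like "Rss:    1234 kB"; the header line won't match.
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "Rss:":
			rss = v
		case "LazyFree:":
			lazy = v
		}
	}
	if lazy > rss {
		lazy = rss
	}
	return (rss - lazy) * 1024, scanner.Err()
}

func main() {
	v, err := rssMinusLazyFree(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("rss - lazyfree (bytes):", v)
}
```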

smcgivern commented 1 year ago

https://github.com/google/cadvisor/compare/master...smcgivern:cadvisor:add-detailed-memory-stats does this, but I'm assuming we'd want to do it conditionally, as it adds four metric series everywhere we collect memory metrics.
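A sketch of the kind of conditional gating that might make sense, using the Prometheus Go client. The flag name and metric names here are hypothetical, not what the branch actually uses:

```go
package main

import (
	"flag"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical opt-in flag; the branch does not define this.
var detailedMemory = flag.Bool(
	"detailed_memory_metrics", false,
	"export active/inactive anon/file memory series",
)

// registerMemoryMetrics adds the four extra series only when the
// flag is set, so deployments that don't want the extra cardinality
// can opt out. Metric and label names are illustrative.
func registerMemoryMetrics(reg prometheus.Registerer) *prometheus.GaugeVec {
	if !*detailedMemory {
		return nil
	}
	g := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "container_memory_detail_bytes",
		Help: "Active/inactive anon/file memory from memory.stat.",
	}, []string{"container", "kind"}) // kind: active_anon, inactive_anon, active_file, inactive_file
	reg.MustRegister(g)
	return g
}

func main() {
	flag.Parse()
	if g := registerMemoryMetrics(prometheus.DefaultRegisterer); g != nil {
		g.WithLabelValues("demo", "active_anon").Set(12345)
	}
}
```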