google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Metric `container_memory_working_set_bytes` includes `slab` reclaimable memory #3081


cyrus-mc commented 2 years ago

I ran into a somewhat unique situation in which a pod had very high slab memory, around 1.1 GB worth. In terms of anonymous and active file memory, usage was only around 25 MB. The working set comes out at 1.1 GB because slab reclaimable memory isn't subtracted in the workingSet calculation.

Working set calculation

  // Working set = total memory usage minus inactive file-backed pages.
  ret.Memory.Usage = s.MemoryStats.Usage.Usage
  workingSet := ret.Memory.Usage
  if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
    // Guard against underflow if inactive_file exceeds usage.
    if workingSet < v {
      workingSet = 0
    } else {
      workingSet -= v
    }
  }
  ret.Memory.WorkingSet = workingSet

Here MemoryStats.Usage.Usage is the value from memory.current (cgroup v2) or memory.usage_in_bytes (cgroup v1). The memory statistics file (memory.stat) contains the following fields:

anon 663552
file 10313728
kernel_stack 49152
...
inactive_anon 573440
active_anon 32768
inactive_file 5066752
active_file 5246976
unevictable 0
slab_reclaimable 1232589368
slab_unreclaimable 128408
slab 1232717776
...

Here slab_reclaimable is memory that the kernel can reclaim when needed. Should this value also be subtracted when calculating workingSet?
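
For illustration only, a rough sketch of what that change might look like on top of the handler code above. Note that slabReclaimableKeyName is a hypothetical constant (the field is called slab_reclaimable in cgroup v2 memory.stat); this is a sketch of the idea, not a proposed patch:

  ret.Memory.Usage = s.MemoryStats.Usage.Usage
  workingSet := ret.Memory.Usage
  // Subtract both inactive file pages and reclaimable slab, saturating at zero.
  for _, key := range []string{inactiveFileKeyName, slabReclaimableKeyName} {
    if v, ok := s.MemoryStats.Stats[key]; ok {
      if workingSet < v {
        workingSet = 0
      } else {
        workingSet -= v
      }
    }
  }
  ret.Memory.WorkingSet = workingSet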

bwplotka commented 2 years ago

Good question! I don't want to overcrowd this issue, but why don't we subtract inactive_anon as well?

Rationale: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt#:~:text=inactive_anon%09%2D%20%23%20of%20bytes%20of%20anonymous%20and%20swap%20cache%20memory%20on%20inactive%0A%09%09LRU%20list.

cyrus-mc commented 2 years ago

@bwplotka I can understand why inactive_anon isn't subtracted: most container clusters run without swap, so that memory can't be swapped out and is therefore part of the working set.

Is slab_reclaimable the same? Since it is more of a cache, it can be reclaimed by the kernel when it needs memory.

bwplotka commented 2 years ago

Yeah, agreed, something is off, but for me it's not really slab. I can reproduce this problem with a large number of open file descriptors. The WSS shows quite large memory usage:
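
As a side note, here is a rough, hypothetical sketch of one way to generate a lot of reclaimable kernel memory for testing (not code from this issue): creating and stat-ing many distinct files populates the kernel's dentry/inode caches, which show up as slab_reclaimable in memory.stat.

  // slabgrow.go: hypothetical repro helper, not part of cadvisor.
  package main

  import (
    "fmt"
    "os"
    "path/filepath"
  )

  func main() {
    dir, err := os.MkdirTemp("", "slab-repro")
    if err != nil {
      panic(err)
    }
    // Creating and stat-ing many distinct files fills the dentry and inode
    // caches; watch slab_reclaimable in the container's memory.stat grow.
    for i := 0; i < 200000; i++ {
      p := filepath.Join(dir, fmt.Sprintf("f-%d", i))
      if err := os.WriteFile(p, nil, 0o644); err != nil {
        panic(err)
      }
      if _, err := os.Stat(p); err != nil {
        panic(err)
      }
    }
    fmt.Println("done; now inspect memory.stat")
  }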

WSS: (screenshot omitted)

(file_mapped = 0)

RSS: (screenshot omitted)

Stat file:

sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20172800
file 3391561728
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30097408
active_anon 4096
inactive_file 3391561728
active_file 0
unevictable 0
slab_reclaimable 102486032
slab_unreclaimable 501088
slab 102987120
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 177528
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44

Now, what's interesting: dropping all cache pages on the host machine with sudo sysctl -w vm.drop_caches=1 brings the WSS down to almost the RSS 🙃 Which kind of tells us it's reclaimable, no?


Simply subtracting the cache from the WSS is a no-go, as the cache is extremely large, yet it is clearly affecting the WSS:


Stats after:

sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20291584
file 0
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30216192
active_anon 4096
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 1661888
slab_unreclaimable 496376
slab 2158264
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 211983
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44
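
For anyone who wants to reproduce this comparison without eyeballing the stat files, here is a self-contained sketch (not part of cadvisor; the cgroup path is a placeholder to adjust) that reads memory.current and memory.stat for a cgroup v2 group and prints the working set as computed today next to a variant that also subtracts slab_reclaimable:

  // wsscompare.go: hypothetical helper, not part of cadvisor.
  package main

  import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
  )

  // readUint parses a single-integer file such as memory.current.
  func readUint(path string) uint64 {
    b, err := os.ReadFile(path)
    if err != nil {
      panic(err)
    }
    v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
    if err != nil {
      panic(err)
    }
    return v
  }

  // readStat parses the flat "key value" lines of memory.stat.
  func readStat(path string) map[string]uint64 {
    f, err := os.Open(path)
    if err != nil {
      panic(err)
    }
    defer f.Close()
    stats := map[string]uint64{}
    sc := bufio.NewScanner(f)
    for sc.Scan() {
      fields := strings.Fields(sc.Text())
      if len(fields) != 2 {
        continue
      }
      if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
        stats[fields[0]] = v
      }
    }
    return stats
  }

  // sub subtracts b from a, saturating at zero, mirroring the handler's guard.
  func sub(a, b uint64) uint64 {
    if a < b {
      return 0
    }
    return a - b
  }

  func main() {
    cg := "/sys/fs/cgroup/system.slice/docker-<id>.scope" // placeholder path
    usage := readUint(filepath.Join(cg, "memory.current"))
    stats := readStat(filepath.Join(cg, "memory.stat"))

    wss := sub(usage, stats["inactive_file"])        // current cadvisor formula
    wssNoSlab := sub(wss, stats["slab_reclaimable"]) // variant discussed here

    fmt.Printf("usage=%d wss=%d wss_minus_slab_reclaimable=%d\n", usage, wss, wssNoSlab)
  }
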
saswatac commented 1 year ago

Why is Active(file) memory not subtracted as well? I observe on my containers that Active(file) memory stays high even after a memory-heavy job ends, and container_memory_working_set_bytes remains at a high value.

From what I understand, even Active(file) memory is reclaimable, although it is reclaimed at a lower priority than Inactive(file).

cyrus-mc commented 1 year ago

@saswatac you don't want to subtract active file, because the working set is meant to give you a metric for the memory your container actively needs. Every app needs some file cache; if you simply ignore that when assigning memory requests for your app, its performance will degrade.

astronaut0131 commented 2 months ago

(screenshot omitted)

After the pod accepted a large number of socket connections, container_memory_working_set_bytes stayed around 2 GB and never dropped, even when there were no new requests. Meanwhile, container_memory_rss was only a bit over 200 MB.

Before dropping the cache, I examined the memory stats:

# cat /sys/fs/cgroup/memory/memory.stat 
cache 0
rss 218734592
rss_huge 109051904
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 14037837
pgpgout 14021095
pgfault 14053809
pgmajfault 0
inactive_anon 2554662912
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 0
total_rss 218734592
total_rss_huge 109051904
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14037837
total_pgpgout 14021095
total_pgfault 14053809
total_pgmajfault 0
total_inactive_anon 2554662912
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0

I also checked the current memory usage:

# cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
2556129280

To free reclaimable slab objects (which include dentries and inodes), I ran:

echo 2 > /proc/sys/vm/drop_caches

After executing the drop_caches command:

# cat /sys/fs/cgroup/memory/memory.stat 
cache 1081344
rss 219250688
rss_huge 109051904
shmem 0
mapped_file 540672
dirty 0
writeback 0
swap 0
pgpgin 14038926
pgpgout 14592264
pgfault 14055426
pgmajfault 0
inactive_anon 219336704
active_anon 0
inactive_file 540672
active_file 675840
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 1081344
total_rss 219250688
total_rss_huge 109051904
total_shmem 0
total_mapped_file 540672
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14038926
total_pgpgout 14592264
total_pgfault 14055426
total_pgmajfault 0
total_inactive_anon 219336704
total_active_anon 0
total_inactive_file 540672
total_active_file 675840
total_unevictable 0

I checked the memory usage again:

# cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
221339648
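
Plugging those numbers into the working-set formula from the handler code above (usage minus inactive file): before dropping caches, the working set is 2556129280 - 0 bytes ≈ 2.4 GiB, which matches the ~2 GB plateau, even though rss is only 218734592 bytes ≈ 209 MiB. After drop_caches, usage_in_bytes falls to 221339648 bytes ≈ 211 MiB, so the reported working set drops with it.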

This issue is affecting the memory-based scaling decisions made by the Horizontal Pod Autoscaler (HPA).