cyrus-mc opened this issue 2 years ago
Good question! I don't want to overcrowd this issue, but why don't we subtract inactive_anon as well?
@bwplotka I can understand why inactive_anon isn't subtracted: most container clusters don't run with swap, so that memory can't be swapped out and is therefore part of the working set. Is slab_reclaimable the same? Since it is more of a cache, it can be reclaimed by the OS when it needs memory.
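For context, here is a minimal sketch of the working-set computation being discussed, as I read it for cgroup v2: take memory.current and subtract inactive_file from memory.stat. This is illustrative only (not the actual cAdvisor source), and the cgroup path is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readStat parses memory.stat into a map of field name -> bytes.
func readStat(path string) (map[string]uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	stats := map[string]uint64{}
	for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats, nil
}

func main() {
	cg := "/sys/fs/cgroup/system.slice/docker-<id>.scope" // placeholder path

	raw, err := os.ReadFile(cg + "/memory.current")
	if err != nil {
		panic(err)
	}
	usage, _ := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)

	stats, _ := readStat(cg + "/memory.stat")

	workingSet := usage
	if inactiveFile := stats["inactive_file"]; inactiveFile < workingSet {
		// inactive_file is the only field subtracted here;
		// inactive_anon and slab_reclaimable are not.
		workingSet -= inactiveFile
	}
	fmt.Println("working set bytes:", workingSet)
}
```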
Yea, agree, something is off, but for me, it's not really slab. I can reproduce this problem with a large number of open file descriptors. The WSS shows quite large memory usage:
WSS (file_mapped = 0): [screenshot omitted]
RSS: [screenshot omitted]
Stat file:
sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20172800
file 3391561728
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30097408
active_anon 4096
inactive_file 3391561728
active_file 0
unevictable 0
slab_reclaimable 102486032
slab_unreclaimable 501088
slab 102987120
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 177528
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44
Now, what's interesting: dropping all cache pages on the host machine using sudo sysctl -w vm.drop_caches=1
brings WSS down to almost RSS 🙃 Which kind of tells us it's reclaimable, no?
Simply subtracting the cache from WSS is a no-go, as the cache is extremely large, yet it clearly affects the WSS (see the rough arithmetic after the dump below):
Stats after:
sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20291584
file 0
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30216192
active_anon 4096
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 1661888
slab_unreclaimable 496376
slab 2158264
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 211983
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44
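To make the before/after concrete: memory.current isn't pasted above, so the sketch below approximates usage as the sum of the charged fields from the two dumps. The exact bytes don't matter; the point is that almost everything left after subtracting inactive_file before drop_caches is slab_reclaimable, and it vanishes afterwards.

```go
package main

import "fmt"

func main() {
	// Before drop_caches (values copied from the first memory.stat dump above).
	// usage is approximated as anon+file+slab+kernel_stack+pagetables+percpu+sock+vmalloc.
	beforeUsage := uint64(20172800 + 3391561728 + 102987120 + 311296 + 282624 + 504 + 4096 + 8192)
	beforeWSS := beforeUsage - 3391561728 // minus inactive_file

	// After drop_caches (values from the second dump; inactive_file is 0).
	afterUsage := uint64(20291584 + 0 + 2158264 + 311296 + 282624 + 504 + 4096 + 8192)
	afterWSS := afterUsage - 0

	fmt.Printf("approx WSS before: %d MiB (slab_reclaimable alone: %d MiB)\n", beforeWSS>>20, uint64(102486032)>>20)
	fmt.Printf("approx WSS after:  %d MiB\n", afterWSS>>20)
}
```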
Why isn't the Active(file) memory subtracted as well? I observe on my containers that the Active(file) memory stays high even after a memory-heavy job ends, and container_memory_working_set_bytes remains at a high value.
From what I understand, even the Active(file) memory is reclaimable, although at a lower priority than Inactive(file).
@saswatac you don't want to subtract active file, because the working set is meant to give you a metric for the active memory your container needs. Every app needs some file cache; if you ignore that when assigning memory requests for your app, performance will degrade.
After the pod accepted a large number of socket connections, the container_memory_working_set_bytes remained around 2 GB and never dropped, even when there were no new requests. Meanwhile, container_memory_rss was only around 200+ MB.
Before dropping the cache, I examined the memory stats:
# cat /sys/fs/cgroup/memory/memory.stat
cache 0
rss 218734592
rss_huge 109051904
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 14037837
pgpgout 14021095
pgfault 14053809
pgmajfault 0
inactive_anon 2554662912
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 0
total_rss 218734592
total_rss_huge 109051904
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14037837
total_pgpgout 14021095
total_pgfault 14053809
total_pgmajfault 0
total_inactive_anon 2554662912
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0
I also checked the current memory usage:
# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
2556129280
To free reclaimable slab objects (which include dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
After executing the drop_caches command:
# cat /sys/fs/cgroup/memory/memory.stat
cache 1081344
rss 219250688
rss_huge 109051904
shmem 0
mapped_file 540672
dirty 0
writeback 0
swap 0
pgpgin 14038926
pgpgout 14592264
pgfault 14055426
pgmajfault 0
inactive_anon 219336704
active_anon 0
inactive_file 540672
active_file 675840
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 1081344
total_rss 219250688
total_rss_huge 109051904
total_shmem 0
total_mapped_file 540672
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14038926
total_pgpgout 14592264
total_pgfault 14055426
total_pgmajfault 0
total_inactive_anon 219336704
total_active_anon 0
total_inactive_file 540672
total_active_file 675840
total_unevictable 0
I checked the memory usage again:
# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
221339648
This issue is affecting the decisions made by the Horizontal Pod Autoscaler (HPA) regarding memory.
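Plugging the numbers above into the subtraction discussed in this thread (assuming working set = memory.usage_in_bytes minus total_inactive_file on cgroup v1) shows why the value only drops after the caches are dropped:

```go
package main

import "fmt"

func main() {
	// Before drop_caches: usage_in_bytes 2556129280, total_inactive_file 0.
	fmt.Printf("working set before: %d MiB\n", (uint64(2556129280)-0)>>20)
	// After drop_caches: usage_in_bytes 221339648, total_inactive_file 540672.
	fmt.Printf("working set after:  %d MiB\n", (uint64(221339648)-540672)>>20)
}
```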
I ran into a somewhat unique situation in which a pod had very high slab memory, as high as 1.1 GB worth. In terms of anonymous and active file memory, usage was only around 25 MB. The working set calculation shows 1.1 GB because slab reclaimable memory isn't subtracted from the workingSet calculation.
The working set calculation takes MemoryStats.Usage.Usage, which is the value from memory.current (cgroup v2) or memory.usage_in_bytes (cgroup v1), and subtracts inactive_file. The memory statistics (memory.stat) contain, among other fields, slab_reclaimable, which is memory that can be reclaimed by the OS when needed. Should we be subtracting this value when calculating workingSet?
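A hedged sketch of what this comment proposes, namely also subtracting slab_reclaimable (total_slab_reclaimable on cgroup v1). The function name, signature, and numbers below are illustrative, not cAdvisor's actual API:

```go
package main

import "fmt"

// workingSetBytes is a hypothetical helper, not cAdvisor code.
func workingSetBytes(usage uint64, stat map[string]uint64) uint64 {
	ws := usage
	if v := stat["inactive_file"]; v < ws {
		ws -= v // what is subtracted today
	}
	if v := stat["slab_reclaimable"]; v < ws {
		ws -= v // the proposal: treat reclaimable slab like reclaimable page cache
	}
	return ws
}

func main() {
	// Roughly the situation described above: ~1.1 GB of reclaimable slab,
	// ~25 MB of anonymous + active file memory.
	stat := map[string]uint64{"inactive_file": 0, "slab_reclaimable": 1_100_000_000}
	fmt.Println(workingSetBytes(1_125_000_000, stat)) // ~25 MB instead of ~1.1 GB
}
```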