openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS on Linux confusingly reports its cache usage as kernel memory in use #10251

Open vegabook opened 4 years ago

vegabook commented 4 years ago

Distribution Name | Ubuntu Focal Fossa
Distribution Version | 20.04
Linux Kernel | 5.4.0
Architecture | AMD64
ZFS Version | 0.8.3-1ubuntu12
SPL Version | 0.8.3-1ubuntu12

As per this question on Ask Ubuntu, ZFS, after a bunch of big file manipulations, reports sizable memory usage for cache purposes as "green" kernel "memory in use" in htop, in this case 20GB (after a fresh boot this was 5GB):

[image: htop screenshot]

... and on smem:

tbrowne@RyVe:~$ smem -tw
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory      20762532     435044   20327488 
userspace memory            2290448     519736    1770712 
free memory                42823220   42823220          0 
----------------------------------------------------------
                           65876200   43778000   22098200 

This large memory usage on a 64GB system is not a problem because, as I understand it, it is mainly cache. Indeed, on the same system (which, as htop shows, has only 2GB of swap), I am able to allocate close to the full 64GB to this userspace Python3 script (first run pip3 install numpy):

import numpy as np
xx = np.random.rand(10000, 12500)
import sys
sys.getsizeof(xx)
# 1000000112
# that's about 1 GB
ll = []
index = 1
while True:
    print(index)
    ll.append(np.random.rand(10000, 12500))
    index = index + 1

In other words, ZFS is not really using up all that RAM, because it is freed nicely for userspace programs when needed. The problem is that this is not accurately reported by htop, the system monitor, or most other memory-usage tools. One has to go as far as parsing the output of ps aux to get a real picture of userspace usage (ps aux | awk '{print $6/1024}' | paste -s -d+ - | bc).
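The ps aux one-liner above sums the RSS column (field 6, reported in KiB) and prints the total in MiB. A rough Python equivalent is sketched below. The function names are made up for illustration, and because summing RSS double-counts pages shared between processes, the result should be read as an estimate rather than an exact figure.

```python
import subprocess


def total_rss_mib(ps_output: str) -> float:
    """Sum the RSS column (6th field, KiB) of `ps aux` output and return MiB."""
    total_kib = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the COMMAND column may contain spaces
        if len(fields) >= 6 and fields[5].isdigit():
            total_kib += int(fields[5])
    return total_kib / 1024


def current_total_rss_mib() -> float:
    """Run `ps aux` on this machine and sum the RSS of every listed process."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return total_rss_mib(result.stdout)
```

Calling current_total_rss_mib() on the affected machine gives a userspace figure to compare against the "kernel dynamic memory" line that smem reports.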

Given that ZFS is a very good experience on the new Ubuntu 20.04, which is a "mass use distribution", and that many people will use it notwithstanding the "experimental" tag, would it be worth ensuring that this cache usage is reported correctly as cache (yellow bars on htop's memory graph) or buffer (blue bars)?

Describe how to reproduce the problem

First reboot the system and note memory use on htop. Then run this python3 script, ensuring numpy and pandas are installed:

import numpy as np
import pandas as pd
pd.DataFrame(np.random.rand(10000, 10000)).to_csv("bigrand.csv")

This will create a 1.8GB CSV file full of random numbers. At a bash prompt, append it 40 times to a new file to create a 72GB file:

for i in {1..40}; do cat bigrand.csv >> biggest.csv; done

Creating a huge file in this way will cause ZFS to ramp up its RAM usage as described above, showing it as green bars / kernel memory in use. If one then runs the earlier Python script, which allocates 1GB at a time, this ZFS cache RAM gets freed in favour of the userspace script. That is consistent with what a cache would do, and since it is a cache, it should show up as such in htop and other similar tools.

The risk is that ZFS gains a reputation as a RAM hog, which is not really true, as I think I've shown here.

shodanshok commented 4 years ago

While I agree that this is confusing behavior, I don't think it is ZFS's fault here: as the ARC lives in kernel memory but is not "pagecache", the kernel itself reports a "wrong" number as allocated memory. After all, it really is allocated memory, but while "you" (the user) know that it is readily reclaimable, the kernel does not know that. The pagecache has a "special exception" due to it being tightly coupled to core kernel vm and vfs management.
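To see the gap described above on a live system, one can combine /proc/meminfo with the ARC statistics that ZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats. The following is only a sketch: the function names are invented for illustration, and it assumes that MemAvailable does not already count ARC memory (which matches the behaviour reported in this thread) and that everything above the ARC floor (c_min) can be evicted on demand, which is optimistic.

```python
from pathlib import Path

ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")  # present when the zfs module is loaded
MEMINFO = Path("/proc/meminfo")


def parse_arcstats(text: str) -> dict:
    """Parse kstat text ("name type data" rows) into {name: value}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have three fields with numeric type and data columns;
        # the two header rows do not match this shape and are skipped.
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo into {field: value}; most fields are in kB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        value = rest.split()
        if value and value[0].isdigit():
            info[name.strip()] = int(value[0])
    return info


def available_with_arc_kib(meminfo: dict, arcstats: dict) -> int:
    """MemAvailable plus the part of the ARC above c_min, in KiB (an estimate)."""
    reclaimable = max(arcstats.get("size", 0) - arcstats.get("c_min", 0), 0)
    return meminfo.get("MemAvailable", 0) + reclaimable // 1024


def read_available_with_arc_kib() -> int:
    """Read both files on this machine and return the adjusted estimate."""
    return available_with_arc_kib(
        parse_meminfo(MEMINFO.read_text()), parse_arcstats(ARCSTATS.read_text())
    )
```

On a machine with a large ARC, read_available_with_arc_kib() should come out well above the plain MemAvailable value, which is roughly the difference the monitoring tools fail to show.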

As a side note, this bad reporting has real downsides for some processes. For example, with a huge ARC and some reasonable application memory usage, the ksmtuned daemon starts to aggressively scan memory for page merging even if much memory could be reclaimed by shrinking the ARC by some amount.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

ShaheedHaque commented 3 years ago

I am a knowledgeable user of computer systems, including Linux and Ubuntu, and I did my homework before adopting ZFS (which I really like). But I'm not a ZFS expert and completely missed this "under-reporting" of memory usage on my 48GB Zen3 file and development server.

So, when my Python development tests (based on pytest and Selenium on Chrome), plus some Jellyfin streaming by my son, started failing, it took a long time and a lot of trial and error before I stumbled on arc_summary and the relevant tunable.

Now I understand that in theory, the ARC cache should yield memory to userspace, but it repeatably caused this setup to misbehave till I set the tunable. I'm writing here because, in root-causing the issue, it would have been very helpful to have had some indication from standard tools like "htop" or "mem" that memory might be an issue. This is arguably similar to #10255.

(I could file a separate issue for the actual problem, but my ability to help diagnose it may be hampered at present).

devZer0 commented 9 months ago

> As a side note, this bad reporting has real downsides on some processes. For example, having an huge ARC and some reasonable application memory usage, the ksmtuned daemon starts to aggressively scan memory for page merging even if much memory can be reclamed shrinking ARC by some amount.

@shodanshok yes, but this should be easy to address at the ksmtuned level.

if somebody would do...

https://forum.proxmox.com/threads/ksm-is-needlessly-burning-cpu.142397/

anyhow, ksmtuned seems to be a pretty dead project

https://github.com/ksmtuned/ksmtuned
https://github.com/aruhier/ksmtuned/tree/master

and it's so rudimentary.

there may be SO MUCH to optimize/enhance.

for example, to provide a better/easier interface to the end user, so anybody is able to appropriately tune it by some "real world metric" like %maxcpu and %maxcpuboost and not by totally obscure/abstract internal numbers like npages and "millisecs sleep between scans".
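As a small illustration of how those internal numbers could be turned into something more readable, the sketch below converts a pages-per-batch value and a sleep interval into an upper bound on scan throughput. This is not part of ksmtuned: the function name is hypothetical, a 4096-byte page size is assumed, and the two inputs are assumed to correspond to the kernel's /sys/kernel/mm/ksm/pages_to_scan and sleep_millisecs knobs. It says nothing about CPU percentage, which would need measuring.

```python
def ksm_scan_rate_upper_bound(
    pages_to_scan: int, sleep_millisecs: int, page_size: int = 4096
) -> float:
    """Upper bound on KSM scan throughput in MiB/s.

    Ignores the time ksmd spends scanning each batch, so the real rate is lower.
    """
    if sleep_millisecs <= 0:
        raise ValueError("sleep_millisecs must be positive for this estimate")
    batches_per_second = 1000 / sleep_millisecs
    return pages_to_scan * page_size * batches_per_second / (1024 * 1024)
```

For example, ksm_scan_rate_upper_bound(100, 20) describes 100 pages scanned every 20 ms, i.e. at most about 19.5 MiB of memory examined per second.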

i don't want to know how many kWh of energy are being wasted worldwide because ksm needlessly tries to free pages over and over again instead of stopping when there is nothing more to optimize.