openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Memory Leak (2x ARC) on Kernel 6.x (Debian12) #15637

Open gbonfiglio opened 10 months ago

gbonfiglio commented 10 months ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version Bookworm (12)
Kernel Version 6.1.0-13-amd64
Architecture x86_64
OpenZFS Version zfs-2.1.14-1~bpo12+1

Describe the problem you're observing

Since upgrading to Debian 12, a small system I run as an rsync target and rsnapshot host went from a flat 50% RAM usage (out of 2GB total) to constantly OOM'ing, even after an upgrade to 8GB. I initially raised the issue as a question in https://github.com/openzfs/zfs/discussions/14986 but had too many variables in play to precisely pin down the source of the issue (various external processes writing to it, some old zpools which used to have dedup enabled, a ton of snapshots, etc.).

I've now managed to reproduce what I think is some sort of memory leak on a simple/vanilla setup, described below, using only a single data source and rsnapshot. At this point, though, I'm at a loss and can't dig any deeper - any help would be much appreciated.

Given the (relative) simplicity of the setup I managed to replicate this on, I'm really surprised I'm the only one reporting it.

Describe how to reproduce the problem

  1. I have a sample dataset on a remote host composed as follows: 100GB in size, 100k directories, 250k files.
  2. The zpool (hosted on a 4 vCPU, 8GB RAM VM) is relatively simple (zpool create -m /mnt/tank tank /dev/sda) and the filesystem on top of it is created with zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase -o compression=on tank/secure (note both compression and encryption are on).
  3. rsnapshot is configured to back up the remote (see step 1) to /mnt/tank/secure.
  4. The source is never changed across iterations, so the rsync step of rsnapshot doesn't actually sync anything.
  5. I run 10 iterations back to back and see ARC filling up to its target 3/4GB size, as one would expect. Additionally, every iteration leaves about 600MB of "abandoned" used RAM (see the sketch below).
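
A sketch of the loop used to drive the iterations (assuming an rsnapshot interval named "hourly", matching the hourly.N directories shown further down; adjust to your own rsnapshot config):

# run 10 back-to-back rsnapshot iterations and record memory and ARC size after each
for i in $(seq 1 10); do
    rsnapshot hourly
    free -m
    grep "^size" /proc/spl/kstat/zfs/arcstats
done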

Here is what I'm left with:

               total        used        free      shared  buff/cache   available
Mem:            7951        7140         462           0         625         810
Swap:              0           0           0

 Active / Total Objects (% used)    : 13869528 / 15195981 (91.3%)
 Active / Total Slabs (% used)      : 629764 / 629764 (100.0%)
 Active / Total Caches (% used)     : 144 / 207 (69.6%)
 Active / Total Size (% used)       : 5961935.63K / 6291859.64K (94.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.41K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
1259468 1259468 100%    1.12K  44981       28   1439392K zfs_znode_cache        
1262688 1262688 100%    0.97K  78918       16   1262688K dnode_t                
1614144 1358737  84%    0.50K 100884       16    807072K kmalloc-512            
 43460  43460 100%   16.00K  21730        2    695360K zio_buf_comb_16384     
1387428 1351895  97%    0.38K  66068       21    528544K dmu_buf_impl_t         
2478084 2024503  81%    0.19K 118004       21    472016K dentry                 
 39580  39580 100%    8.00K   9895        4    316640K kmalloc-8k             
1259472 1259472 100%    0.24K  78717       16    314868K sa_cache   

The same test, on the same system, with ext4 over LUKS as the destination works as intended and only results in 600MB of RAM used in total.

Include any warning/errors/backtraces from the system logs

Nothing relevant I can spot.

rincebrain commented 10 months ago

arc_summary output might be useful for seeing where ZFS thinks it's spending memory versus its expectations.
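
For example, something like this (a sketch; arc_summary ships with the ZFS userland tools, and the raw counters live under /proc):

arc_summary > arc_summary.txt                          # full report, easy to attach to the issue
grep -E "^(size|c_max)" /proc/spl/kstat/zfs/arcstats   # current ARC size vs. its max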

It might also be useful to check whether 2.1.14 has the same problem in the same config on Debian 11, to see whether this is a change in the OpenZFS release, the kernel, both, or neither.

gbonfiglio commented 10 months ago

Thanks, good call on arc_summary - it's here (this is on a fresh boot, after the 10-iteration repro described above).

ARC never seems to go beyond its (auto-)configured 4GB, so it's not really my concern. The problem is the other 4GB of RAM used when nothing is running: the root cause might not be ZFS itself but some integration with the kernel, given the problem started when I upgraded from Debian 11 to Debian 12 with the ZFS version remaining the same (since I went from 11-backports to 12).

I'll now do the following and report back:

  1. Downgrade ZFS on my Debian 12 to 2.1.11-1 (so I can have the same version on Debian 11)
  2. Run the test again and validate we have a repro (we will, since my initial report was on that very version)
  3. Install a fresh Debian 11 VM, with ZFS 2.1.11-1 from backports
  4. Move the zpool over
  5. Run the test. I should NOT have a repro (since the same setup was working for months before the upgrade)

rincebrain commented 10 months ago

I bet at least a quarter that "echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable" before reproducing it will make it not reproduce.

gbonfiglio commented 10 months ago

Do you mean echo 1? zfs_prefetch_disable is already 0 on both my Debian 11 and Debian 12.

rincebrain commented 10 months ago

Yes, sorry.
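
(For the record, a sketch of the intended command; this is a runtime module parameter, so it reverts to its default after a reboot or module reload:)

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable   # 1 = prefetch disabled
cat /sys/module/zfs/parameters/zfs_prefetch_disable                 # confirm the new value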

gbonfiglio commented 10 months ago

Ok, interestingly enough, I seem to also have a repro with Debian 11, kernel 5.10.0-26 and ZFS 2.0.3-9 - which is where I started from before the Debian upgrade on the original system, and where I definitely didn't have such symptoms.

OS         Kernel      ZFS                  Repro   EXT4 Works
Debian 12  6.1.0-13    2.1.14-1 (debug)     Yes     Yes
Debian 12  6.1.0-13    2.1.11-1 (debug)     Yes     Yes
Debian 11  5.10.0-26   2.1.11-1 (debug)     Yes     Yes
Debian 11  5.10.0-26   2.0.3-9 (!) (debug)  Yes     Yes

gbonfiglio commented 10 months ago

Same results with prefetch disabled:

root@storebox-temp-d11:~# cat /sys/module/zfs/parameters/zfs_prefetch_disable
1
root@storebox-temp-d11:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7953        6903         488           0         562         735
Swap:              0           0           0

ARC here.

rincebrain commented 10 months ago

Ah well. Someone had an issue a while ago where what seemed to be happening was that things got locked into RAM the first time they were read and triggered prefetch, until that first reader expired; so having a bunch of fast and slow readers intermingled could cause it to blow past your ARC limits, similar to this.

Hm.

filip-paczynski commented 10 months ago

Perhaps this is causing problems:

zfs_metaslab_mem_limit                                        75

Up to 75% of RAM can be used for metaslabs, with a further 50% reserved for ARC; this alone leaves you with -25% free memory, if I understand correctly.
On 2.1.x zfs_metaslab_mem_limit defaults to 25; not sure what it was on 2.0.x.

rincebrain commented 10 months ago

I believe anything covered by zfs_metaslab_mem_limit would count against zfs_btree_leaf_cache, which isn't visible in the top-N caches list there, so it's probably not the issue at hand?

Also, the default changed from 75 to 25 in 86f5e0bbce, which went into 2.1.x as 5d7f47d8280d7eb6a61f9b9721c8d30e84c6f683.
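
(To rule it out, the current value and the corresponding slab can be checked with something like this sketch; the exact slab name may vary between versions:)

cat /sys/module/zfs/parameters/zfs_metaslab_mem_limit   # percentage of RAM metaslabs may use
sudo grep btree /proc/slabinfo                          # zfs_btree_leaf usage, if it shows up at all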

65a commented 9 months ago

I might also be encountering this. I have several nodes which are not primarily storage-oriented but run ZFS filesystems. Starting somewhat recently, all of them have the ARC consuming a vast amount of memory, OOMing, and refusing to release ARC under memory pressure. I've had to manually specify zfs_arc_max for perhaps the first time; they are running zfs-2.2.99-270_gdbda45160f.

gbonfiglio commented 1 month ago

I have made some progress in tracking this down, but haven't gotten to a resolution yet.

First of all, an update on versions:

Bit of a story here: after days of testing, the only way I could consistently replicate the issue was running rsnapshot to back up a source with a ton of files into a ZFS filesystem (any ZFS filesystem - I've fiddled with compression, encryption, dedup, primarycache and secondarycache, and the result is consistent), so I went on to decompose rsnapshot.

This is the source (roughly 700k files and directories combined):

root@test:/opt# df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
[...]
/dev/nvme1n1   9830400 731024 9099376    8% /opt/data

rsnapshot is a relatively simple piece of software which at every invocation (from cron):

  1. Removes the oldest backup copy (if it's beyond retention point): /bin/rm -rf /mnt/tank/secure/rsnapshot/hourly.71/
  2. Moves ("renames") all the previous snapshots up by 1:
    mv /mnt/tank/secure/rsnapshot/hourly.70/ /mnt/tank/secure/rsnapshot/hourly.71/
    [...]
    mv /mnt/tank/secure/rsnapshot/hourly.2/ /mnt/tank/secure/rsnapshot/hourly.3/
    mv /mnt/tank/secure/rsnapshot/hourly.1/ /mnt/tank/secure/rsnapshot/hourly.2/
  3. Duplicates the last snapshot to archive it up: /bin/cp -al /mnt/tank/secure/rsnapshot/hourly.0 /mnt/tank/secure/rsnapshot/hourly.1
  4. Syncs the remote into the latest snapshot: /usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh test:/opt/data/ /mnt/tank/secure/rsnapshot/hourly.0/test/

The culprit seems to be step 3. Memory is allocated during this step and never released. It's not ARC, and the issue only happens if the destination filesystem is ZFS (an ext4 destination is fine).
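
As a minimal sketch, the step-3 behaviour can be reproduced outside rsnapshot with a loop like this (paths are illustrative and match the layout above; hourly.0 is the ~700k-inode tree):

# hard-link-copy the same tree repeatedly and watch free memory vs. ARC size
for i in 1 2 3 4; do
    /bin/cp -al /mnt/tank/secure/rsnapshot/hourly.0 /mnt/tank/secure/rsnapshot/copy.$i
    grep MemFree /proc/meminfo
    grep "^size" /proc/spl/kstat/zfs/arcstats
done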

Here's what happens in just a few iterations (after a fresh boot, with nothing else running on the VM).

At t0, ARC looks like this, memory looks like this:

               total        used        free      shared  buff/cache   available
Mem:            7948         408        7612           2         123        7539
Swap:              0           0           0

A more compact view:

root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:         7788212 kB
size                            4    26283736

First iteration of the cp:

root@storebox:/mnt/noec/main# /bin/cp -al /mnt/noec/main/test /mnt/noec/main/test-d1

And we end up like this:

root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:         1878148 kB
size                            4    3124229392

Note ARC went up from 26MB to about 3GB, but free memory went down from 7.8GB to 1.9GB (a 5.9GB drop). This means roughly 2.9GB have just been blackholed somewhere.

After another three iterations (copying the same source into test-d2, test-d3 and test-d4) we end up with 3.6GB of ARC and 545MB of free memory - about 4GB completely unaccounted for:

root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:          544376 kB
size                            4    3596610672
root@storebox:/mnt/noec/main# free -m
               total        used        free      shared  buff/cache   available
Mem:            7948        7208         531           2         481         739
Swap:              0           0           0

ARC ends up like this.

Any suggestions here would be appreciated. I've tracked this as deep as my skills allow at this point.

gbonfiglio commented 1 month ago

slabtop just in case:

root@storebox:~# slabtop -o --sort=c
 Active / Total Objects (% used)    : 14832319 / 15225297 (97.4%)
 Active / Total Slabs (% used)      : 656879 / 656879 (100.0%)
 Active / Total Caches (% used)     : 151 / 211 (71.6%)
 Active / Total Size (% used)       : 6751756.46K / 6849107.28K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.45K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1364020 1364020 100%    1.11K  48715       28   1558880K zfs_znode_cache
1524032 1518497  99%    0.96K  95252       16   1524032K dnode_t
1786656 1678433  93%    0.50K 111666       16    893328K kmalloc-512
 48530  48340  99%   16.00K  24265        2    776480K zio_buf_comb_16384
1656560 1652037  99%    0.38K  82828       20    662624K dmu_buf_impl_t
 48528  48372  99%    8.00K  12132        4    388224K kmalloc-8k
1864086 1864086 100%    0.19K  88766       21    355064K dentry
1364048 1364048 100%    0.24K  85253       16    341012K sa_cache
2087360 1898182  90%    0.06K  32615       64    130460K kmalloc-64
1382182 1382182 100%    0.05K  18934       73     75736K ftrace_event_field
774774 774432  99%    0.09K  18447       42     73788K kmalloc-96
121888  98897  81%    0.50K   7618       16     60944K zio_buf_comb_512
207706 207706 100%    0.23K  12218       17     48872K arc_buf_hdr_t_full
 92379  92331  99%    0.19K   4399       21     17596K kmalloc-192
146133 134363  91%    0.10K   3747       39     14988K abd_t
447232 403571  90%    0.03K   3494      128     13976K kmalloc-32
  7920   7861  99%    1.00K    495       16      7920K kmalloc-1k
gbonfiglio commented 1 month ago

I'm ... officially lost. I tried to cap ARC at 1GB out of curiosity, to check whether it resulted in a more clear-cut slabtop - my expectation was to end up, at the fourth run of the test, with about 3.5GB of free memory (the 0.5GB previously left + the 3GB of saved ARC).
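
(The cap was applied along these lines - a sketch, value in bytes; the runtime setting doesn't survive a reboot:)

echo 1073741824 | sudo tee /sys/module/zfs/parameters/zfs_arc_max   # 1 GiB ARC cap
# persistent variant, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_max=1073741824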

Instead, I end up with 5.4GB free. Used memory seems to settle at around 2x ARC.

root@storebox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7948        2554        5416           2         223        5393
Swap:              0           0           0

slabtop:

 Active / Total Objects (% used)    : 3995106 / 4648353 (85.9%)
 Active / Total Slabs (% used)      : 202055 / 202055 (100.0%)
 Active / Total Caches (% used)     : 149 / 211 (70.6%)
 Active / Total Size (% used)       : 1940114.44K / 2091044.52K (92.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.45K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
408324 408324 100%    1.11K  14583       28    466656K zfs_znode_cache
442528 434514  98%    0.96K  27658       16    442528K dnode_t
597040 444457  74%    0.50K  37315       16    298520K kmalloc-512
 14234  13858  97%   16.00K   7117        2    227744K zio_buf_comb_16384
482840 452033  93%    0.38K  24142       20    193136K dmu_buf_impl_t
643209 643209 100%    0.19K  30629       21    122516K dentry
 14032  13785  98%    8.00K   3508        4    112256K kmalloc-8k
408528 408528 100%    0.24K  25533       16    102132K sa_cache
607040 285961  47%    0.06K   9485       64     37940K kmalloc-64
425663 425663 100%    0.05K   5831       73     23324K ftrace_event_field
 36064   4586  12%    0.50K   2254       16     18032K zio_buf_comb_512
 48586  26537  54%    0.23K   2858       17     11432K arc_buf_hdr_t_full
 98532  96863  98%    0.09K   2346       42      9384K kmalloc-96
  7824   7820  99%    1.00K    489       16      7824K kmalloc-1k
 34398  34287  99%    0.19K   1638       21      6552K kmalloc-192
167936 113447  67%    0.03K   1312      128      5248K kmalloc-32
  1208   1145  94%    4.00K    151        8      4832K kmalloc-4k
 30300  30300 100%    0.13K   1010       30      4040K kernfs_node_cache

And arc_summary:

 ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   99.6 %  1019.9 MiB
        Target size (adaptive):                       100.0 %    1.0 GiB
        Min size (hard limit):                         24.3 %  248.4 MiB
        Max size (high water):                            4:1    1.0 GiB
        Anonymous data size:                            0.0 %    0 Bytes
        Anonymous metadata size:                        0.0 %    0 Bytes

If I up ARC to 2GB, memory usage after four rounds settles at 4.5GB.

In summary, my problem seems to be that ZFS is using exactly twice whatever ARC's max limit is.

robn commented 1 month ago

Just to be clear: ARC memory is not the only memory used by OpenZFS. You should not expect that all memory used by OpenZFS will be under the ARC limit you set. That said, you are right that ARC should certainly be dominant.

Given the top two slabs by some distance are zfs_znode_cache and dnode_t, I expect you're seeing #15511. This is fixed on master, but not on 2.2.x.

rincebrain commented 1 month ago

I wonder if a trivial counter to include all the non-ARC allocations ZFS makes on Linux might be a useful thing to expose.

gbonfiglio commented 1 month ago

@robn thanks. Does it make sense to you that this only started becoming an issue with kernel 6? This VM went from a constant 1.2/1.3GB used out of 2GB to constant OOMs with the Debian upgrade. I was already running OpenZFS from backports, so the only major version that changed with the upgrade was the kernel's.

amotin commented 1 month ago

@gbonfiglio I haven't looked too deeply into when exactly it happened, but we've noticed that the introduction of the MGLRU eviction mechanism in the kernel changed behavior a lot, though there could be other changes too. Combined with ZFS not reacting properly to the kernel's shrinker requests, we found a number of OOM reports ourselves. Recently I've made several patches to ZFS master to make it more reactive, and we also set zfs_arc_shrinker_limit to 0, which together, I hope, should result in better behavior. I now sometimes see ARC shrinking a bit too aggressively on kernel requests, which I need to investigate more deeply, but it should still be better than OOMs.
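
(A sketch of that tunable change, for anyone who wants to try it; like the other module parameters it can also be made persistent via modprobe.d:)

echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit   # 0 = no per-call cap on shrinker reclaim
# persistent variant (sketch):
#   echo "options zfs zfs_arc_shrinker_limit=0" | sudo tee /etc/modprobe.d/zfs-shrinker.conf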

gbonfiglio commented 1 month ago

I've just tried building packages from master again, and this time it seems to solve my issue (in December it didn't). Thank you!

After 8 iterations of cp I end up like this:

root@zfs-lab:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7939        3767         384           2        4063        4172
Swap:              0           0           0

(note the 4GB of freeable buff/cache)

Will the bulk of those changes ever make it into 2.2, and if so, do you have a rough idea of when? I know you're all really confident in master, but one of my two affected systems holds some quite important data and I'd rather run something more mainstream, such as the Debian backports.