gbonfiglio opened 10 months ago
`arc_summary` output might be useful for seeing where ZFS thinks it's spending memory versus its expectations.
It might also be useful to check whether 2.1.14 has the same problem in the same config on Debian 11, to tell whether it's a change in the OpenZFS release, the kernel, both, or neither.
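(For reference, a quick way to snapshot the relevant numbers before and after a run so the two can be diffed afterwards; the file names are arbitrary.)

```bash
# Capture ARC and general memory state before and after a repro run.
arc_summary > arc_before.txt
grep -E 'MemFree|MemAvailable|Slab' /proc/meminfo > meminfo_before.txt
# ... run the repro ...
arc_summary > arc_after.txt
grep -E 'MemFree|MemAvailable|Slab' /proc/meminfo > meminfo_after.txt
diff arc_before.txt arc_after.txt
```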
Thanks, good call on arc_summary - it's here (this is on a fresh boot after my 10 iterations repro described above).
ARC doesn't seem to ever go beyond the (auto) configured 4GB, so it's not really my concern. The problem is the other 4GB of used RAM when nothing is running: the root cause might not be ZFS itself but some integration with the kernel, given the problem started when I upgraded from Debian 11 to Debian 12 with the ZFS version remaining the same (since I went from 11-backports to 12).
I'll now do the following and report back:
1 - Downgrade ZFS on my Debian 12 to 2.1.11-1 (so I can have the same version on Debian 11)
2 - Run the test again and validate we have a repro (we will, since my initial report was on that very version)
3 - Install a fresh Debian 11 VM, with ZFS 2.1.11-1 from backports
4 - Move the zpool over (export/import sketch below)
5 - Run the test. I should NOT have a repro (since the same setup was working for months before the upgrade)
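(A sketch of steps 3 and 4; the pool name `tank` is taken from the setup described later in the issue, and the package names are the standard Debian ones - pin to 2.1.11-1 as needed.)

```bash
# On the Debian 12 box: make sure nothing is using the pool, then export it.
zpool export tank

# On the fresh Debian 11 VM: install ZFS from bullseye-backports,
# then import the pool and check its health.
apt install -t bullseye-backports zfsutils-linux zfs-dkms
zpool import tank
zpool status tank
```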
I bet at least a quarter that "echo 0 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable" before reproducing it will make it not reproduce.
Do you mean `echo 1`? `zfs_prefetch_disable` is already 0 both on my Debian 11 and Debian 12.
Yes, sorry.
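(For completeness, the toggle under discussion; it takes effect immediately and reverts on reboot unless made persistent.)

```bash
# Disable prefetch at runtime (1 = disabled, 0 = enabled, which is the default).
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_prefetch_disable

# Verify.
cat /sys/module/zfs/parameters/zfs_prefetch_disable

# Optionally make it persistent across reboots (file name is the conventional one).
echo "options zfs zfs_prefetch_disable=1" | sudo tee -a /etc/modprobe.d/zfs.conf
```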
Ok, interestingly enough I seem to also have a repro with Debian 11, kernel 5.10.0-26 and ZFS 2.0.3-9, which is where I started from pre-Debian-upgrade on the original system and definitely didn't have such symptoms.
| OS | Kernel | ZFS | Repro | EXT4 Works |
|---|---|---|---|---|
| Debian 12 | 6.1.0-13 | 2.1.14-1 (debug) | Yes | Yes |
| Debian 12 | 6.1.0-13 | 2.1.11-1 (debug) | Yes | Yes |
| Debian 11 | 5.10.0-26 | 2.1.11-1 (debug) | Yes | Yes |
| Debian 11 | 5.10.0-26 | 2.0.3-9 (!) (debug) | Yes | Yes |
Same results with disabled prefetch:
```
root@storebox-temp-d11:~# cat /sys/module/zfs/parameters/zfs_prefetch_disable
1
root@storebox-temp-d11:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7953        6903         488           0         562         735
Swap:              0           0           0
```
ARC here.
Ah well. Someone had an issue a while ago where things seemed to get locked into RAM the first time they were read and triggered prefetch, until that first reader expired, so having a bunch of fast and slow readers intermingling could cause it to blow past your ARC limits in a way similar to this.
Hm.
Perhaps this is causing problems:
`zfs_metaslab_mem_limit` = 75
Up to 75% of RAM can be used for metaslabs, with a further 50% reserved for ARC; this alone leaves you with -25% free memory, if I understand correctly.
On 2.1.x `zfs_metaslab_mem_limit` defaults to 25, not sure what it was on 2.0.x.
I believe anything covered by `zfs_metaslab_mem_limit` would count against `zfs_btree_leaf_cache`, which isn't visible in the top N caches list there, so probably not the issue at hand?
Also, the default changed from 75 to 25 in 86f5e0bbce, which went into 2.1.x as 5d7f47d8280d7eb6a61f9b9721c8d30e84c6f683.
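(A quick way to check both values on a running system; depending on how the cache is backed it may show up in `/proc/slabinfo` or in `/proc/spl/kmem/slab`.)

```bash
# Current metaslab memory limit, as a percentage of RAM.
cat /sys/module/zfs/parameters/zfs_metaslab_mem_limit

# Look for the btree leaf cache it would be charged against.
sudo grep zfs_btree_leaf /proc/slabinfo /proc/spl/kmem/slab 2>/dev/null
```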
I might also be encountering this. I have several nodes which are not primarily storage oriented, but run ZFS filesystems. Starting somewhat recently, all of them have the ARC consuming a vast amount of memory, OOMing, and refusing to release ARC under memory pressure. I've had to manually specify `zfs_arc_max` for perhaps the first time; they are running zfs-2.2.99-270_gdbda45160f.
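(For anyone working around it the same way, a sketch of capping the ARC; the 4 GiB value is only an example.)

```bash
# Cap the ARC at runtime (value in bytes; 4 GiB here is just an example).
echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Make it persistent across reboots and module reloads.
echo "options zfs zfs_arc_max=4294967296" | sudo tee -a /etc/modprobe.d/zfs.conf
sudo update-initramfs -u    # if the zfs module is loaded from the initramfs
```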
I have made some progress in tracking this down, but not gotten to a resolution yet.
First of all, an update on versions:
- OS: Debian 12
- Kernel: 6.9.7+bpo-amd64 (but stock 6.1.0-23-amd64 is still affected too)
- ZFS: 2.2.4-1~bpo12+1
Bit of a story here: after days of testing, the only way I could consistently replicate the issue was running rsnapshot to back up a source with a ton of files into a ZFS filesystem (any ZFS filesystem - I have fiddled with `compression`, `encryption`, `dedup`, `primarycache`, `secondarycache` and the result is consistent), so I went on with decomposing rsnapshot.
This is the source (roughly 700k files and directories combined):
```
root@test:/opt# df -i
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
[...]
/dev/nvme1n1    9830400 731024 9099376    8% /opt/data
```
rsnapshot is a relatively simple piece of software which, at every invocation (from cron), runs:
```
/bin/rm -rf /mnt/tank/secure/rsnapshot/hourly.71/
mv /mnt/tank/secure/rsnapshot/hourly.70/ /mnt/tank/secure/rsnapshot/hourly.71/
[...]
mv /mnt/tank/secure/rsnapshot/hourly.2/ /mnt/tank/secure/rsnapshot/hourly.3/
mv /mnt/tank/secure/rsnapshot/hourly.1/ /mnt/tank/secure/rsnapshot/hourly.2/
/bin/cp -al /mnt/tank/secure/rsnapshot/hourly.0 /mnt/tank/secure/rsnapshot/hourly.1
/usr/bin/rsync -a --delete --numeric-ids --relative --delete-excluded --rsh=/usr/bin/ssh test:/opt/data/ /mnt/tank/secure/rsnapshot/hourly.0/test/
```
The culprit seems to be the `cp -al` step. Memory is allocated during this step and never released. It's not ARC, and the issue only happens if the destination filesystem is ZFS (ext4 is fine).
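(A minimal loop, just to make the repro below mechanical - the paths and iteration count are illustrative and match the layout used in the output that follows.)

```bash
# Repeat the hard-link copy and record free memory and ARC size after each pass.
for i in 1 2 3 4; do
    /bin/cp -al /mnt/noec/main/test "/mnt/noec/main/test-d${i}"
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    arc_bytes=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
    echo "iteration ${i}: MemFree=${free_kb} kB  ARC=${arc_bytes} bytes"
done
```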
Here's what happens in just three iterations (after a fresh boot, nothing else running on the VM).
At t0, ARC looks like this, memory looks like this:
```
               total        used        free      shared  buff/cache   available
Mem:            7948         408        7612           2         123        7539
Swap:              0           0           0
```
A more compact view:
```
root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:         7788212 kB
size                            4    26283736
```
First iteration of the `cp`:
```
root@storebox:/mnt/noec/main# /bin/cp -al /mnt/noec/main/test /mnt/noec/main/test-d1
```
And we end up like this:
```
root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:         1878148 kB
size                            4    3124229392
```
Note ARC went up from 26MB to about 3GB, but free memory went down from 7.8GB to 1.9GB. This means 2.9GB have just been blackholed somewhere.
After another three iterations (copying the same source in `test-d2`, `test-d3`, `test-d4`) we end up with 3.6GB of ARC and 545MB of free memory. About 4GB completely unaccounted for:
```
root@storebox:/mnt/noec/main# grep MemFree /proc/meminfo && grep "^size" /proc/spl/kstat/zfs/arcstats
MemFree:          544376 kB
size                            4    3596610672
root@storebox:/mnt/noec/main# free -m
               total        used        free      shared  buff/cache   available
Mem:            7948        7208         531           2         481         739
Swap:              0           0           0
```
ARC ends up like this.
Any suggestion here would be appreciated. I've tracked this as deep as my skills allow for at this point.
slabtop just in case:
```
root@storebox:~# slabtop -o --sort=c
Active / Total Objects (% used) : 14832319 / 15225297 (97.4%)
Active / Total Slabs (% used) : 656879 / 656879 (100.0%)
Active / Total Caches (% used) : 151 / 211 (71.6%)
Active / Total Size (% used) : 6751756.46K / 6849107.28K (98.6%)
Minimum / Average / Maximum Object : 0.01K / 0.45K / 16.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1364020 1364020 100% 1.11K 48715 28 1558880K zfs_znode_cache
1524032 1518497 99% 0.96K 95252 16 1524032K dnode_t
1786656 1678433 93% 0.50K 111666 16 893328K kmalloc-512
48530 48340 99% 16.00K 24265 2 776480K zio_buf_comb_16384
1656560 1652037 99% 0.38K 82828 20 662624K dmu_buf_impl_t
48528 48372 99% 8.00K 12132 4 388224K kmalloc-8k
1864086 1864086 100% 0.19K 88766 21 355064K dentry
1364048 1364048 100% 0.24K 85253 16 341012K sa_cache
2087360 1898182 90% 0.06K 32615 64 130460K kmalloc-64
1382182 1382182 100% 0.05K 18934 73 75736K ftrace_event_field
774774 774432 99% 0.09K 18447 42 73788K kmalloc-96
121888 98897 81% 0.50K 7618 16 60944K zio_buf_comb_512
207706 207706 100% 0.23K 12218 17 48872K arc_buf_hdr_t_full
92379 92331 99% 0.19K 4399 21 17596K kmalloc-192
146133 134363 91% 0.10K 3747 39 14988K abd_t
447232 403571 90% 0.03K 3494 128 13976K kmalloc-32
7920 7861 99% 1.00K 495 16 7920K kmalloc-1k
```
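(To put a single number on how much of that sits in ZFS-related slabs, the per-cache sizes can be summed straight from `/proc/slabinfo`; the name filter below is just a rough selection.)

```bash
# Approximate memory held by ZFS-related slab caches, in MiB.
# /proc/slabinfo columns: name, active_objs, num_objs, objsize, objperslab, pagesperslab, ...
sudo awk '/znode|dnode|dmu_buf|zio_buf|sa_cache|arc_buf|abd_t/ {
    mib += $3 * $4 / 1048576     # num_objs * objsize
} END { printf "%.0f MiB\n", mib }' /proc/slabinfo
```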
I'm ... officially lost. I tried to cap ARC to 1GB out of curiosity, to check if it resulted in a more clear-cut slabtop - my expectation was to end up, at the fourth run of the test, with about 3.5GB of free memory (the 0.5GB previously left + the 3GB of saved ARC).
Instead, I end up with 5.4GB free. Used memory seems to settle at around 2x ARC.
```
root@storebox:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7948        2554        5416           2         223        5393
Swap:              0           0           0
```
slabtop:
```
Active / Total Objects (% used) : 3995106 / 4648353 (85.9%)
Active / Total Slabs (% used) : 202055 / 202055 (100.0%)
Active / Total Caches (% used) : 149 / 211 (70.6%)
Active / Total Size (% used) : 1940114.44K / 2091044.52K (92.8%)
Minimum / Average / Maximum Object : 0.01K / 0.45K / 16.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
408324 408324 100% 1.11K 14583 28 466656K zfs_znode_cache
442528 434514 98% 0.96K 27658 16 442528K dnode_t
597040 444457 74% 0.50K 37315 16 298520K kmalloc-512
14234 13858 97% 16.00K 7117 2 227744K zio_buf_comb_16384
482840 452033 93% 0.38K 24142 20 193136K dmu_buf_impl_t
643209 643209 100% 0.19K 30629 21 122516K dentry
14032 13785 98% 8.00K 3508 4 112256K kmalloc-8k
408528 408528 100% 0.24K 25533 16 102132K sa_cache
607040 285961 47% 0.06K 9485 64 37940K kmalloc-64
425663 425663 100% 0.05K 5831 73 23324K ftrace_event_field
36064 4586 12% 0.50K 2254 16 18032K zio_buf_comb_512
48586 26537 54% 0.23K 2858 17 11432K arc_buf_hdr_t_full
98532 96863 98% 0.09K 2346 42 9384K kmalloc-96
7824 7820 99% 1.00K 489 16 7824K kmalloc-1k
34398 34287 99% 0.19K 1638 21 6552K kmalloc-192
167936 113447 67% 0.03K 1312 128 5248K kmalloc-32
1208 1145 94% 4.00K 151 8 4832K kmalloc-4k
30300 30300 100% 0.13K 1010 30 4040K kernfs_node_cache
```
And arc_summary:
```
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 99.6 % 1019.9 MiB
Target size (adaptive): 100.0 % 1.0 GiB
Min size (hard limit): 24.3 % 248.4 MiB
Max size (high water): 4:1 1.0 GiB
Anonymous data size: 0.0 % 0 Bytes
Anonymous metadata size: 0.0 % 0 Bytes
```
If I up ARC to 2GB, memory usage after four rounds settles at 4.5GB.
In summary, my problem seems to be that ZFS is using exactly twice whatever ARC's max limit is.
Just to be clear: ARC memory is not the only memory used by OpenZFS. You should not expect that all memory used by OpenZFS will be under the ARC limit you set. That said, you are right that ARC should certainly be dominant.
Given the top two slabs by some distance are `zfs_znode_cache` and `dnode_t`, I expect you're seeing #15511. This is fixed on master, but not on 2.2.x.
I wonder if a trivial counter to include all the non-ARC allocations ZFS makes on Linux might be a useful thing to expose.
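(Not from the thread itself, but one way to separate reclaimable cache from genuinely stuck memory is to ask the kernel to drop reclaimable slab objects and see how much comes back.)

```bash
# Ask the kernel to reclaim dentries, inodes and other reclaimable slab objects,
# then compare free memory before and after. Whatever does NOT come back is the
# interesting part.
free -m
sync
echo 2 | sudo tee /proc/sys/vm/drop_caches    # 2 = reclaimable slab objects
free -m
```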
@robn thanks. Does it make sense to you that this started becoming an issue only with kernel 6? This VM went from a constant 1.2/1.3GB used out of 2GB to constant OOMs with the Debian upgrade. I was already running OpenZFS from backports, so the only major version that changed with the upgrade is the kernel's.
@gbonfiglio I haven't looked too deeply into when exactly it happened, but we've noticed that the introduction of the MGLRU eviction mechanism in the kernel changed behavior a lot, though there could be other changes too. Combined with ZFS not properly reacting to the kernel's shrinker requests, we found a number of OOM reports ourselves. Recently I've made several patches to ZFS master to make it more reactive, plus we set `zfs_arc_shrinker_limit` to 0, which together, I hope, should result in better behavior. Though now I sometimes see ARC shrinking a bit too aggressively on kernel requests, which I need to investigate deeper, but it should be better than OOMs.
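(For reference, both knobs mentioned above can be inspected and changed at runtime; the MGLRU path only exists on kernels built with it, e.g. the 6.1+ kernels in Debian 12.)

```bash
# Check whether multi-gen LRU is enabled (bitmask; 0 means disabled).
cat /sys/kernel/mm/lru_gen/enabled

# Inspect and lift the ARC shrinker limit (0 = no limit, i.e. fully honour
# the kernel's reclaim requests).
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit
```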
I've just tried building packages from master again, and this time it seems to solve my issue (in December it didn't). Thank you!
After 8 iterations of `cp` I end up like this:
```
root@zfs-lab:~# free -m
               total        used        free      shared  buff/cache   available
Mem:            7939        3767         384           2        4063        4172
Swap:              0           0           0
```
(note the 4GB of freeable buff/cache)
Will the bulk of those changes ever make it to 2.2, and if so, do you have an idea of roughly when? I know you folks are really confident in master, but one of my two affected systems holds some quite important data and I'd rather run something more mainstream such as deb backports.
System information
Describe the problem you're observing
Since upgrading to Debian 12, a small system I run as an rsync target and rsnapshot host went from a flat 50% RAM usage (out of 2GB total) to constantly OOM'ing, even after an upgrade to 8GB. I initially raised the issue as a question in https://github.com/openzfs/zfs/discussions/14986 but had too many variables in place to precisely pin the source of the issue (various external processes writing on it, some old zpools which used to have dedup enabled, a ton of snapshots, etc).
I've now managed to reproduce what I think is some sort of memory leak on a simple/vanilla setup, described below, by only using a single data source and rsnapshot. At this point though I'm at a loss and can't go any deeper - any help would be much appreciated.
Due to the (relative) simplicity of the setup I managed to replicate this on, I'm really surprised it's only me reporting.
Describe how to reproduce the problem
1 - I have a sample dataset composed as follows on a remote host: 100 GB in size, 100k directories, 250k files
2 - The zpool (hosted on a 4 vCPU, 8GB RAM VM) is relatively simple (`zpool create -m /mnt/tank tank /dev/sda`) and the FS on top of it is `zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase -o compression=on tank/secure` (note both compression and encryption are on)
3 - rsnapshot is configured to back up the remote (see "1") to `/mnt/tank/secure`
4 - The source is never changed across iterations, so the rsync step of rsnapshot doesn't actually sync anything
5 - I run 10 iterations back to back, and see ARC filling up to its target 3/4GB size, as one would expect. Additionally, every iteration leaves about 600MB of "abandoned" used RAM

Here is what I'm left with:
The same test, on the same system, with ext4 over LUKS as the destination works as intended and only results in 600MB of RAM used in total.
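(If anyone wants to try this without a similar source dataset, a synthetic tree roughly matching step 1 can be generated with something along these lines - the counts, file size and path are illustrative, and nowhere near the 100 GB of the original source unless the files are made larger.)

```bash
# Build ~100k directories with a few small files each (~300k files total).
# Slow but simple; adjust counts and file size to taste.
mkdir -p /opt/data
for d in $(seq 1 100000); do
    dir="/opt/data/dir$((d / 1000))/sub${d}"
    mkdir -p "$dir"
    for f in 1 2 3; do
        head -c 4096 /dev/urandom > "${dir}/file${f}"
    done
done
```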
Include any warning/errors/backtraces from the system logs
Nothing relevant I can spot.