Frank030366 opened 1 year ago
That stack trace is you triggering an OOM from the ext4 code, not a kernel panic.
ZFS's memory usage there doesn't seem absurd from the output, but there seems to be a lot of RAM not covered by the output if it's got 32 GiB in total.
Could you share more of the output, ideally the first 200 or so lines after it starts to go off the rails?
Edit: capturing cat /proc/slabinfo over time would also be good. My guess here, since it's reporting a lot of memory usage in unreclaimable slabs but the math doesn't seem to add up (assuming I'm not missing a zero somewhere), is that memory fragmentation is biting you: you're making and (sometimes) freeing lots of small allocations, and then when something actually wants a contiguous 1M chunk (or a similarly large one) from an allocation that can't fail, we end up running around killing things trying to free enough, but still never find a large enough contiguous chunk anywhere.
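Something like this would do for the slabinfo capture (a minimal sketch; the interval and log file names are just placeholders, and I'd grab /proc/buddyinfo too, since it shows free memory broken down by contiguous allocation order):

# snapshot slab and buddy allocator state every 10 seconds
while true; do
    { date; cat /proc/slabinfo; } >> slabinfo.log
    { date; cat /proc/buddyinfo; } >> buddyinfo.log
    sleep 10
done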
In theory, I think the OOM messages should print somewhere around when they start how much they're trying to find, so that'd be a hint.
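For reference, those reports include an order=N field, where the failed request is for 2^N contiguous pages, so assuming the messages made it into the log, something like this should surface it:

dmesg | grep -E 'order=[0-9]+'   # order=N means a request for 2^N contiguous pages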
No problem - here is the file, attached. It's a dmesg dump created by kdump-tools.
That's odd - the first thing it mentions is, essentially, running out of RAM trying to fork() (unless I'm really bad at reading kernel code), which should be a comparatively very small allocation...
There are also almost no binaries running. That's very strange indeed. I'll try to reproduce this later when I'm home with my nice testbeds.
As an addition to the previous info: I've compiled several ZFS versions from release source tarballs and tested them, and it seems that this bug first "appeared" in version 2.1.6 - on versions 2.1.2-2.1.5 the test runs without any issues, but on 2.1.6 (and on 2.1.7 and 2.1.8 as well) it crashes the same way as on 2.1.9. Maybe it can help to find the root cause...
My wild guess would have been an interaction with #12746, but that was 2.1.5.
Simple enough to bisect assuming it reproduces readily, I suppose.
A bisection would be great. I personally would suspect https://github.com/openzfs/zfs/pull/13794 as the direct origin of the symptoms, as it changes how ZFS frees up memory. If so, that may mean that there is some other underlying memory accounting/nonfreeing/fragmentation issue and that this just reveals it…
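For reference, a bisect between the release tags would look roughly like this (assuming the bonnie++ reproducer is reliable enough per run):

# in a checkout of https://github.com/openzfs/zfs
git bisect start
git bisect bad zfs-2.1.6       # first release observed to crash
git bisect good zfs-2.1.5      # last release observed to survive
# at each step: rebuild, reload the zfs modules, rerun the reproducer,
# then mark the commit with `git bisect good` or `git bisect bad`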
Returning here after a very long session of testing various builds. Sorry for giving (as I detected later) wrong info about where this bug first appeared - deeper testing with n passes of bonnie++ painted a different picture.

First, I built several release tarball versions and found after testing that the latest "clean" version is 2.0.5; 2.0.6 is already broken. Second, I took the sequence of commits between those versions (https://github.com/openzfs/zfs/compare/zfs-2.0.5...zfs-2.0.6) and found (through simple test-driven narrowing) the commit that leads to crashes, here it is: https://github.com/openzfs/zfs/commit/7a41ef240a99f247d467e1bfa676b855d361174d

I really don't have a clue how this commit could trigger such an issue, but I've checked twice: with a build from the previous commit (https://github.com/openzfs/zfs/commit/72d16a9b49f54a44a471f881d767621e2bfbe913) everything goes fine, and with the next one mentioned above my test starts crashing. I don't have enough knowledge of this code to explain why it goes this way, but the tests show that this commit triggers the kernel panic issue...

P.S. Oh crap... After 8-pass testing instead of the previous 4 passes, the commit https://github.com/openzfs/zfs/commit/72d16a9b49f54a44a471f881d767621e2bfbe913 crashed again... I will keep at least a 10-pass test running with the 2.0.5 release to make sure that it's stable...
Even version 2.0.3 is crashing from time to time - so it's really unclear which change is the root cause. In addition, I see strange output while getting info from the kernel crash dump: kmem -i shows that all memory is consumed, but kmem -s doesn't give an answer as to who is consuming it (dumped after a zfs 2.1.9 crash):
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 8105286 30.9 GB ----
FREE 50721 198.1 MB 0% of TOTAL MEM
USED 8054565 30.7 GB 99% of TOTAL MEM
SHARED 5214 20.4 MB 0% of TOTAL MEM
BUFFERS 78 312 KB 0% of TOTAL MEM
CACHED -45 70368744177664 GB 2275890582233% of TOTAL MEM
SLAB 115517 451.2 MB 1% of TOTAL MEM
TOTAL HUGE 0 0 ----
HUGE FREE 0 0 0% of TOTAL HUGE
TOTAL SWAP 249855 976 MB ----
SWAP USED 3328 13 MB 1% of TOTAL SWAP
SWAP FREE 246527 963 MB 98% of TOTAL SWAP
COMMIT LIMIT 4302498 16.4 GB ----
COMMITTED 508454 1.9 GB 11% of TOTAL LIMIT
crash> kmem -s | awk '{print $1,$2, $5*$6"k", $7}' | sort -nrk3 | column -t | head
ffff93d344787000 1248 1040416k zio_cache
ffff93d344787e00 512 181600k zio_buf_comb_512
ffff93d340043400 2048 139488k kmalloc-2k
ffff93d340043300 4096 73888k kmalloc-4k
ffff93d340043700 256 59728k kmalloc-256
ffff93d340043800 192 45728k kmalloc-192
ffff93d344786e00 48 39792k zio_link_cache
ffff93d344786a00 104 38120k abd_t
ffff93d340043c00 32 31720k kmalloc-32
ffff93d340043a00 96 29020k kmalloc-96
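On the live system the same per-cache ranking can be approximated from /proc/slabinfo (a rough sketch that assumes the standard slabinfo 2.x columns, where field 3 is num_objs and field 4 is objsize in bytes; it ignores per-slab overhead and needs root):

awk 'NR > 2 { printf "%-24s %12.0f KiB\n", $1, $3 * $4 / 1024 }' /proc/slabinfo | sort -rnk2 | head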
Is there a version after which this is much more easily triggered? For example, is it true it happens ~quickly on 2.1.6+, and much more rarely on 2.0.3-2.1.5? I strongly suspect multiple issues at play.
That negative cache value is funny — I guess some tabulation race condition from being under memory pressure and constantly dropping caches and things.
Yeah, on 2.1.9 this issue is triggered quickly, after 1-2 tests, and not only with bonnie++ - for example, this fio test triggers it as well: cd /testpool && fio --name rw --rw rw --size 2G
In addition, I was able to reproduce this issue even on a VMware virtual machine. I've prepared an OVF with all the development tools required to compile a ZFS release, a ZFS pool with the required settings, and a script to execute the test; kdump-tools is configured as well, to produce a crash dump: https://drive.google.com/drive/folders/10lb9pjK23BHiezq08NveaRhJ0bddw5XK All you need to do to reproduce the issue is start the VM, log in to the root account (password 12345678), and execute the zfs-crash-test.sh script. ZFS 2.1.9 is used.
After several further tests I've found one more thing: this issue stops reproducing if I change the logbias setting from "throughput" to "latency". For reference, here are the initial settings from the issue description:
zfs set atime=off relatime=on xattr=sa dnodesize=auto logbias=throughput sync=standard dedup=off recordsize=1M compression=off testpool
With these settings we get a kernel panic after the test, and after applying just the one setting below, tests run fine even with a 1M recordsize:
zfs set logbias=latency testpool
I've made a quick search in the source code for the 2.1.9 version and found that only one method's behavior actually depends on this setting, and it's zfs_log_write - here it is: https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L532

I'm not familiar with the ZFS code, but at least this place seems strange to me: https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L586 Why do we use the "len" variable as an additional increment here when wr_state is WR_COPIED, although len is actually changed when wr_state is WR_INDIRECT? Maybe it's just my dumb question, but... Anyway, could you guys please take a look at this portion of code? Simple tests show that the root cause of this issue is probably in this method's behavior.
P.S. Right now I've compiled ZFS packages from the 2.1.9 tag with only one change at https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L586 - changed (wr_state == WR_COPIED ? len : 0)); to (wr_state == WR_INDIRECT ? len : 0)); - and the crash test runs fine even with logbias set to throughput and recordsize 1M. Can somebody here with more knowledge than me (because the only thing I can do is use dumb test-driven-like methods to find the root cause) review this code and give a more authoritative opinion - am I on the right track or not?
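If anyone wants to repeat the experiment, this is roughly how the patched module can be built from the tag (the standard in-tree OpenZFS build steps; producing actual packages differs per distro):

git clone -b zfs-2.1.9 https://github.com/openzfs/zfs.git && cd zfs
# apply the one-line change in module/zfs/zfs_log.c described above, then:
sh autogen.sh && ./configure && make -s -j$(nproc)
sudo make install && sudo depmod -a
# reload the zfs modules (or reboot) before rerunning the crash test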
That change will cause out-of-bounds memory writes on synchronous writes smaller than immediate_sz on line 595, so it is not what you want. The fact that allocating a larger itx fixes the issue for WR_INDIRECT records suggests an issue with that codepath writing beyond the record (when no additional length is allocated), but I don't immediately see the bug that would cause that.
I get your point, but in the case of writing beyond the record we would usually get an access violation error or something similar, while in this case I get a kernel panic due to out of memory - which usually happens when code allocates a large amount of memory but doesn't free it. Correct me if I'm wrong, please.
Well, one possibility is there aren’t any OOB writes and that your patch works by making the itx allocations larger, helping prevent memory fragmentation. If you are going to keep using your patch, I would change it to just unconditionally be len (rather than len or 0), to avoid an issue with small synchronous writes.
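Concretely, a quick-and-dirty way to try that unconditional variant in a 2.1.9 tree would be something like this (just a sketch that rewrites the expression quoted above; double-check the resulting line by eye):

sed -i 's/wr_state == WR_COPIED ? len : 0/len/' module/zfs/zfs_log.c
# the itx allocation size then becomes sizeof (*lr) + len for both write states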
It would be great for a maintainer to chime in here, since you've found a great reproducer.
System information
Describe the problem you're observing
The system gets a kernel panic every time when running a simple bonnie++ test on a ZFS filesystem (not a zvol) with recordsize set to 1M (512K is affected as well). If recordsize is set to 256K or lower, the test runs fine. This test was performed on bare-metal hardware (Intel i3 12th gen, 32GB RAM), but I was able to reproduce it on a VMware virtual machine as well.
Describe how to reproduce the problem
apt-get update && apt-get -y dist-upgrade
… > /etc/modprobe.d/zfs.conf
zpool create -o ashift=12 testpool raidz2 /dev/disk/by-id/<disk1_name> … /dev/disk/by-id/<disk8_name>
zfs set atime=off relatime=on xattr=sa dnodesize=auto logbias=throughput sync=standard dedup=off recordsize=1M compression=off testpool
useradd frank && cd /testpool && mkdir bon && chmod 777 bon
bonnie++ -d /testpool/bon/ -x 1 -u frank
Include any warning/errors/backtraces from the system logs
ZFS pool config (last characters obfuscated):
Disk info (all 8 HDDs are identical):
Dmesg trace (last 180 lines):