openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Kernel panic while running bonnie++ test on ZFS with recordsize 1M #14732

Open · Frank030366 opened 1 year ago

Frank030366 commented 1 year ago

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  11.6
Kernel Version        5.10.0-21-amd64
Architecture          x86_64
OpenZFS Version       zfs-2.1.9-1~bpo11+1

Describe the problem you're observing

The system gets a kernel panic every time a simple bonnie++ test is run on a ZFS filesystem (not a zvol) with recordsize set to 1M (512K is affected as well). If recordsize is set to 256K or lower, the test runs fine. This test was performed on bare-metal hardware (Intel i3 12th gen, 32GB RAM), but I was able to reproduce it on a VMware virtual machine as well.

Describe how to reproduce the problem

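A minimal sketch of the reproduction steps, reconstructed from the dataset settings and workloads described further down in this thread; the exact bonnie++ invocation was not given, so the flags below are an assumption (any plain run against the pool's mountpoint appears to be enough):

zfs set atime=off relatime=on xattr=sa dnodesize=auto logbias=throughput \
    sync=standard dedup=off recordsize=1M compression=off testpool
cd /testpool
bonnie++ -d /testpool -u nobody    # hypothetical flags
# an fio workload also triggers it:
# fio --name rw --rw rw --size 2G
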
Include any warning/errors/backtraces from the system logs

ZFS pool config (last characters obfuscated):

root@debian-test:/var/crash/202304101704# zpool status
  pool: testpool
 state: ONLINE
config:

        NAME                                          STATE     READ WRITE CKSUM
        testpool                                      ONLINE       0     0     0
          raidz2-0                                    ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD20EFZX-68AWUN0_WD-WXxxxxxxxxxx  ONLINE       0     0     0

errors: No known data errors

Disk info (all 8 HDDs are identical):

Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WDC WD20EFZX-68A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Dmesg trace (last 180 lines):

[  132.319785]  ext4_filemap_fault+0x2d/0x50 [ext4]
[  132.319787]  __do_fault+0x34/0x170
[  132.319788]  handle_mm_fault+0x124f/0x1c00
[  132.319790]  do_user_addr_fault+0x1b8/0x400
[  132.319791]  exc_page_fault+0x78/0x160
[  132.319793]  ? asm_exc_page_fault+0x8/0x30
[  132.319793]  asm_exc_page_fault+0x1e/0x30
[  132.319795] RIP: 0033:0x7f4cfee190f0
[  132.319798] Code: Unable to access opcode bytes at RIP 0x7f4cfee190c6.
[  132.319798] RSP: 002b:00007ffee89a8638 EFLAGS: 00010202
[  132.319799] RAX: 00007f4cff1a1ca0 RBX: 000056369cfd1040 RCX: 000056369cfc4fa0
[  132.319800] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000128
[  132.319800] RBP: 0000000000000000 R08: 3ba4b432781f6494 R09: 80d80bc654633720
[  132.319801] R10: e5041ae37db69227 R11: 0000000000000000 R12: 00007ffee89a86a0
[  132.319801] R13: 0000000000000008 R14: 0000000000000001 R15: 00007f4cff3ce568
[  132.319802] Mem-Info:
[  132.319804] active_anon:2 inactive_anon:0 isolated_anon:0
                active_file:106 inactive_file:52 isolated_file:0
                unevictable:0 dirty:0 writeback:0
                slab_reclaimable:4344 slab_unreclaimable:682761
                mapped:1 shmem:2 pagetables:127 bounce:0
                free:49439 free_pcp:682 free_cma:0
[  132.319805] Node 0 active_anon:8kB inactive_anon:0kB active_file:424kB inactive_file:208kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4kB dirty:0kB writeback:0kB shmem:8kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3680kB all_unreclaimable? yes
[  132.319805] Node 0 DMA free:11264kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  132.319807] lowmem_reserve[]: 0 593 31736 31736 31736
[  132.319809] Node 0 DMA32 free:125780kB min:1260kB low:1864kB high:2468kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:835288kB managed:638660kB mlocked:0kB pagetables:4kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB
[  132.319810] lowmem_reserve[]: 0 0 31143 31143 31143
[  132.319811] Node 0 Normal free:60712kB min:66288kB low:98176kB high:130064kB reserved_highatomic:0KB active_anon:8kB inactive_anon:0kB active_file:92kB inactive_file:0kB unevictable:0kB writepending:0kB present:32501760kB managed:31898196kB mlocked:0kB pagetables:504kB bounce:0kB free_pcp:2488kB local_pcp:0kB free_cma:0kB
[  132.319813] lowmem_reserve[]: 0 0 0 0 0
[  132.319814] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
[  132.319818] Node 0 DMA32: 4*4kB (M) 4*8kB (UM) 17*16kB (UM) 4*32kB (UM) 5*64kB (UM) 5*128kB (UM) 4*256kB (UM) 5*512kB (UM) 4*1024kB (M) 5*2048kB (UM) 26*4096kB (M) = 125824kB
[  132.319824] Node 0 Normal: 50*4kB (UME) 36*8kB (UE) 101*16kB (UME) 123*32kB (UME) 112*64kB (UME) 164*128kB (UME) 29*256kB (UM) 17*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 60568kB
[  132.319830] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  132.319830] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  132.319830] 129 total pagecache pages
[  132.319831] 0 pages in swap cache
[  132.319831] Swap cache stats: add 5730, delete 5730, find 125/290
[  132.319832] Free swap  = 988924kB
[  132.319832] Total swap = 999420kB
[  132.319832] 8338260 pages RAM
[  132.319833] 0 pages HighMem/MovableOnly
[  132.319833] 200206 pages reserved
[  132.319833] 0 pages hwpoisoned
[  132.319834] Unreclaimable slab info:
[  132.319834] Name                      Used          Total
[  132.319838] zfs_znode_hold_cache         27KB         27KB
[  132.319838] zfs_znode_cache           62KB         62KB
[  132.319839] zil_zcw_cache              3KB          3KB
[  132.319840] zil_lwb_cache            354KB        354KB
[  132.319840] dmu_buf_impl_t           110KB        110KB
[  132.319841] arc_buf_t                 31KB         31KB
[  132.319842] arc_buf_hdr_t_full         70KB         70KB
[  132.319842] dnode_t                  255KB        255KB
[  132.319843] sa_cache                  15KB         15KB
[  132.319844] abd_t                  61776KB      61790KB
[  132.319845] lz4_cache                256KB        256KB
[  132.319845] zio_buf_comb_16384        736KB        736KB
[  132.319846] zio_buf_comb_14336         32KB         32KB
[  132.319846] zio_buf_comb_12288         48KB         48KB
[  132.319846] zio_buf_comb_10240         24KB         24KB
[  132.319847] zio_buf_comb_8192         32KB         32KB
[  132.319847] zio_buf_comb_7168         32KB         32KB
[  132.319847] zio_buf_comb_6144         32KB         32KB
[  132.319848] zio_buf_comb_5120         32KB         32KB
[  132.319849] zio_buf_comb_4096        580KB        768KB
[  132.319849] zio_buf_comb_3584         31KB         31KB
[  132.319850] zio_buf_comb_3072        150KB        150KB
[  132.319850] zio_buf_comb_2560         30KB         30KB
[  132.319851] zio_buf_comb_2048         64KB         64KB
[  132.319851] zio_buf_comb_1536        157KB        157KB
[  132.319851] zio_buf_comb_1024         64KB         64KB
[  132.319852] zio_buf_comb_512      597680KB     597760KB
[  132.319853] zio_link_cache         52963KB      52996KB
[  132.319858] zio_cache            1372563KB    1373978KB
[  132.319859] zfs_btree_leaf_cache        808KB        832KB
[  132.319860] mod_hash_entries           3KB          3KB
[  132.319861] ext4_system_zone          11KB         11KB
[  132.319861] scsi_sense_cache        1072KB       1072KB
[  132.319862] RAWv6                     61KB         61KB
[  132.319863] mqueue_inode_cache         31KB         31KB
[  132.319864] UNIX                     256KB        256KB
[  132.319864] RAW                       64KB         64KB
[  132.319865] UDP                      159KB        159KB
[  132.319865] request_sock_TCP           7KB          7KB
[  132.319866] TCP                       61KB         61KB
[  132.319866] hugetlbfs_inode_cache         30KB         30KB
[  132.319867] eventpoll_pwq             31KB         31KB
[  132.319867] request_queue            189KB        189KB
[  132.319868] biovec-max              1184KB       1248KB
[  132.319869] biovec-128               384KB        384KB
[  132.319869] biovec-64                384KB        384KB
[  132.319869] khugepaged_mm_slot         31KB         31KB
[  132.319870] dmaengine-unmap-256         30KB         30KB
[  132.319870] dmaengine-unmap-128         31KB         31KB
[  132.319871] dmaengine-unmap-16        110KB        110KB
[  132.319871] skbuff_fclone_cache        128KB        128KB
[  132.319872] skbuff_head_cache        216KB        248KB
[  132.319873] file_lock_cache           62KB         62KB
[  132.319873] fsnotify_mark_connector         32KB         32KB
[  132.319873] net_namespace             30KB         30KB
[  132.319874] x86_lbr                   31KB         31KB
[  132.319874] task_delay_info          143KB        143KB
[  132.319875] taskstats                 35KB         35KB
[  132.319875] proc_dir_entry           137KB        137KB
[  132.319876] pde_opener                31KB         31KB
[  132.319876] seq_file                  31KB         31KB
[  132.319877] shmem_inode_cache       1299KB       1345KB
[  132.319878] kernfs_node_cache       5876KB       5876KB
[  132.319878] mnt_cache                250KB        250KB
[  132.319880] filp                     618KB        752KB
[  132.319880] names_cache              256KB        256KB
[  132.319881] lsm_file_cache            35KB         35KB
[  132.319881] uts_namespace             31KB         31KB
[  132.319882] vm_area_struct           661KB        667KB
[  132.319882] mm_struct                255KB        255KB
[  132.319883] files_cache              158KB        158KB
[  132.319883] signal_cache            1417KB       1417KB
[  132.319883] sighand_cache           1020KB       1020KB
[  132.319885] task_struct             1746KB       1980KB
[  132.319887] cred_jar                 231KB        287KB
[  132.319887] anon_vma_chain           380KB        404KB
[  132.319888] anon_vma                 510KB        510KB
[  132.319888] pid                      212KB        212KB
[  132.319888] Acpi-Operand            1102KB       1102KB
[  132.319889] Acpi-Parse                31KB         31KB
[  132.319889] Acpi-State                47KB         47KB
[  132.319889] numa_policy                7KB          7KB
[  132.319890] trace_event_file         166KB        166KB
[  132.319890] ftrace_event_field        609KB        609KB
[  132.319891] pool_workqueue            96KB         96KB
[  132.319891] task_group               126KB        126KB
[  132.319907] vmap_area               1435KB       1660KB
[  132.319908] kmalloc-8k              1216KB       1216KB
[  132.319909] kmalloc-4k             80640KB      80736KB
[  132.319909] kmalloc-2k            152768KB     152768KB
[  132.319910] kmalloc-1k             14496KB      14496KB
[  132.319910] kmalloc-512             1443KB       1488KB
[  132.319913] kmalloc-256           115340KB     115400KB
[  132.319913] kmalloc-192            87364KB      87365KB
[  132.319913] kmalloc-128              368KB        368KB
[  132.319914] kmalloc-96             50246KB      50246KB
[  132.319914] kmalloc-64             30264KB      30264KB
[  132.319914] kmalloc-32             59984KB      59984KB
[  132.319914] kmalloc-16               212KB        212KB
[  132.319915] kmalloc-8                 92KB         92KB
[  132.319915] kmem_cache_node           32KB         32KB
[  132.319915] kmem_cache                88KB         88KB
[  132.319916] Tasks state (memory values in pages):
[  132.319916] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  132.319919] [    330]     0   330     6087        1    73728      658         -1000 systemd-udevd
[  132.319922] [   1296]     0  1296     3367        0    69632      255         -1000 sshd
[  132.319923] [   1521]  1000  1521     1471        0    49152        0             0 bonnie++
[  132.319923] Out of memory and no killable processes...
[  132.319924] Kernel panic - not syncing: System is deadlocked on memory
[  132.319933] CPU: 2 PID: 1 Comm: systemd Kdump: loaded Tainted: P           OE     5.10.0-21-amd64 #1 Debian 5.10.162-1
[  132.319944] Hardware name: Gigabyte Technology Co., Ltd. B660I AORUS PRO DDR4/B660I AORUS PRO DDR4, BIOS F21 11/15/2022
[  132.319955] Call Trace:
[  132.319959]  dump_stack+0x6b/0x83
[  132.319964]  panic+0x101/0x2d7
[  132.319968]  out_of_memory.cold+0x2f/0x7e
[  132.319976]  __alloc_pages_slowpath.constprop.0+0xbcc/0xc90
[  132.319983]  __alloc_pages_nodemask+0x2de/0x310
[  132.320000]  pagecache_get_page+0x175/0x390
[  132.320005]  filemap_fault+0x6a2/0x900
[  132.320010]  ? xas_load+0x5/0x80
[  132.320021]  ext4_filemap_fault+0x2d/0x50 [ext4]
[  132.320027]  __do_fault+0x34/0x170
[  132.320032]  handle_mm_fault+0x124f/0x1c00
[  132.320038]  do_user_addr_fault+0x1b8/0x400
[  132.320044]  exc_page_fault+0x78/0x160
[  132.320049]  ? asm_exc_page_fault+0x8/0x30
[  132.320054]  asm_exc_page_fault+0x1e/0x30
[  132.320461] RIP: 0033:0x7f4cfee190f0
[  132.320831] Code: Unable to access opcode bytes at RIP 0x7f4cfee190c6.
[  132.321223] RSP: 002b:00007ffee89a8638 EFLAGS: 00010202
[  132.321608] RAX: 00007f4cff1a1ca0 RBX: 000056369cfd1040 RCX: 000056369cfc4fa0
[  132.322234] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000128
[  132.322897] RBP: 0000000000000000 R08: 3ba4b432781f6494 R09: 80d80bc654633720
[  132.323534] R10: e5041ae37db69227 R11: 0000000000000000 R12: 00007ffee89a86a0
[  132.324171] R13: 0000000000000008 R14: 0000000000000001 R15: 00007f4cff3ce568
rincebrain commented 1 year ago

That stack trace is you triggering an OOM from the ext4 code, not a kernel panic.

ZFS's memory usage there doesn't seem absurd from the output, but there seems to be a lot of RAM not covered by that output if the machine has 32 GiB in total.

Could you share more of the output, ideally the first 200 or so lines after it starts to go off the rails?

Edit: capturing cat /proc/slabinfo over time would also be good. My guess here, since it's reporting a lot of memory usage in unreclaimable slabs but the math doesn't seem to add up (assuming I'm not missing a zero somewhere), is that memory fragmentation is biting you: you're making and (sometimes) freeing lots of small allocations, and then when something actually wants a contiguous 1M chunk (or a similarly large one) through an allocation that can't fail, we end up running around killing things to free memory, but still never find a large enough contiguous chunk anywhere.
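
One simple way to capture that over time while the test runs (a helper loop suggested here, not something already in this thread; adjust the interval and output path as needed):

while true; do
    date >> /var/tmp/slabinfo.log
    cat /proc/slabinfo >> /var/tmp/slabinfo.log
    sleep 5
done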

In theory, I think the OOM messages should print somewhere around when they start how much they're trying to find, so that'd be a hint.
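
If the kdump-captured dmesg is at hand, that hint is usually on the "invoked oom-killer" line together with the gfp_mask and allocation order; a hedged way to pull it out (the path is only an example of where Debian's kdump-tools places dumps):

grep -E 'invoked oom-killer|gfp_mask=|order=' /var/crash/*/dmesg.*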

Frank030366 commented 1 year ago

No problem - the file is attached; it's a dmesg dump created by kdump-tools.

dmesg.202304101704.tar.gz

> That stack trace is you triggering an OOM from the ext4 code, not a kernel panic.
>
> ZFS's memory usage there doesn't seem absurd from the output, but there seems to be a lot of RAM not covered by that output if the machine has 32 GiB in total.
>
> Could you share more of the output, ideally the first 200 or so lines after it starts to go off the rails?

rincebrain commented 1 year ago

That's odd - unless I'm really bad at reading kernel code, the first thing it mentions is running out of RAM while trying to fork(), essentially, which should be a comparatively very small allocation...

There are also almost no binaries running. That's very strange indeed. I'll try to reproduce this later when I'm home with my nice testbeds.

Frank030366 commented 1 year ago

As an addition to the previous info, I've compiled several ZFS versions from the release source tarballs and tested them, and it seems that this bug first "appeared" in version 2.1.6: on versions 2.1.2-2.1.5 the test runs without any issues, but on 2.1.6 (and on 2.1.7 and 2.1.8 as well) it crashes the same way as on 2.1.9. Maybe this can help to find the root cause...

rincebrain commented 1 year ago

My wild guess would have been an interaction with #12746 but that was 2.1.5.

Simple enough to bisect assuming it reproduces readily, I suppose.
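
A rough sketch of that bisect between the last-good and first-bad release tags, assuming the standard OpenZFS from-source build steps (adjust the build and module loading for your distro):

git clone https://github.com/openzfs/zfs.git && cd zfs
git bisect start zfs-2.1.6 zfs-2.1.5
# at each bisect step: build, load the freshly built modules, run the test
sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo ./scripts/zfs.sh
# run the bonnie++ test against the pool, then mark the result:
git bisect good    # or: git bisect bad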

tmcqueen-materials commented 1 year ago

A bisection would be great. I personally would suspect https://github.com/openzfs/zfs/pull/13794 as the direct origin of the symptoms, as it changes how ZFS frees up memory. If so, that may mean that there is some other underlying memory accounting/nonfreeing/fragmentation issue and that this just reveals it…

Frank030366 commented 1 year ago

Returning here after a very long session of testing various builds. Sorry for giving (as I detected later) wrong info about where this bug probably first appeared - deeper testing with multi-pass bonnie++ runs showed a different picture.

First, I built several release tarball versions and found after testing that the latest "clean" version is 2.0.5; 2.0.6 is already broken. Second, I went through the sequence of commits between those versions (https://github.com/openzfs/zfs/compare/zfs-2.0.5...zfs-2.0.6) and found (through simple test-driven checking) a commit that leads to crashes, here it is: https://github.com/openzfs/zfs/commit/7a41ef240a99f247d467e1bfa676b855d361174d I really don't have a clue how this commit could trigger such an issue, but I've checked twice: with a build from the previous commit (https://github.com/openzfs/zfs/commit/72d16a9b49f54a44a471f881d767621e2bfbe913) everything goes fine, and with the next one mentioned above my test starts crashing. I don't have enough knowledge of this code to explain why it goes this way, but the tests show that this commit triggers the kernel panic issue...

P.S. Oh crap... After an 8-pass test instead of the previous 4-pass one, commit https://github.com/openzfs/zfs/commit/72d16a9b49f54a44a471f881d767621e2bfbe913 crashed again... I'll keep at least a 10-pass test running with the 2.0.5 release to make sure it's stable...

Frank030366 commented 1 year ago

Even version 2.0.3 crashes from time to time - so it's really unclear which change is the root cause. In addition, I see strange output when getting info from the kernel crash dump: kmem -i shows that all memory is consumed, but kmem -s doesn't give an answer as to who is consuming it (dumped after a zfs 2.1.9 crash):

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  8105286      30.9 GB         ----
         FREE    50721     198.1 MB    0% of TOTAL MEM
         USED  8054565      30.7 GB   99% of TOTAL MEM
       SHARED     5214      20.4 MB    0% of TOTAL MEM
      BUFFERS       78       312 KB    0% of TOTAL MEM
       CACHED      -45  70368744177664 GB  2275890582233% of TOTAL MEM
         SLAB   115517     451.2 MB    1% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP   249855       976 MB         ----
    SWAP USED     3328        13 MB    1% of TOTAL SWAP
    SWAP FREE   246527       963 MB   98% of TOTAL SWAP

 COMMIT LIMIT  4302498      16.4 GB         ----
    COMMITTED   508454       1.9 GB   11% of TOTAL LIMIT

crash> kmem -s | awk '{print $1,$2, $5*$6"k", $7}' | sort -nrk3 | column -t | head
ffff93d344787000  1248     1040416k  zio_cache
ffff93d344787e00  512      181600k   zio_buf_comb_512
ffff93d340043400  2048     139488k   kmalloc-2k
ffff93d340043300  4096     73888k    kmalloc-4k
ffff93d340043700  256      59728k    kmalloc-256
ffff93d340043800  192      45728k    kmalloc-192
ffff93d344786e00  48       39792k    zio_link_cache
ffff93d344786a00  104      38120k    abd_t
ffff93d340043c00  32       31720k    kmalloc-32
ffff93d340043a00  96       29020k    kmalloc-96

tmcqueen-materials commented 1 year ago

Is there a version after which this is much more easily triggered? For example, is it true that it happens ~quickly on 2.1.6+, and much more rarely on 2.0.3-2.1.5? I strongly suspect multiple issues at play.

That negative cache value is funny — I guess some tabulation race condition from being under memory pressure and constantly dropping caches and things.

Frank030366 commented 1 year ago

> Is there a version after which this is much more easily triggered? For example, is it true that it happens ~quickly on 2.1.6+, and much more rarely on 2.0.3-2.1.5? I strongly suspect multiple issues at play.
>
> That negative cache value is funny — I guess some tabulation race condition from being under memory pressure and constantly dropping caches and things.

Yeah, on 2.1.9 this issue is triggered quickly, after 1-2 tests, and not only with bonnie++ - for example, this fio test triggers it as well: cd /testpool && fio --name rw --rw rw --size 2G

Frank030366 commented 1 year ago

In addition, I was able to reproduce this issue even on a VMware virtual machine. I've prepared an OVF with all the development tools required to compile a ZFS release, a ZFS pool with the required settings, and a script to execute the test; kdump-tools is configured as well to produce a crash dump: https://drive.google.com/drive/folders/10lb9pjK23BHiezq08NveaRhJ0bddw5XK All you need to do to reproduce the issue is start the VM, log in to the root account (password 12345678) and execute the zfs-crash-test.sh script; ZFS 2.1.9 is used.

Frank030366 commented 1 year ago

After several further tests I've found one more thing: this issue stops reproducing if I change the logbias setting from "throughput" to "latency". Here are the initial settings from the issue description:

zfs set atime=off relatime=on xattr=sa dnodesize=auto logbias=throughput sync=standard dedup=off recordsize=1M compression=off testpool

With these settings we get a kernel panic during the test, and after applying just the one setting below, the tests run fine even with a 1M recordsize:

zfs set logbias=latency testpool

I've made a quick search in the 2.1.9 source code and found that only one method's behavior actually depends on this setting, and that's zfs_log_write - here it is: https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L532

I'm not familiar with the ZFS code, but at least this place seems strange to me: https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L586 Why do we use the len variable as the additional size increment here when wr_state is WR_COPIED, although len is actually changed when wr_state is WR_INDIRECT? Maybe it's just a dumb question, but... Anyway, could you guys please take a look at this portion of code? Simple tests show that the root cause of this issue is probably in this method's behavior.

P.S. Right now I've compiled ZFS packages from the 2.1.9 tag with only one change at https://github.com/openzfs/zfs/blob/zfs-2.1.9/module/zfs/zfs_log.c#L586 - changed (wr_state == WR_COPIED ? len : 0)); to (wr_state == WR_INDIRECT ? len : 0)); - and the crash test runs fine even with logbias set to throughput and recordsize 1M. Can somebody here with more knowledge than me (the only thing I can do is simple test-driven probing to find the root cause) review this code and give a more authoritative opinion - am I on the right track or not?

tmcqueen-materials commented 1 year ago

That change will cause out-of-bounds memory writes on synchronous writes smaller than immediate_sz at line 595, so it is not what you want. The fact that allocating a larger itx fixes the issue for WR_INDIRECT records suggests an issue with that codepath writing beyond the record (when no additional length is allocated), but I don't immediately see the bug that would cause that.

Frank030366 commented 1 year ago

> That change will cause out-of-bounds memory writes on synchronous writes smaller than immediate_sz at line 595, so it is not what you want. The fact that allocating a larger itx fixes the issue for WR_INDIRECT records suggests an issue with that codepath writing beyond the record (when no additional length is allocated), but I don't immediately see the bug that would cause that.

I get your point, but in the case of writing beyond the record we would usually get an access violation error or something similar, while in this case I get a kernel panic due to out of memory - that usually happens when code allocates a large amount of memory but doesn't free it; please correct me if I'm wrong.

tmcqueen-materials commented 1 year ago

Well, one possibility is that there aren't any OOB writes and your patch works by making the itx allocations larger, helping prevent memory fragmentation. If you are going to keep using your patch, I would change it to just unconditionally be len (rather than len or 0), to avoid an issue with small synchronous writes.

It would be great for a maintainer to chime in here, since you've found a great reproducer.