openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

null pointer deref on 2.2.0 #15485

Open prometheanfire opened 10 months ago

prometheanfire commented 10 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Gentoo |
| Distribution Version | Gentoo |
| Kernel Version | 6.5.9-gentoo-dist |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.2-release (built on 2023-10-29; the tagged 2.2.0 would be the same) |

Describe the problem you're observing

Backtrace while emerging packages, i.e. under high I/O. Normally I build in tmpfs, but there is not enough RAM for webkit...

Describe how to reproduce the problem

Happened when installing gentoo-kernel-bin.

Include any warning/errors/backtraces from the system logs

[41067.230629] BUG: kernel NULL pointer dereference, address: 0000000000000000
[41067.230639] #PF: supervisor read access in kernel mode
[41067.230642] #PF: error_code(0x0000) - not-present page
[41067.230645] PGD 0 P4D 0
[41067.230651] Oops: 0000 [#1] PREEMPT SMP NOPTI
[41067.230655] CPU: 12 PID: 581 Comm: dp_sync_taskq Tainted: P           OE      6.5.9-gentoo-dist #1
[41067.230660] Hardware name: LENOVO 20Y1CT01WW/20Y1CT01WW, BIOS R1BET75W(1.44 ) 06/13/2023
[41067.230663] RIP: 0010:arc_write+0x6c/0x2530 [zfs]
[41067.230810] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[41067.230815] RSP: 0018:ffffaef6c2fbb978 EFLAGS: 00010286
[41067.230819] RAX: ffffaef6c2fbbaf8 RBX: ffff8f1fa8629880 RCX: ffff8f1e66001650
[41067.230822] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffaef6c2fbbad8
[41067.230825] RBP: ffffaef6c2fbba48 R08: ffff8f1fa8629880 R09: 0000000000000000
[41067.230828] R10: ffffaef6c2fbba58 R11: ffffffffc08be820 R12: 0000000000000000
[41067.230831] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[41067.230834] FS:  0000000000000000(0000) GS:ffff8f2490b00000(0000) knlGS:0000000000000000
[41067.230838] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41067.230841] CR2: 0000000000000000 CR3: 000000016fa1a000 CR4: 0000000000350ee0
[41067.230844] Call Trace:
[41067.230849]  <TASK>
[41067.230854]  ? __die+0x23/0x70
[41067.230862]  ? page_fault_oops+0x171/0x4e0
[41067.230871]  ? exc_page_fault+0x7f/0x180
[41067.230878]  ? asm_exc_page_fault+0x26/0x30
[41067.230885]  ? dbuf_rele+0x50/0x500 [zfs]
[41067.231015]  ? arc_write+0x6c/0x2530 [zfs]
[41067.231136]  ? arc_getbuf_func+0x30/0x260 [zfs]
[41067.231267]  ? dmu_buf_unlock_parent+0x90/0xdc0 [zfs]
[41067.231409]  ? srso_return_thunk+0x5/0x10
[41067.231417]  dbuf_is_l2cacheable+0x4b1/0x6b0 [zfs]
[41067.231559]  ? dmu_buf_unlock_parent+0x90/0xdc0 [zfs]
[41067.231691]  ? dbuf_rele+0x50/0x500 [zfs]
[41067.231838]  ? srso_return_thunk+0x5/0x10
[41067.231843]  ? dbuf_hold_impl+0x112/0x760 [zfs]
[41067.232001]  dbuf_hold+0x41e/0x9a0 [zfs]
[41067.232139]  dbuf_sync_list+0xaa/0x110 [zfs]
[41067.232262]  dbuf_assign_arcbuf+0x570/0x600 [zfs]
[41067.232383]  dbuf_sync_list+0x4c/0x110 [zfs]
[41067.232503]  dnode_sync+0x413/0x15a0 [zfs]
[41067.232645]  dmu_objset_clone+0x5b5/0x6e0 [zfs]
[41067.232776]  taskq_dispatch+0x50b/0x700 [spl]
[41067.232791]  ? __pfx_default_wake_function+0x10/0x10
[41067.232804]  ? taskq_dispatch+0x2a0/0x700 [spl]
[41067.232814]  kthread+0xe8/0x120
[41067.232820]  ? __pfx_kthread+0x10/0x10
[41067.232826]  ret_from_fork+0x34/0x50
[41067.232832]  ? __pfx_kthread+0x10/0x10
[41067.232836]  ret_from_fork_asm+0x1b/0x30
[41067.232848]  </TASK>
[41067.232850] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel nf_conntrack_netlink br_netfilter rfcomm nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_tftp nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib bridge stp llc nft_reject_inet nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute overlay ip_set nfnetlink ebtable_filter ebtables xt_MASQUERADE xt_addrtype iptable_nat xt_CHECKSUM iptable_mangle iptable_raw ipt_REJECT nf_reject_ipv4 xt_conntrack iptable_filter iptable_security ip_tables ip6table_nat qrtr bnep nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle ip6table_raw ip6table_security ip6table_filter ip6_tables uvcvideo uvc videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common btusb videodev btrtl btbcm btintel btmtk bluetooth mc vfat fat amdgpu iwlmvm snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn intel_rapl_msr snd_sof_amd_rembrandt intel_rapl_common snd_sof_amd_renoir mac80211
[41067.232961]  snd_sof_amd_acp snd_sof_pci libarc4 snd_sof_xtensa_dsp snd_ctl_led snd_sof edac_mce_amd snd_hda_codec_realtek amdxcp snd_sof_utils iommu_v2 snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core gpu_sched kvm_amd iwlwifi snd_hda_intel i2c_algo_bit drm_suballoc_helper snd_intel_dspcfg snd_compress drm_ttm_helper ttm snd_intel_sdw_acpi ac97_bus tps6598x snd_pcm_dmaengine kvm snd_hda_codec drm_display_helper snd_pci_ps cfg80211 snd_rpl_pci_acp6x snd_hda_core snd_pci_acp6x irqbypass cec snd_pci_acp5x snd_hwdep rapl thinkpad_acpi snd_pcm snd_rn_pci_acp3x drm_kms_helper ledtrig_audio snd_acp_config platform_profile snd_soc_acpi think_lmi snd_timer firmware_attributes_class wmi_bmof acpi_cpufreq pcspkr rfkill drm_buddy ipmi_devintf r8169 snd_pci_acp3x snd ipmi_msghandler k10temp i2c_piix4 soundcore serial_multi_instantiate i2c_scmi joydev tun fuse lm92 loop zfs(POE) spl(OE) crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic rtsx_pci_sdmmc ghash_clmulni_intel mmc_core sha512_ssse3 nvme
[41067.233077]  sp5100_tco ucsi_acpi ccp nvme_core typec_ucsi rtsx_pci nvme_common typec video wmi serio_raw dm_multipath
[41067.233099] CR2: 0000000000000000
[41067.233104] ---[ end trace 0000000000000000 ]---
[41067.233107] RIP: 0010:arc_write+0x6c/0x2530 [zfs]
[41067.233229] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[41067.233233] RSP: 0018:ffffaef6c2fbb978 EFLAGS: 00010286
[41067.233237] RAX: ffffaef6c2fbbaf8 RBX: ffff8f1fa8629880 RCX: ffff8f1e66001650
[41067.233239] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffaef6c2fbbad8
[41067.233242] RBP: ffffaef6c2fbba48 R08: ffff8f1fa8629880 R09: 0000000000000000
[41067.233245] R10: ffffaef6c2fbba58 R11: ffffffffc08be820 R12: 0000000000000000
[41067.233247] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[41067.233250] FS:  0000000000000000(0000) GS:ffff8f2490b00000(0000) knlGS:0000000000000000
[41067.233253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41067.233256] CR2: 0000000000000000 CR3: 000000016fa1a000 CR4: 0000000000350ee0
[41067.233259] note: dp_sync_taskq[581] exited with irqs disabled
[41067.234127] BUG: kernel NULL pointer dereference, address: 0000000000000000
[41067.234135] #PF: supervisor read access in kernel mode
[41067.234138] #PF: error_code(0x0000) - not-present page
[41067.234141] PGD 0 P4D 0
[41067.234146] Oops: 0000 [#2] PREEMPT SMP NOPTI
[41067.234150] CPU: 4 PID: 583 Comm: dp_sync_taskq Tainted: P      D    OE      6.5.9-gentoo-dist #1
[41067.234155] Hardware name: LENOVO 20Y1CT01WW/20Y1CT01WW, BIOS R1BET75W(1.44 ) 06/13/2023
[41067.234158] RIP: 0010:arc_write+0x6c/0x2530 [zfs]
[41067.234283] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[41067.234286] RSP: 0018:ffffaef6c2fcb978 EFLAGS: 00010286
[41067.234290] RAX: ffffaef6c2fcbaf8 RBX: ffff8f1e1de261b0 RCX: ffff8f1e63b32e50
[41067.234293] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffaef6c2fcbad8
[41067.234296] RBP: ffffaef6c2fcba48 R08: ffff8f1e1de261b0 R09: 0000000000000000
[41067.234299] R10: ffffaef6c2fcba58 R11: ffffffffc08be820 R12: 0000000000000000
[41067.234301] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[41067.234304] FS:  0000000000000000(0000) GS:ffff8f2490900000(0000) knlGS:0000000000000000
[41067.234307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41067.234310] CR2: 0000000000000000 CR3: 000000036ea32000 CR4: 0000000000350ee0
[41067.234314] Call Trace:
[41067.234319]  <TASK>
[41067.234323]  ? __die+0x23/0x70
[41067.234331]  ? page_fault_oops+0x171/0x4e0
[41067.234341]  ? exc_page_fault+0x7f/0x180
[41067.234347]  ? asm_exc_page_fault+0x26/0x30
[41067.234355]  ? dbuf_rele+0x50/0x500 [zfs]
[41067.234478]  ? arc_write+0x6c/0x2530 [zfs]
[41067.234595]  ? arc_getbuf_func+0x30/0x260 [zfs]
[41067.234706]  ? dmu_buf_unlock_parent+0x90/0xdc0 [zfs]
[41067.234827]  ? srso_return_thunk+0x5/0x10
[41067.234835]  dbuf_is_l2cacheable+0x4b1/0x6b0 [zfs]
[41067.234966]  ? dmu_buf_unlock_parent+0x90/0xdc0 [zfs]
[41067.235076]  ? dbuf_rele+0x50/0x500 [zfs]
[41067.235209]  ? srso_return_thunk+0x5/0x10
[41067.235213]  ? dbuf_hold_impl+0x112/0x760 [zfs]
[41067.235335]  dbuf_hold+0x41e/0x9a0 [zfs]
[41067.235457]  dbuf_sync_list+0xaa/0x110 [zfs]
[41067.235575]  dbuf_assign_arcbuf+0x570/0x600 [zfs]
[41067.235693]  dbuf_sync_list+0x4c/0x110 [zfs]
[41067.235810]  dnode_sync+0x413/0x15a0 [zfs]
[41067.235950]  dmu_objset_clone+0x5b5/0x6e0 [zfs]
[41067.236074]  taskq_dispatch+0x50b/0x700 [spl]
[41067.236087]  ? __pfx_default_wake_function+0x10/0x10
[41067.236098]  ? taskq_dispatch+0x2a0/0x700 [spl]
[41067.236108]  kthread+0xe8/0x120
[41067.236114]  ? __pfx_kthread+0x10/0x10
[41067.236119]  ret_from_fork+0x34/0x50
[41067.236125]  ? __pfx_kthread+0x10/0x10
[41067.236129]  ret_from_fork_asm+0x1b/0x30
[41067.236140]  </TASK>
[41067.236142] Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel nf_conntrack_netlink br_netfilter rfcomm nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_tftp nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib bridge stp llc nft_reject_inet nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute overlay ip_set nfnetlink ebtable_filter ebtables xt_MASQUERADE xt_addrtype iptable_nat xt_CHECKSUM iptable_mangle iptable_raw ipt_REJECT nf_reject_ipv4 xt_conntrack iptable_filter iptable_security ip_tables ip6table_nat qrtr bnep nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle ip6table_raw ip6table_security ip6table_filter ip6_tables uvcvideo uvc videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common btusb videodev btrtl btbcm btintel btmtk bluetooth mc vfat fat amdgpu iwlmvm snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn intel_rapl_msr snd_sof_amd_rembrandt intel_rapl_common snd_sof_amd_renoir mac80211
[41067.236243]  snd_sof_amd_acp snd_sof_pci libarc4 snd_sof_xtensa_dsp snd_ctl_led snd_sof edac_mce_amd snd_hda_codec_realtek amdxcp snd_sof_utils iommu_v2 snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core gpu_sched kvm_amd iwlwifi snd_hda_intel i2c_algo_bit drm_suballoc_helper snd_intel_dspcfg snd_compress drm_ttm_helper ttm snd_intel_sdw_acpi ac97_bus tps6598x snd_pcm_dmaengine kvm snd_hda_codec drm_display_helper snd_pci_ps cfg80211 snd_rpl_pci_acp6x snd_hda_core snd_pci_acp6x irqbypass cec snd_pci_acp5x snd_hwdep rapl thinkpad_acpi snd_pcm snd_rn_pci_acp3x drm_kms_helper ledtrig_audio snd_acp_config platform_profile snd_soc_acpi think_lmi snd_timer firmware_attributes_class wmi_bmof acpi_cpufreq pcspkr rfkill drm_buddy ipmi_devintf r8169 snd_pci_acp3x snd ipmi_msghandler k10temp i2c_piix4 soundcore serial_multi_instantiate i2c_scmi joydev tun fuse lm92 loop zfs(POE) spl(OE) crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic rtsx_pci_sdmmc ghash_clmulni_intel mmc_core sha512_ssse3 nvme
[41067.236347]  sp5100_tco ucsi_acpi ccp nvme_core typec_ucsi rtsx_pci nvme_common typec video wmi serio_raw dm_multipath
[41067.236367] CR2: 0000000000000000
[41067.236371] ---[ end trace 0000000000000000 ]---
[41067.236374] RIP: 0010:arc_write+0x6c/0x2530 [zfs]
[41067.236491] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[41067.236494] RSP: 0018:ffffaef6c2fbb978 EFLAGS: 00010286
[41067.236498] RAX: ffffaef6c2fbbaf8 RBX: ffff8f1fa8629880 RCX: ffff8f1e66001650
[41067.236500] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffaef6c2fbbad8
[41067.236503] RBP: ffffaef6c2fbba48 R08: ffff8f1fa8629880 R09: 0000000000000000
[41067.236506] R10: ffffaef6c2fbba58 R11: ffffffffc08be820 R12: 0000000000000000
[41067.236508] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[41067.236511] FS:  0000000000000000(0000) GS:ffff8f2490900000(0000) knlGS:0000000000000000
[41067.236514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41067.236517] CR2: 0000000000000000 CR3: 000000036ea32000 CR4: 0000000000350ee0
[41067.236520] note: dp_sync_taskq[583] exited with irqs disabled
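
Both oopses above fault at the same instruction. For anyone comparing their own logs against these, here is a minimal sketch that pulls the faulting `symbol+offset/size` out of a saved dmesg dump. The function name `oops_sites` is made up for illustration, and it assumes the log uses the `RIP: 0010:symbol+off/size` layout shown above.

```shell
# Hypothetical helper: list the distinct faulting locations in a kernel log.
# Reads the log on stdin and prints each unique "symbol+offset/size" once.
oops_sites() {
    # Keep only what follows "RIP: <segment>:" up to the next space,
    # then collapse repeats (each oops prints its RIP line twice).
    sed -n 's/.*RIP: [0-9]*:\([^ ]*\).*/\1/p' | sort -u
}

# Usage (not run here): dmesg | oops_sites
```

If two reports print the same offset but a different function size, the modules were built differently, but the faulting instruction is likely the same one.
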
prometheanfire commented 9 months ago

Lucky I created this issue: the system locked up and the pool died.

This may be related to encryption / block cloning (suggested on IRC).

I have a picture from trying to reboot (on an older kernel, but it still shows a backtrace; it also occurred on the versions referenced above): https://photos.app.goo.gl/8KMLgm9mVSob3FZ46

oromenahar commented 9 months ago

Just a hint: if this is related to block cloning, you can use version 2.1.13 or earlier. Those versions have no block cloning support wired up to the Linux kernel. Did you try it on one of those versions?
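
Before downgrading, it may help to check whether the pool currently holds any cloned blocks at all. A rough sketch, assuming the standard `zpool get` feature properties (which report `disabled`, `enabled`, or `active`); the helper names `bclone_state` and `describe_state` and the pool name in the usage comment are placeholders, not anything from this report.

```shell
# Hypothetical helpers for checking the block_cloning pool feature.

# Print the raw feature state for the given pool (needs the zpool CLI).
bclone_state() {
    zpool get -H -o value feature@block_cloning "$1"
}

# Turn the raw state into a short explanation.
describe_state() {
    case "$1" in
        active)   echo "cloned blocks are currently present on this pool" ;;
        enabled)  echo "feature is enabled but no cloned blocks exist right now" ;;
        disabled) echo "feature is not enabled on this pool" ;;
        *)        echo "unrecognized state: $1" ;;
    esac
}

# Usage (not run here; "rpool" is a placeholder pool name):
#   describe_state "$(bclone_state rpool)"
```

A state of `enabled` only says that no cloned blocks exist now, not that none were ever written.
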

prometheanfire commented 9 months ago

I've been on 2.2-based modules for a long time now (probably over a year). I've added checkpoints to the pool as part of my weekly backup process (checkpointing the pool after a week of 'testing'). I've also added an external backup I can boot from that's basically a clone of the internal drive.

oromenahar commented 9 months ago

Oh, so there's no easy way to just install 2.1.13 to check whether this is related to something other than block cloning. Thanks for the information.

prometheanfire commented 9 months ago

Yeah, I tried every kernel I had available, going back until they started failing to import because the pool has features enabled that they do not support.

sempervictus commented 9 months ago

@prometheanfire: master branch has a potential fix for the block cloning problem, but seems we're doing the same "diverge from master by a mile" thing for 2.2 so might be a while before it's in a tag (there is a PR already for 2.2.1-staging)

prometheanfire commented 9 months ago

Thanks, I'll be watching the 2.2.1 branch.

prometheanfire commented 9 months ago

I'm not sure if this traceback is directly related to the block cloning issue so I'm not sure it'd be right to close this in favor of that pull.

mtippmann commented 9 months ago

image

It's still there on current git - tested with Kernel 6.1.61 LTS, Kernel 6.6.1, Kernel 6.5.9 (all on Arch Linux)

$ zpool version
zfs-2.2.99-202_g887a3c533b
zfs-kmod-2.2.99-202_g887a3c533b

It's also there on zfs 2.2.0 - the pool just has a single encrypted dataset that is not used during the build.

I can trigger it reliably by building OpenWrt (a Linux distro for wireless routers):

$ git clone https://github.com/openwrt/openwrt
$ cd openwrt 
$ ./scripts/feeds update -a && ./scripts/feeds install -a 
$ make defconfig
$ make -j$(nproc) 
...
machine hangs 
...

Additionally, the whole pool was corrupted once, also during an OpenWrt compile, using archzfs dkms git (2023.10.26.r8843.g043c6ee3b6-1)

image

Unfortunately there is no textual representation. In that case only importing read-only and send/recv to a new pool helped. I tried to disable ZIL replay using echo 1 > /sys/module/zfs/parameters/zil_replay_disable, but it had no effect.

I also tried using https://github.com/zabbly/zfs on Ubuntu 22.04 (Kernel 5.15), but importing also fails and the machine hangs.

I dd'ed the pool to a disk, so if someone needs additional debugging information I can try to provide it.

coreutils on Arch is recent enough to use block cloning, but the whole pool was encrypted.
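
When a tunable like this seems to have no effect, it can be worth confirming that the write actually landed before the import is attempted. A small sketch along those lines; the function `set_zfs_param` and the `ZFS_PARAM_DIR` override are invented for illustration (the override only exists so the function can be tried against a scratch directory), and only the sysfs path comes from the command above.

```shell
# Hypothetical helper: write a zfs module parameter and read it back.
# ZFS_PARAM_DIR is an illustrative override, defaulting to the real sysfs path.
set_zfs_param() {
    name="$1"
    value="$2"
    dir="${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}"

    # Fail early if the parameter is missing or not writable (e.g. not root).
    if [ ! -w "$dir/$name" ]; then
        echo "cannot write $dir/$name" >&2
        return 1
    fi

    printf '%s\n' "$value" > "$dir/$name"
    # Echo the value now stored, so the caller can compare it with the request.
    cat "$dir/$name"
}

# Usage (not run here, needs root): set_zfs_param zil_replay_disable 1
```
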

oukb commented 9 months ago

The same errors with branch 2.2.1 and master 2.2.0-1_g459c99ff2

/lib/modules/6.5.11-300.fc39.x86_64/extra/zfs.ko.xz

[23433.258922] BUG: kernel NULL pointer dereference, address: 0000000000000000
[23433.258965] #PF: supervisor read access in kernel mode
[23433.258986] #PF: error_code(0x0000) - not-present page
[23433.259007] PGD 0 P4D 0
[23433.259021] Oops: 0000 [#1] PREEMPT SMP NOPTI
[23433.259038] CPU: 40 PID: 9472 Comm: dp_sync_taskq Tainted: P           OE      6.5.11-300.fc39.x86_64 #1
[23433.259064] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.6a 09/27/2023
[23433.259085] RIP: 0010:arc_write+0x6c/0x490 [zfs]
[23433.259386] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[23433.259423] RSP: 0018:ffffb104a77c7958 EFLAGS: 00010282
[23433.259438] RAX: ffffb104a77c7ad8 RBX: ffff909fc297ac68 RCX: ffff909e35dd3e50
[23433.259454] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffb104a77c7ab8
[23433.259469] RBP: ffffb104a77c7a28 R08: ffff909fc297ac68 R09: 0000000000000000
[23433.259484] R10: ffffb104a77c7a38 R11: ffffffffc142dd60 R12: 0000000000000000
[23433.259498] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[23433.259513] FS:  0000000000000000(0000) GS:ffff90da4ee00000(0000) knlGS:0000000000000000
[23433.259529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23433.259541] CR2: 0000000000000000 CR3: 0000001e37222000 CR4: 0000000000350ee0
[23433.259556] Call Trace:
[23433.259566]  <TASK>
[23433.259575]  ? __die+0x23/0x70
[23433.259589]  ? page_fault_oops+0x171/0x4e0
[23433.259604]  ? exc_page_fault+0x7f/0x180
[23433.259618]  ? asm_exc_page_fault+0x26/0x30
[23433.259632]  ? __pfx_dbuf_write_done+0x10/0x10 [zfs]
[23433.259914]  ? arc_write+0x6c/0x490 [zfs]
[23433.260191]  ? __pfx_dbuf_write_ready+0x10/0x10 [zfs]
[23433.260448]  ? __pfx_arc_write_done+0x10/0x10 [zfs]
[23433.260702]  dbuf_write+0x3d1/0x5d0 [zfs]
[23433.260959]  ? __pfx_dbuf_write_ready+0x10/0x10 [zfs]
[23433.261219]  ? __pfx_dbuf_write_done+0x10/0x10 [zfs]
[23433.261474]  ? dbuf_hold_impl+0x112/0x760 [zfs]
[23433.261727]  dbuf_sync_leaf+0x139/0x710 [zfs]
[23433.261987]  dbuf_sync_list+0xc3/0x120 [zfs]
[23433.262251]  dbuf_sync_indirect+0xe0/0x170 [zfs]
[23433.262530]  dbuf_sync_list+0x51/0x120 [zfs]
[23433.262792]  dnode_sync+0x413/0xae0 [zfs]
[23433.263083]  sync_dnodes_task+0x75/0xb0 [zfs]
[23433.263361]  taskq_thread+0x2c0/0x4e0 [spl]
[23433.263388]  ? __pfx_default_wake_function+0x10/0x10
[23433.263404]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[23433.263746]  kthread+0xe5/0x120
[23433.264039]  ? __pfx_kthread+0x10/0x10
[23433.264330]  ret_from_fork+0x31/0x50
[23433.264620]  ? __pfx_kthread+0x10/0x10
[23433.264919]  ret_from_fork_asm+0x1b/0x30
[23433.265219]  </TASK>
[23433.265499] Modules linked in: veth nf_conntrack_netlink tls lz4 lz4_compress echainiv esp4 xfrm_interface xfrm6_tunnel tunnel4 tunnel6 tun zfs(POE) xt_policy xt_nat xt_tcpmss xt_TCPMSS xt_MASQUERADE xt_set xt_multiport xt_CHECKSUM bridge ip_set_hash_ip ts_bm xt_string xt_NFLOG nfnetlink_log xt_limit ip6table_mangle ip6table_nat ip_set nfnetlink iptable_mangle iptable_nat nf_nat 8021q garp mrp stp llc cfg80211 rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ipt_REJECT nf_reject_ipv4 nct6775_core hwmon_vid xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter sunrpc binfmt_misc ipmi_ssif vfat fat intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd rapl acpi_cpufreq pcspkr joydev ocrdma acpi_ipmi ib_uverbs ipmi_si spl(OE) ipmi_devintf ib_core ipmi_msghandler ses ptdma enclosure i2c_piix4 k10temp sch_fq tcp_bbr loop ip6_tables ip_tables kvmgt mdev vfio_iommu_type1 vfio iommufd crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic kvm ghash_clmulni_intel nvme
[23433.265568]  sha512_ssse3 mpt3sas ccp nvme_core rndis_host cdc_ether usbnet irqbypass raid_class be2net tg3 scsi_transport_sas ast sp5100_tco nvme_common mii i915 i2c_algo_bit drm_buddy video wmi ttm drm_display_helper cec fuse
[23433.269212] CR2: 0000000000000000
[23433.270206] ---[ end trace 0000000000000000 ]---
[23434.562200] pstore: backend (erst) writing error (-28)
[23434.562590] RIP: 0010:arc_write+0x6c/0x490 [zfs]
[23434.563218] Code: 7a 40 48 89 b5 50 ff ff ff 41 8b 72 30 4d 8b 5a 20 48 89 95 60 ff ff ff 4d 8b 42 28 41 8b 12 48 89 8d 58 ff ff ff 45 8b 72 38 <49> 8b 1c 24 89 b5 4c ff ff ff 48 89 bd 40 ff ff ff 65 48 8b 0c 25
[23434.563968] RSP: 0018:ffffb104a77c7958 EFLAGS: 00010282
[23434.564350] RAX: ffffb104a77c7ad8 RBX: ffff909fc297ac68 RCX: ffff909e35dd3e50
[23434.564726] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffb104a77c7ab8
[23434.565110] RBP: ffffb104a77c7a28 R08: ffff909fc297ac68 R09: 0000000000000000
[23434.565503] R10: ffffb104a77c7a38 R11: ffffffffc142dd60 R12: 0000000000000000
[23434.565891] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[23434.566281] FS:  0000000000000000(0000) GS:ffff90da4ee00000(0000) knlGS:0000000000000000
[23434.566678] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23434.567073] CR2: 0000000000000000 CR3: 0000001e37222000 CR4: 0000000000350ee0
[23434.567478] note: dp_sync_taskq[9472] exited with irqs disabled
kernel: PANIC: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)

 kernel: Showing stack for process 2120
 kernel: CPU: 25 PID: 2120 Comm: txg_sync Tainted: P           OE      6.5.6-300.fc39.x86_64 #1
 kernel: Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.6a 09/27/2023
 kernel: Call Trace:
 kernel: <TASK>
 kernel: dump_stack_lvl+0x47/0x60
 kernel: vcmn_err+0xdf/0x120 [spl]
 kernel: zfs_panic_recover+0x79/0xa0 [zfs]
 kernel: range_tree_add_impl+0x28f/0xea0 [zfs]
 kernel: range_tree_remove_xor_add_segment+0x3b0/0x660 [zfs]
 kernel: range_tree_remove_xor_add+0x83/0x180 [zfs]
 kernel: metaslab_sync+0x275/0x960 [zfs]
 kernel: ? dmu_tx_create_dd+0xaa/0xf0 [zfs]
 kernel: vdev_sync+0x72/0x4c0 [zfs]
 kernel: ? spa_flush_metaslabs+0xfd/0x420 [zfs]
 kernel: ? dmu_tx_destroy+0xd6/0x130 [zfs]
 kernel: spa_sync+0x64d/0x1050 [zfs]
 kernel: ? spa_txg_history_init_io+0x117/0x120 [zfs]
 kernel: txg_sync_thread+0x1fe/0x390 [zfs]
 kernel: ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 kernel: thread_generic_wrapper+0x5b/0x70 [spl]
 kernel: kthread+0xe5/0x120
 kernel: ? __pfx_kthread+0x10/0x10
 kernel: ret_from_fork+0x31/0x50
 kernel: ? __pfx_kthread+0x10/0x10
 kernel: ret_from_fork_asm+0x1b/0x30
 kernel: </TASK>

and WARNING: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)

No errors, at least before running zfs upgrade (and with zfs 2.1.*).

amotin commented 9 months ago

The panics on reboot quoted here look very similar to the ones in this issue: https://github.com/openzfs/zfs/issues/15513 , which are caused by improper encryption of block cloning ZIL records. The fix for that is currently in review. The original panic, though, may or may not be related to block cloning.

mtippmann commented 7 months ago

@amotin my related issue when building OpenWrt is fixed in current master. @prometheanfire maybe you could retry with current master and check whether that fixes the issue?