openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

NULL pointer dereference in zap_leaf_lookup #13687

Open Avamander opened 1 year ago

Avamander commented 1 year ago

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  22.04
Kernel Version        5.17.0-1013-oem
Architecture          amd64
OpenZFS Version       zfs-2.1.5-1ubuntu2

Also tested with kernel 5.15.0-41-generic and ZFS 2.1.2-1ubuntu3.

Describe the problem you're observing

When I try to mount a ZFS pool, the kernel dereferences a NULL pointer and all reads stall.

Describe how to reproduce the problem

I can reliably reproduce this on my current setup, but I have no idea how to reproduce it on other machines.

Include any warning/errors/backtraces from the system logs

BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 5 PID: 2962 Comm: z_wr_iss Tainted: P           O      5.17.0-1013-oem #14-Ubuntu
Hardware name: MSI MS-7850/Z87-G41 PC Mate(MS-7850), BIOS V1.8 07/21/2014
RIP: 0010:zap_leaf_lookup+0x4d/0x170 [zfs]
Code: 48 89 55 d0 48 8b 58 18 8b 87 d0 00 00 00 8d 48 fb 83 f9 1f 0f 87 8f 5b 0d 00 48 89 da 41 bf 45 00 00 00 49 8b 76 30 41 29 c7 <0f> b7 42 20 41 29 c7 41 83 ff 3f 0f 87 4c 5b 0d 00 b8 01 00 00 00
RSP: 0018:ffffb7196e14b9e8 EFLAGS: 00010216
RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000000007
RDX: 0000000000000000 RSI: c17acc2d070f0000 RDI: ffff9e2430566100
RBP: ffffb7196e14ba28 R08: 0000000000000000 R09: ffff9e2430566100
R10: ffff9e2430566198 R11: 0000000000000000 R12: 0000000000000000
R13: ffff9e2430566100 R14: ffff9e23edb1c400 R15: 0000000000000039
FS:  0000000000000000(0000) GS:ffff9e2a3fb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 00000006be610001 CR4: 00000000001706e0
Call Trace:
 <TASK>
 fzap_length+0x85/0xf0 [zfs]
 zap_length_uint64+0xe2/0x1c0 [zfs]
 ddt_zap_lookup+0x64/0xe0 [zfs]
 ddt_lookup+0x130/0x260 [zfs]
 ? abd_checksum_SHA256+0xd0/0xd0 [zfs]
 ? zio_checksum_compute+0x10d/0x560 [zfs]
 ? __kmalloc_node+0x1c4/0x3e0
 ? spl_kmem_alloc+0xb6/0x100 [spl]
 zio_ddt_write+0x68/0x430 [zfs]
 zio_execute+0x97/0x160 [zfs]
 taskq_thread+0x29c/0x4c0 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_gang_tree_free+0x70/0x70 [zfs]
 ? taskq_thread_spawn+0x60/0x60 [spl]
 kthread+0xee/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>
Modules linked in: nvme_fabrics overlay ip6t_REJECT nf_reject_ipv6 nft_chain_nat xt_nat lz4 lz4_compress zram xt_MASQUERADE nf_nat xt_addrtype nft_limit xt_LOG nf_log_syslog xt_limit xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi zfs(PO) snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep zunicode(PO) intel_rapl_msr snd_pcm intel_rapl_common zzstd(O) snd_seq_midi x86_pkg_temp_thermal snd_seq_midi_event zlua(O) intel_powerclamp nls_iso8859_1 snd_rawmidi zavl(PO) kvm_intel snd_seq icp(PO) mei_hdcp mei_pxp kvm zcommon(PO) ch341 snd_seq_device snd_timer tcp_bbr znvpair(PO) rapl usbserial input_leds snd spl(O) mei_me intel_cstate sch_cake soundcore mei at24 mac_hid tpm_infineon coretemp tcp_lp ip6_tables ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp ramoops pstore_blk parport mtd reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c dm_crypt dm_mirror dm_region_hash dm_log mlx4_ib ib_uverbs mlx4_en ib_core hid_generic usbhid hid uas usb_storage i915 i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt crct10dif_pclmul crc32_pclmul fb_sys_fops ghash_clmulni_intel cec aesni_intel rc_core nvme ahci mxm_wmi crypto_simd r8169 xhci_pci drm i2c_i801 libahci cryptd mlx4_core nvme_core lpc_ich realtek xhci_pci_renesas wmi i2c_smbus video
CR2: 0000000000000020
---[ end trace 0000000000000000 ]---
RIP: 0010:zap_leaf_lookup+0x4d/0x170 [zfs]
Code: 48 89 55 d0 48 8b 58 18 8b 87 d0 00 00 00 8d 48 fb 83 f9 1f 0f 87 8f 5b 0d 00 48 89 da 41 bf 45 00 00 00 49 8b 76 30 41 29 c7 <0f> b7 42 20 41 29 c7 41 83 ff 3f 0f 87 4c 5b 0d 00 b8 01 00 00 00
RSP: 0018:ffffb7196e14b9e8 EFLAGS: 00010216
RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000000007
RDX: 0000000000000000 RSI: c17acc2d070f0000 RDI: ffff9e2430566100
RBP: ffffb7196e14ba28 R08: 0000000000000000 R09: ffff9e2430566100
R10: ffff9e2430566198 R11: 0000000000000000 R12: 0000000000000000
R13: ffff9e2430566100 R14: ffff9e23edb1c400 R15: 0000000000000039
FS:  0000000000000000(0000) GS:ffff9e2a3fb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000152af6003 CR4: 00000000001706e0

Older kernel, older ZFS:

BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP PTI
CPU: 6 PID: 891 Comm: z_wr_iss Tainted: P           O      5.15.0-41-generic #44-Ubuntu
Hardware name: MSI MS-7850/Z87-G41 PC Mate(MS-7850), BIOS V1.8 07/21/2014
RIP: 0010:zap_leaf_lookup+0x4d/0x160 [zfs]
Code: 48 89 55 d0 48 8b 58 18 8b 87 d0 00 00 00 8d 48 fb 83 f9 1f 0f 87 7c 5d 0d 00 48 89 da 41 bf 45 00 00 00 49 8b 76 30 41 29 c7 <0f> b7 42 20 41 29 c7 41 83 ff 3f 0f 87 39 5d 0d 00 b8 01 00 00 00
RSP: 0018:ffffae8e00a0b9e8 EFLAGS: 00010216
RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000000007
RDX: 0000000000000000 RSI: c17acc2d070f0000 RDI: ffff967b23ddb500
RBP: ffffae8e00a0ba28 R08: 0000000000000000 R09: ffff967b23ddb500
R10: ffff967b23ddb598 R11: 0000000000000034 R12: 0000000000000000
R13: ffff967b23ddb500 R14: ffff967b1cfd7e00 R15: 0000000000000039
FS:  0000000000000000(0000) GS:ffff9681ffb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 000000075e410003 CR4: 00000000001706e0
Call Trace:
 <TASK>
 fzap_length+0x84/0xf0 [zfs]
 zap_length_uint64+0xe1/0x1c0 [zfs]
 ddt_zap_lookup+0x64/0xe0 [zfs]
 ddt_lookup+0x113/0x240 [zfs]
 ? abd_checksum_SHA256+0xd0/0xd0 [zfs]
 ? zio_checksum_compute+0x10c/0x560 [zfs]
 ? __kmalloc_node+0x166/0x3a0
 ? spl_kmem_alloc+0xb5/0x100 [spl]
 ? __cond_resched+0x1a/0x50
 zio_ddt_write+0x68/0x430 [zfs]
 zio_execute+0x97/0x160 [zfs]
 taskq_thread+0x29b/0x4c0 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_gang_tree_free+0x70/0x70 [zfs]
 ? taskq_thread_spawn+0x60/0x60 [spl]
 kthread+0x12a/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x22/0x30
 </TASK>
Modules linked in: nvme_fabrics overlay lz4 lz4_compress ip6t_REJECT nf_reject_ipv6 zram nft_chain_nat xt_nat xt_MASQUERADE nf_nat xt_addrtype nft_limit nft_counter xt_LOG nf_log_syslog xt_limit xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec intel_rapl_msr nls_iso8859_1 intel_rapl_common snd_hda_core snd_hwdep x86_pkg_temp_thermal intel_powerclamp snd_pcm mei_hdcp zfs(PO) snd_seq_midi zunicode(PO) snd_seq_midi_event zzstd(O) kvm_intel kvm zlua(O) zavl(PO) icp(PO) ch341 rapl snd_rawmidi intel_cstate usbserial snd_seq zcommon(PO) znvpair(PO) snd_seq_device input_leds spl(O) snd_timer snd at24 mei_me soundcore mei mac_hid tpm_infineon tcp_bbr sch_cake coretemp tcp_lp ip6_tables ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp parport mtd pstore_blk ramoops reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c dm_crypt dm_mirror dm_region_hash dm_log mlx4_ib ib_uverbs mlx4_en ib_core hid_generic usbhid hid uas usb_storage i915 i2c_algo_bit ttm mxm_wmi drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd drm mlx4_core ahci nvme libahci i2c_i801 r8169 nvme_core i2c_smbus lpc_ich realtek xhci_pci xhci_pci_renesas wmi video
CR2: 0000000000000020
---[ end trace d17346996de814e2 ]---
RIP: 0010:zap_leaf_lookup+0x4d/0x160 [zfs]
Code: 48 89 55 d0 48 8b 58 18 8b 87 d0 00 00 00 8d 48 fb 83 f9 1f 0f 87 7c 5d 0d 00 48 89 da 41 bf 45 00 00 00 49 8b 76 30 41 29 c7 <0f> b7 42 20 41 29 c7 41 83 ff 3f 0f 87 39 5d 0d 00 b8 01 00 00 00
RSP: 0018:ffffae8e00a0b9e8 EFLAGS: 00010216
RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000000007
RDX: 0000000000000000 RSI: c17acc2d070f0000 RDI: ffff967b23ddb500
RBP: ffffae8e00a0ba28 R08: 0000000000000000 R09: ffff967b23ddb500
R10: ffff967b23ddb598 R11: 0000000000000034 R12: 0000000000000000
R13: ffff967b23ddb500 R14: ffff967b1cfd7e00 R15: 0000000000000039
FS:  0000000000000000(0000) GS:ffff9681ffb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 000000011c3e6005 CR4: 00000000001706e0
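
For anyone reading the `#PF:` lines in these traces, here is a minimal sketch of how the page-fault error code maps to the text the kernel prints. The bit positions are the architectural x86 definitions. The function name `describe_pf` is only illustrative and is not kernel code.

```python
# x86 page-fault error code bits (architectural definitions).
PF_PROT = 1 << 0   # 0 = page not present, 1 = protection violation
PF_WRITE = 1 << 1  # 0 = read, 1 = write
PF_USER = 1 << 2   # 0 = supervisor (kernel) access, 1 = user access
PF_RSVD = 1 << 3   # reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # fault happened on an instruction fetch


def describe_pf(error_code: int) -> str:
    """Return a short description of an x86 page-fault error code."""
    privilege = "user" if error_code & PF_USER else "supervisor"
    if error_code & PF_INSTR:
        access = "instruction fetch"
    elif error_code & PF_WRITE:
        access = "write access"
    else:
        access = "read access"
    if not error_code & PF_PROT:
        cause = "not-present page"
    elif error_code & PF_RSVD:
        cause = "reserved bit violation"
    else:
        cause = "permissions violation"
    return f"{privilege} {access}, {cause}"


print(describe_pf(0x0000))  # the code in the zap_leaf_lookup oops
print(describe_pf(0x0002))  # the code seen in a write fault
```

So `error_code(0x0000)` here is a kernel-mode read of an unmapped page, which matches the "supervisor read access" line in both traces.
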
Avamander commented 1 year ago

I unplugged all ZFS drives; it now crashes in a different location:

BUG: kernel NULL pointer dereference, address: 00000000000006c8
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1105 Comm: agents Tainted: P           O      5.17.0-1013-oem #14-Ubuntu
Hardware name: MSI MS-7850/Z87-G41 PC Mate(MS-7850), BIOS V1.8 07/21/2014
RIP: 0010:mutex_lock+0x1e/0x40
Code: c3 cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 0d e7 ff ff 31 c0 65 48 8b 14 25 c0 fb 01 00 <f0> 49 0f b1 14 24 75 07 4c 8b 65 f8 c9 c3 cc 4c 89 e7 e8 ab ff ff
RSP: 0018:ffffa647c0c17b98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8f20e1ddb300 RSI: 0000000000000000 RDI: 00000000000006c8
RBP: ffffa647c0c17ba0 R08: ffff8f20da80cea0 R09: ffff8f20da80cea0
R10: 0000000040000000 R11: 0000000000000000 R12: 00000000000006c8
R13: 00000000000006e8 R14: 00000000000006c8 R15: 0000000000000000
FS:  00007fc46af01640(0000) GS:ffff8f27bfa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000006c8 CR3: 000000012c0d0002 CR4: 00000000001706e0
Call Trace:
 <TASK>
 rrw_enter_read_impl+0x23/0x180 [zfs]
 rrw_enter+0x1d/0x20 [zfs]
 dsl_pool_config_enter+0x1d/0x20 [zfs]
 spa_prop_get+0x92/0x860 [zfs]
 ? spl_kmem_free+0x2b/0x40 [spl]
 ? kfree+0x379/0x410
 ? mutex_lock+0x13/0x40
 ? spa_keystore_fini+0x69/0x90 [zfs]
 ? mutex_lock+0x13/0x40
 ? spa_deactivate+0x325/0x450 [zfs]
 ? spa_name_compare+0xe/0x30 [zfs]
 ? avl_find+0x6b/0xd0 [zavl]
 zfs_ioc_pool_get_props+0x7d/0x140 [zfs]
 zfsdev_ioctl_common+0x7bb/0x9e0 [zfs]
 ? _copy_from_user+0x2e/0x70
 zfsdev_ioctl+0x57/0xe0 [zfs]
 __x64_sys_ioctl+0x92/0xd0
 do_syscall_64+0x5c/0xc0
 ? asm_exc_page_fault+0x8/0x30
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc46bf8faff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007fc46aefb440 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fc45c0025d0 RCX: 00007fc46bf8faff
RDX: 00007fc46aefb4a0 RSI: 0000000000005a27 RDI: 000000000000000b
RBP: 00007fc46aefea80 R08: 00007fc45c000000 R09: 00007fc45c02fca0
R10: 00007fc45c030000 R11: 0000000000000246 R12: 00007fc46aefb4a0
R13: 0000557848334340 R14: 0000000000000000 R15: 00007fc46aefeb20
 </TASK>
Modules linked in: overlay lz4 ip6t_REJECT lz4_compress nf_reject_ipv6 zram nft_chain_nat xt_nat xt_MASQUERADE nf_nat xt_addrtype nft_limit xt_LOG nf_log_syslog xt_limit xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec intel_rapl_msr snd_hda_core intel_rapl_common x86_pkg_temp_thermal snd_hwdep nls_iso8859_1 intel_powerclamp snd_pcm snd_seq_midi kvm_intel snd_seq_midi_event mei_hdcp mei_pxp kvm snd_rawmidi snd_seq rapl snd_seq_device ch341 intel_cstate snd_timer usbserial mei_me snd input_leds at24 mei soundcore tpm_infineon mac_hid tcp_bbr sch_cake coretemp tcp_lp ip6_tables ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp ramoops pstore_blk parport mtd reed_solomon
 pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c dm_crypt dm_mirror dm_region_hash dm_log mlx4_ib ib_uverbs mlx4_en ib_core hid_generic usbhid hid uas usb_storage i915 i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec aesni_intel rc_core ahci i2c_i801 crypto_simd mxm_wmi r8169 xhci_pci drm cryptd mlx4_core libahci i2c_smbus lpc_ich realtek xhci_pci_renesas wmi video
CR2: 00000000000006c8
---[ end trace 0000000000000000 ]---
RIP: 0010:mutex_lock+0x1e/0x40
Code: c3 cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 0d e7 ff ff 31 c0 65 48 8b 14 25 c0 fb 01 00 <f0> 49 0f b1 14 24 75 07 4c 8b 65 f8 c9 c3 cc 4c 89 e7 e8 ab ff ff
RSP: 0018:ffffa647c0c17b98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8f20e1ddb300 RSI: 0000000000000000 RDI: 00000000000006c8
RBP: ffffa647c0c17ba0 R08: ffff8f20da80cea0 R09: ffff8f20da80cea0
R10: 0000000040000000 R11: 0000000000000000 R12: 00000000000006c8
R13: 00000000000006e8 R14: 00000000000006c8 R15: 0000000000000000
FS:  00007fc46af01640(0000) GS:ffff8f27bfa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000006c8 CR3: 000000012c0d0002 CR4: 00000000001706e0

Avamander commented 1 year ago

I removed zfs.cache and reimported the pool; it is no longer crashing. Yay.

Though this occurrence is a bit spooky in three ways. It is bad that the file can even get corrupted like that (is it not written safely?). It's very bad that there is no validation when reading it back in (some sanity checks, please). Lastly, it's terrible that the entire module can crash in so many different ways because of that.

I hope the rest of the safety features compensated, but this certainly does not instil confidence.
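
The comment above does not give the exact commands, so the following is only a dry-run sketch of what "remove the cache file and reimport" could look like. It assumes the file meant is the default pool cache at `/etc/zfs/zpool.cache`; the pool name `tank`, the search directory, and the helper `reimport_plan` are hypothetical. Nothing is executed; the script only prints the commands.

```python
import shlex


def reimport_plan(pool, cachefile="/etc/zfs/zpool.cache",
                  search_dir="/dev/disk/by-id"):
    """Return the commands, in order, as argv lists (nothing is executed)."""
    return [
        # Move the cache file aside instead of deleting it, so it can be inspected later.
        ["mv", cachefile, cachefile + ".bak"],
        # Import by scanning devices rather than trusting the cache file.
        ["zpool", "import", "-d", search_dir, pool],
        # Have ZFS write a fresh cache file for the imported pool.
        ["zpool", "set", f"cachefile={cachefile}", pool],
    ]


for argv in reimport_plan("tank"):  # "tank" is a placeholder pool name
    print(shlex.join(argv))
```

Keeping the old file as `.bak` also makes it possible to attach it to a bug report if it really is corrupted.
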

behlendorf commented 1 year ago

@Avamander the crashes you observed occurred in unrelated areas of the code and both suggest kernel memory corruption. Is there anything else you changed on the system which might explain why this is no longer happening? It's hard to imagine how removing the cache file would have had any effect on this. It is written safely and rigorously validated.

Avamander commented 1 year ago

@behlendorf

Kernel memory corruption sounds incredibly unlikely unless it's ZFS "self-inflicting" it somehow. The crashes persisted and recurred after reboots (in the same location across kernel and module versions), and they only happened when there was an attempt to mount the ZFS pool.

The only hardware change was to temporarily unplug the pool physically, just to see if ZFS remained stable without the pool. It didn't; the resulting crash is also visible above. That led me to that file, and deleting it made the crash disappear. Then I plugged the pool back in and I didn't encounter the first crash either.

Neither crash has recurred for 1 day and 2 hours, and hopefully the scrub finishes successfully.

Avamander commented 1 year ago

It finally imported after four days, using an older txg via -T; it started scrubbing and then hung.

Now I'm seeing something very similar to this: https://github.com/openzfs/zfs/issues/7603#issuecomment-1128777521

Similar backtrace:

BUG: unable to handle page fault for address: 00000000000032b8
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 2 PID: 1050 Comm: txg_sync Tainted: P           O      5.17.0-1014-oem #15-Ubuntu
Hardware name: MSI MS-7850/Z87-G41 PC Mate(MS-7850), BIOS V1.8 07/21/2014
RIP: 0010:mutex_lock+0x1e/0x40
Code: c3 cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 0d e7 ff ff 31 c0 65 48 8b 14 25 c0 fb 01 00 <f0> 49 0f b1 14 24 75 07 4c 8b 65 f8 c9 c3 cc 4c 89 e7 e8 ab ff ff
RSP: 0018:ffffaa5ec27935a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000020 RCX: 802a070200070007
RDX: ffff9ef272190000 RSI: 0000000000ff949f RDI: 00000000000032b8
RBP: ffffaa5ec27935b0 R08: 0000000000028000 R09: 00000c04dbec2000
R10: ffff9ef2c9ff6760 R11: ffffaa5ec2793768 R12: 00000000000032b8
R13: ffff9ef322949738 R14: 0000000000000000 R15: 00000001fffffe00
FS:  0000000000000000(0000) GS:ffff9ef93fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000032b8 CR3: 0000000590c10002 CR4: 00000000001706e0
Call Trace:
 <TASK>
 dsl_scan_scrub_cb+0x4ae/0x940 [zfs]
 ? __kmalloc_node+0x1c4/0x3e0
 ? ktime_get_raw_ts64+0x47/0xd0
 dsl_scan_visitbp.isra.0+0x739/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x634/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x888/0xce0 [zfs]
 dsl_scan_visit_rootbp.isra.0+0x125/0x1b0 [zfs]
 dsl_scan_sync+0x11c0/0x13b0 [zfs]
 spa_sync+0x5c6/0x1010 [zfs]
 ? spa_txg_history_init_io+0x107/0x110 [zfs]
 txg_sync_thread+0x2bf/0x450 [zfs]
 ? txg_register_callbacks+0xb0/0xb0 [zfs]
 ? __thread_exit+0x20/0x20 [spl]
 thread_generic_wrapper+0x64/0x70 [spl]
 kthread+0xee/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>
Modules linked in: nvme_fabrics overlay ip6t_REJECT nf_reject_ipv6 nft_chain_nat xt_nat lz4 lz4_compress zram xt_MASQUERADE nf_nat xt_addrtype nft_limit xt_LOG nf_log_syslog xt_limit xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio zfs(PO) snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core zunicode(PO) snd_hwdep intel_rapl_msr nls_iso8859_1 intel_rapl_common snd_pcm zzstd(O) x86_pkg_temp_thermal snd_seq_midi zlua(O) intel_powerclamp snd_seq_midi_event zavl(PO) snd_rawmidi kvm_intel mei_hdcp icp(PO) mei_pxp snd_seq kvm snd_seq_device zcommon(PO) rapl snd_timer znvpair(PO) ch341 intel_cstate spl(O) snd usbserial input_leds mei_me at24 soundcore mei tpm_infineon tcp_bbr mac_hid sch_cake coretemp tcp_lp ip6_tables ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp ramoops parport mtd
 reed_solomon pstore_blk efi_pstore pstore_zone ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c dm_crypt dm_mirror dm_region_hash dm_log mlx4_ib ib_uverbs mlx4_en ib_core hid_generic usbhid uas usb_storage hid i915 i2c_algo_bit ttm drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel fb_sys_fops cec aesni_intel rc_core nvme ahci i2c_i801 crypto_simd mxm_wmi r8169 xhci_pci drm mlx4_core nvme_core libahci i2c_smbus lpc_ich cryptd realtek xhci_pci_renesas wmi video
CR2: 00000000000032b8
---[ end trace 0000000000000000 ]---
RIP: 0010:mutex_lock+0x1e/0x40
Code: c3 cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 0d e7 ff ff 31 c0 65 48 8b 14 25 c0 fb 01 00 <f0> 49 0f b1 14 24 75 07 4c 8b 65 f8 c9 c3 cc 4c 89 e7 e8 ab ff ff
RSP: 0018:ffffaa5ec27935a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000020 RCX: 802a070200070007
RDX: ffff9ef272190000 RSI: 0000000000ff949f RDI: 00000000000032b8
RBP: ffffaa5ec27935b0 R08: 0000000000028000 R09: 00000c04dbec2000
R10: ffff9ef2c9ff6760 R11: ffffaa5ec2793768 R12: 00000000000032b8
R13: ffff9ef322949738 R14: 0000000000000000 R15: 00000001fffffe00
FS:  0000000000000000(0000) GS:ffff9ef93fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000032b8 CR3: 0000000122108004 CR4: 00000000001706e0
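
A note on the different heading in this trace ("unable to handle page fault" instead of "NULL pointer dereference"): as far as I understand the x86 fault handler, it chooses the heading for a kernel-mode fault purely by whether the faulting address lies below one page. The sketch below assumes 4 KiB pages and kernel-mode faults; `oops_heading` is an illustrative name, not a kernel function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86-64


def oops_heading(address: int) -> str:
    """Heading a kernel-mode fault at `address` would get under the stated assumption."""
    if address < PAGE_SIZE:
        return "kernel NULL pointer dereference"
    return "unable to handle page fault"


# CR2 values from the three traces posted by Avamander in this thread.
for cr2 in (0x20, 0x6C8, 0x32B8):
    print(f"{cr2:#06x}: {oops_heading(cr2)}")
```

Under that reading, all three faults look like a small field offset added to a NULL base pointer; the third one just lands past the first page.
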
Avamander commented 1 year ago

Surprisingly fragile for a supposedly failure-resistant filesystem.

PaulZ-98 commented 1 year ago

Could this be a stack overflow, given the 9 levels of dsl_scan_visitbp recursion?
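
To double-check the recursion depth, here is a small sketch that counts frames in the scrub call trace posted above. It skips entries prefixed with `?`, which the kernel marks as unreliable. The regular expression and the name `count_frames` are my own. A frame count alone says nothing about how many bytes of stack each frame uses.

```python
import re
from collections import Counter

TRACE = """\
 dsl_scan_scrub_cb+0x4ae/0x940 [zfs]
 ? __kmalloc_node+0x1c4/0x3e0
 ? ktime_get_raw_ts64+0x47/0xd0
 dsl_scan_visitbp.isra.0+0x739/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x634/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x3c2/0xce0 [zfs]
 dsl_scan_visitbp.isra.0+0x888/0xce0 [zfs]
 dsl_scan_visit_rootbp.isra.0+0x125/0x1b0 [zfs]
 dsl_scan_sync+0x11c0/0x13b0 [zfs]
 spa_sync+0x5c6/0x1010 [zfs]
 ? spa_txg_history_init_io+0x107/0x110 [zfs]
 txg_sync_thread+0x2bf/0x450 [zfs]
 ? txg_register_callbacks+0xb0/0xb0 [zfs]
 ? __thread_exit+0x20/0x20 [spl]
 thread_generic_wrapper+0x64/0x70 [spl]
 kthread+0xee/0x120
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
"""

# Optional "? " marker, then symbol+0xoffset/0xsize.
FRAME = re.compile(r"^\s*(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_frames(trace: str) -> Counter:
    """Count reliable frames (no leading '?') per symbol."""
    counts = Counter()
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and not match.group(1):
            counts[match.group(2)] += 1
    return counts


counts = count_frames(TRACE)
print(counts["dsl_scan_visitbp.isra.0"], sum(counts.values()))
```
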

hmaarrfk commented 11 months ago
[30094.728714] general protection fault, probably for non-canonical address 0xfbff895d6de93330: 0000 [#1] PREEMPT SMP NOPTI                                                                                                             
[30094.728731] CPU: 4 PID: 3880 Comm: z_wr_int_2 Tainted: P           OE     5.19.0-50-generic #50-Ubuntu           
[30094.728736] Hardware name: ASUS System Product Name/ProArt X570-CREATOR WIFI, BIOS 1201 04/19/2023               
[30094.728738] RIP: 0010:zio_done+0x4ab/0x1270 [zfs]                                                                
[30094.728849] Code: 48 89 45 b8 e9 fe 00 00 00 49 8b 8f 30 01 00 00 48 8b 1c 0a 48 39 5d c0 0f 84 12 01 00 00 48 29 cb 48 85 db 0f 84 7d 0b 00 00 <48> 8b 03 48 89 45 d0 4c 89 fe 4c 89 f7 e8 93 62 ff ff 45 8b 6f 74                  
[30094.728851] RSP: 0018:ffffa2de9ff53d40 EFLAGS: 00010286                                                          
[30094.728853] RAX: 0000000000000000 RBX: fbff895d6de93330 RCX: 0000000000000010                                    
[30094.728854] RDX: ffff895d6de931b0 RSI: 0000000000000000 RDI: 0000000000000000                                    
[30094.728855] RBP: ffffa2de9ff53da0 R08: 0000000000000000 R09: 0000000000000000                                    
[30094.728856] R10: 0000000000000000 R11: 0000000000000000 R12: ffff89545325d1c8                                    
[30094.728857] R13: 0000000000200000 R14: ffff895fbe739860 R15: ffff896192e49d40                                    
[30094.728858] FS:  0000000000000000(0000) GS:ffff89722e100000(0000) knlGS:0000000000000000                         
[30094.728859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                    
[30094.728860] CR2: 000000c001241000 CR3: 0000000115ed0000 CR4: 0000000000750ee0                                    
[30094.728862] PKRU: 55555554                                                                                       
[30094.728862] Call Trace:                                                                                          
[30094.728864]  <TASK>                                                                                              
[30094.728867]  zio_execute+0x97/0x170 [zfs]                                                                        
[30094.728913]  taskq_thread+0x2aa/0x4d0 [spl]                                                                      
[30094.728918]  ? wake_up_q+0xa0/0xa0                                                                               
[30094.728923]  ? zio_gang_tree_free+0x70/0x70 [zfs]                                                                
[30094.728962]  ? taskq_thread_spawn+0x60/0x60 [spl]                                                                
[30094.728966]  kthread+0xee/0x120                                                                                  
[30094.728968]  ? kthread_complete_and_exit+0x20/0x20                                                               
[30094.728970]  ret_from_fork+0x22/0x30                                                                             
[30094.728973]  </TASK> 
[30094.728974] Modules linked in: nvme_fabrics rfcomm cmac algif_hash algif_skcipher af_alg bnep binfmt_misc nvidia_uvm(POE) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg nvidia_drm(POE) snd_intel_sdw_acpi snd_hda_codec nvidia_modeset(POE) zfs(POE) intel_rapl_msr snd_hda_core zunicode(POE) intel_rapl_common snd_hwdep zzstd(OE) snd_pcm iwlmvm edac_mce_amd snd_seq_midi zlua(OE) snd_seq_midi_event zavl(POE) nvidia(POE) nls_iso8859_1 snd_rawmidi asus_ec_sensors icp(POE) mac80211 btusb kvm snd_seq btrtl libarc4 crct10dif_pclmul ghash_clmulni_intel btbcm snd_seq_device zcommon(POE) drm_kms_helper snd_timer aesni_intel btintel ucsi_ccg crypto_simd btmtk fb_sys_fops znvpair(POE) cryptd iwlwifi snd typec_ucsi syscopyarea sysfillrect rapl wmi_bmof spl(OE) input_leds asus_nb_wmi eeepc_wmi intel_wmi_thunderbolt joydev k10temp ccp typec bluetooth soundcore sysimgblt cfg80211 ecdh_generic ecc mac_hid sch_fq_codel msr parport_pc ppdev lp
[30094.729011]  parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid mfd_aaeon asus_wmi sparse_keymap ixgbe nvme xfrm_algo platform_profile crc32_pclmul atlantic i2c_nvidia_gpu xhci_pci i2c_piix4 i2c_ccgx_ucsi dca ahci igc macsec thunderbolt xhci_pci_renesas nvme_core libahci mdio wmi video
[30094.729066] ---[ end trace 0000000000000000 ]---
[30094.842553] RIP: 0010:zio_done+0x4ab/0x1270 [zfs]
[30094.842645] Code: 48 89 45 b8 e9 fe 00 00 00 49 8b 8f 30 01 00 00 48 8b 1c 0a 48 39 5d c0 0f 84 12 01 00 00 48 29 cb 48 85 db 0f 84 7d 0b 00 00 <48> 8b 03 48 89 45 d0 4c 89 fe 4c 89 f7 e8 93 62 ff ff 45 8b 6f 74
[30094.842647] RSP: 0018:ffffa2de9ff53d40 EFLAGS: 00010286 
[30094.842650] RAX: 0000000000000000 RBX: fbff895d6de93330 RCX: 0000000000000010
[30094.842651] RDX: ffff895d6de931b0 RSI: 0000000000000000 RDI: 0000000000000000
[30094.842652] RBP: ffffa2de9ff53da0 R08: 0000000000000000 R09: 0000000000000000
[30094.842653] R10: 0000000000000000 R11: 0000000000000000 R12: ffff89545325d1c8
[30094.842654] R13: 0000000000200000 R14: ffff895fbe739860 R15: ffff896192e49d40
[30094.842656] FS:  0000000000000000(0000) GS:ffff89722e100000(0000) knlGS:0000000000000000
[30094.842657] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30094.842659] CR2: 000000c001241000 CR3: 0000000115ed0000 CR4: 0000000000750ee0
[30094.842660] PKRU: 55555554
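
One observation on the "non-canonical address" in this trace, as a hedged sketch. It assumes 48-bit virtual addresses (4-level paging), where bits 63..47 must all be equal. The helper names are illustrative. The code compares the value in RBX with its sign-extended (canonical) form and counts the differing bits.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def sign_extend(addr: int, va_bits: int = 48) -> int:
    """Canonical 64-bit address that shares the low va_bits with `addr`."""
    low = addr & ((1 << va_bits) - 1)
    if low >> (va_bits - 1):  # bit 47 set: upper bits must all be ones
        low |= ((1 << 64) - 1) ^ ((1 << va_bits) - 1)
    return low


rbx = 0xFBFF895D6DE93330  # RBX from the zio_done trace
canonical = sign_extend(rbx)
differing_bits = bin(rbx ^ canonical).count("1")

print(is_canonical(rbx))               # False
print(hex(canonical), differing_bits)  # 0xffff895d6de93330 1
```

The pointer is one bit away from a canonical kernel address. That is consistent with a single flipped bit, but it does not prove one, and it does not say whether the cause would be hardware or software.
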

I seem to be getting a similar error. In my case, I'm downloading from 3 rclone processes, each downloading 8 files simultaneously,