openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

BUG: kernel NULL pointer dereference in zap_lockdir #11804

Open jjakob opened 3 years ago

jjakob commented 3 years ago

System information

Type                  Version/Name
Distribution Name     Proxmox VE
Distribution Version  6
Linux Kernel          5.4.103-1-pve
Architecture          x86_64
ZFS Version           2.0.3-pve
SPL Version           2.0.3-pve2

Describe the problem you're observing

The bug happened after a nightly syncoid run (which uses zfs send and zfs receive) that froze the filesystem until the system was rebooted.
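
For context, a syncoid run boils down to a zfs send piped into a zfs receive, typically over SSH. A minimal sketch of the kind of pipeline the nightly job performs (dataset and host names are hypothetical, not taken from this report):

# incremental replication of the newest snapshot to a backup host
zfs send -I tank/data@snap-old tank/data@snap-new | \
  ssh backup-host 'zfs receive -F backuppool/data'

The backtrace below shows the panic occurring on the sending side, inside the zfs send ioctl (zfs_ioc_send).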

Describe how to reproduce the problem

I could not reproduce the bug on a newer kernel/ZFS version (kernel 5.4.106-1-pve, zfs 2.0.4-pve1), but the bug only happens occasionally, so I'll see over time whether it still occurs.

Include any warning/errors/backtraces from the system logs

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 21353 Comm: zfs Tainted: P           O      5.4.103-1-pve #1
Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 02/19/2020
RIP: 0010:zap_lockdir_impl+0x284/0x750 [zfs]
Code: d2 75 92 4d 89 b7 98 00 00 00 48 8b 75 80 48 89 df e8 50 2e f4 ff eb 97 48 8b 43 18 48 c7 c2 18 08 7a c0 31 f6 bf 28 01 00 00 <48> 8b 08 48 8b 40 08 48 89 8d 58 ff ff ff b9 a3 01 00 00 48 89 85
RSP: 0018:ffffb298d0f3b748 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9d84afbc7980 RCX: 0000000000000000
RDX: ffffffffc07a0818 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffb298d0f3b800 R08: 0000000000000001 R09: 0000000000000000
R10: 2000000000000000 R11: 0000008000000000 R12: ffffb298d0f3b888
R13: 0000000000000002 R14: ffff9d84d2e4e800 R15: 0000000000000000
FS:  00007f1e2a3da7c0(0000) GS:ffff9d863aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002580f8000 CR4: 00000000001406f0
Call Trace:
 zap_lockdir+0x8b/0xb0 [zfs]
 zap_lookup_norm+0x5d/0xc0 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 zfs_get_zplprop+0x9d/0x190 [zfs]
 dmu_send_impl+0xf4e/0x1500 [zfs]
 ? dsl_dataset_feature_is_active+0x50/0x50 [zfs]
 ? dnode_rele_and_unlock+0x68/0xe0 [zfs]
 ? dnode_rele+0x3b/0x40 [zfs]
 ? dbuf_rele_and_unlock+0x306/0x6a0 [zfs]
 ? dbuf_rele+0x3b/0x40 [zfs]
 ? dmu_buf_rele+0xe/0x10 [zfs]
 dmu_send_obj+0x245/0x350 [zfs]
 zfs_ioc_send+0x11d/0x2c0 [zfs]
 ? zfs_ioc_send+0x2c0/0x2c0 [zfs]
 zfsdev_ioctl_common+0x5b2/0x820 [zfs]
 ? __kmalloc_node+0x267/0x330
 ? spl_kmem_free+0x33/0x40 [spl]
 zfsdev_ioctl+0x54/0xe0 [zfs]
 do_vfs_ioctl+0xa9/0x640
 ? handle_mm_fault+0xc9/0x1f0
 ksys_ioctl+0x67/0x90
 __x64_sys_ioctl+0x1a/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f1e2a2c5427
Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd9295c0d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000055fffc274350 RCX: 00007f1e2a2c5427
RDX: 00007ffd9295c110 RSI: 0000000000005a1c RDI: 0000000000000003
RBP: 00007ffd9295fb00 R08: 000055fffc2e1a50 R09: 000000000000000f
R10: 000055fffc274300 R11: 0000000000000246 R12: 00007ffd9295c110
R13: 000055fffc274360 R14: 0000000000000000 R15: 000000000001790f
Modules linked in: binfmt_misc dm_crypt twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic algif_skcipher af_alg tcp_diag udp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter bonding softdog nfnetlink_log nfnetlink amd64_edac_mod edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd amdgpu cryptd glue_helper amd_iommu_v2 gpu_sched ttm pcspkr fam15h_power k10temp drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt 8250_dw mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi jc42 sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress hid_generic raid6_pq libcrc32c
 usbhid hid uas usb_storage i2c_piix4 xhci_pci ehci_pci ehci_hcd xhci_hcd ahci tg3 libahci video
CR2: 0000000000000000
---[ end trace a5fc98b08b735453 ]---
RIP: 0010:zap_lockdir_impl+0x284/0x750 [zfs]
Code: d2 75 92 4d 89 b7 98 00 00 00 48 8b 75 80 48 89 df e8 50 2e f4 ff eb 97 48 8b 43 18 48 c7 c2 18 08 7a c0 31 f6 bf 28 01 00 00 <48> 8b 08 48 8b 40 08 48 89 8d 58 ff ff ff b9 a3 01 00 00 48 89 85
RSP: 0018:ffffb298d0f3b748 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9d84afbc7980 RCX: 0000000000000000
RDX: ffffffffc07a0818 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffb298d0f3b800 R08: 0000000000000001 R09: 0000000000000000
R10: 2000000000000000 R11: 0000008000000000 R12: ffffb298d0f3b888
R13: 0000000000000002 R14: ffff9d84d2e4e800 R15: 0000000000000000
FS:  00007f1e2a3da7c0(0000) GS:ffff9d863aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002580f8000 CR4: 00000000001406f0
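
For anyone triaging this: RAX is 0000000000000000, and the bytes after the <...> marker in the Code: line decode to a load through RAX, i.e. a NULL pointer dereference inside zap_lockdir_impl. The Code: line can be disassembled with the decodecode helper that ships in the kernel source tree (oops.txt is a placeholder file name; output depends on the local binutils):

# save the oops text (including the Code: line) to a file, then,
# from a kernel source checkout:
./scripts/decodecode < oops.txt
# decodecode marks the trapping instruction; here the first bytes after
# the <...> marker, 48 8b 08, are "mov rcx,[rax]" with rax == 0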
brenc commented 3 years ago

This happened to me this morning on a recently built Proxmox VE 7 server running sanoid/syncoid every minute. It's a FreeNAS Mini server, which uses an ASRock Rack C2750D4I board. I've had two of these servers running on FreeBSD 12.2 for years without issue; I just recently converted them over to Proxmox/ZFS on Linux.

Almost all zfs/zpool commands froze completely (Ctrl-C did not work), though I was able to run a zpool status. I have three pools, and for some reason my main storage pool reported this:

  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 1.45T in 05:41:01 with 0 errors on Mon Sep 13 21:55:24 2021
config:

        NAME                            STATE     READ WRITE CKSUM
        storage                         ONLINE       0     0     0
          raidz2-0                      ONLINE       0     0     0
            ata-WDC_WD60EFZX-68B3FN0_1  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_2  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_3  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_4  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_5  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_6  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68L0BN1_7  ONLINE       0     0     0
            ata-WDC_WD60EFRX-68MYMN1_8  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

This was a freshly built pool as of a few weeks ago. The only thing I've done to it since I built it was to swap out what turned out to be an SMR drive (all drives are now CMR). As you can see, the resilver completed successfully with no errors found.

The filesystems could still be accessed, though accessing a snapshot directory hung the ls command.

When I ran zpool status -v, it hung indefinitely and I had to reboot the box. When the box came back, zpool status reported no errors whatsoever. I'm running a scrub now.

I could find no other errors in any of the logs.
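
In case it helps the next person who hits this: when zfs/zpool commands hang uninterruptibly like that, the kernel stacks of the stuck tasks can be captured before rebooting. This is generic kernel debugging, not specific to ZFS:

# dump stacks of all blocked (D-state) tasks into the kernel log
# (may require sysrq to be enabled: echo 1 > /proc/sys/kernel/sysrq)
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# or inspect a single hung process, e.g. the stuck zpool command
cat /proc/$(pgrep -xo zpool)/stack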

# uname -a
Linux vmh01 5.11.22-3-pve #1 SMP PVE 5.11.22-7 (Wed, 18 Aug 2021 15:06:12 +0200) x86_64 GNU/Linux

# zfs version
zfs-2.0.5-pve1
zfs-kmod-2.0.5-pve1

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 2767864 Comm: zfs Tainted: P           O      5.11.22-3-pve #1
Hardware name: iXsystems FREENAS-MINI-2.0/C2750D4I, BIOS P3.20 03/26/2018
RIP: 0010:zap_lockdir_impl+0x2a6/0x7b0 [zfs]
Code: 86 d8 00 00 00 00 00 00 00 e8 f6 7e f3 e2 4d 89 ae 98 00 00 00 e9 47 fe ff ff 48 8b 43 18 b9 a3 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 95 58 ff ff ff 48 c7 c2 18 65 8a c0 48
RSP: 0018:ffffa5b24dcfb830 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8db0fe204d80 RCX: 00000000000001a3
RDX: ffff8db056839700 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffa5b24dcfb8f0 R08: 0000000000000001 R09: 0000000000000000
R10: 2000000000000000 R11: 0000008000000000 R12: 0000000000000002
R13: ffff8db1e3d64800 R14: 0000000000000000 R15: ffffa5b24dcfb968
FS:  00007fa1d9c747c0(0000) GS:ffff8db69fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000058b98e000 CR4: 00000000001026e0
Call Trace:
 ? dbuf_read+0x128/0x550 [zfs]
 zap_lockdir+0x8c/0xb0 [zfs]
 zap_lookup+0x50/0x100 [zfs]
 zfs_get_zplprop+0xb7/0x1a0 [zfs]
 dmu_send_impl+0xf01/0x1550 [zfs]
 ? dnode_rele+0x3d/0x50 [zfs]
 ? dbuf_rele_and_unlock+0x1fd/0x610 [zfs]
 dmu_send_obj+0x24c/0x360 [zfs]
 zfs_ioc_send+0x11d/0x2c0 [zfs]
 ? zfs_ioc_send+0x2c0/0x2c0 [zfs]
 zfsdev_ioctl_common+0x72c/0x940 [zfs]
 ? __check_object_size+0x5d/0x150
 zfsdev_ioctl+0x57/0xe0 [zfs]
 __x64_sys_ioctl+0x91/0xc0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fa1da24ecc7
Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdc3efca18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000056233d62caa0 RCX: 00007fa1da24ecc7
RDX: 00007ffdc3efca50 RSI: 0000000000005a1c RDI: 0000000000000003
RBP: 00007ffdc3f00440 R08: 000000000000000f R09: 0000000000000028
R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffdc3efca50
R13: 00007ffdc3f05270 R14: 000056233d62cab0 R15: 0000000000000000
Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter softdog nfnetlink_log nfnetlink ipmi_ssif intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper intel_cstate pcspkr at24 joydev input_leds ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core nfsd iscsi_tcp libiscsi_tcp libiscsi auth_rpcgss scsi_transport_iscsi nfs_acl lockd grace drm sunrpc nfs_ssc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbkbd usbmouse usbhid hid gpio_ich mpt3sas i2c_i801 ehci_pci igb raid_class
 ahci i2c_algo_bit crc32_pclmul i2c_smbus lpc_ich ehci_hcd i2c_ismt dca libahci scsi_transport_sas
CR2: 0000000000000000
---[ end trace bfdfd0e448c4e64f ]---
RIP: 0010:zap_lockdir_impl+0x2a6/0x7b0 [zfs]
Code: 86 d8 00 00 00 00 00 00 00 e8 f6 7e f3 e2 4d 89 ae 98 00 00 00 e9 47 fe ff ff 48 8b 43 18 b9 a3 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 95 58 ff ff ff 48 c7 c2 18 65 8a c0 48
RSP: 0018:ffffa5b24dcfb830 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8db0fe204d80 RCX: 00000000000001a3
RDX: ffff8db056839700 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffa5b24dcfb8f0 R08: 0000000000000001 R09: 0000000000000000
R10: 2000000000000000 R11: 0000008000000000 R12: 0000000000000002
R13: ffff8db1e3d64800 R14: 0000000000000000 R15: ffffa5b24dcfb968
FS:  00007fa1d9c747c0(0000) GS:ffff8db69fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000058b98e000 CR4: 00000000001026e0
stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

zig commented 2 years ago

Hi, I seem to have hit a similar issue while doing a send. The system is Ubuntu 22.04 with stock ZFS (2.1.4-0ubuntu0.1), and encryption is enabled.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP NOPTI
CPU: 9 PID: 2985410 Comm: zfs Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4U2-2T, BIOS L2.02P 10/19/2020
RIP: 0010:mzap_open+0x37/0x330 [zfs]
Code: e5 41 57 49 89 f7 31 f6 41 56 41 55 41 54 53 48 89 d3 48 83 ec 10 48 8b 42 18 48 89 7d d0 48 c7 c2 a8 d3 e6 c0 bf 28 01 00 00 <4c> 8b 30 48 8b 40 08 48 89 45 c8 e8 89 1b 6a ff 48 c7 c2 0c 8d f2
RSP: 0018:ffffa6dd8ed937c8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8dc0d6897000 RCX: 00000000000001a1
RDX: ffffffffc0e6d3a8 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffa6dd8ed93800 R08: 0000000000000000 R09: 0020000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8dbfc4cef800
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  00007f60a5e8f7c0(0000) GS:ffff8dc67f640000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000b129fa003 CR4: 00000000003726e0
Call Trace:
 <TASK>
 zap_lockdir_impl+0x2cd/0x3a0 [zfs]
 zap_lockdir+0x92/0xb0 [zfs]
 zap_lookup_norm+0x5c/0xd0 [zfs]
 ? dnode_special_open+0x4d/0x90 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 zfs_get_zplprop+0xb7/0x1b0 [zfs]
 setup_featureflags+0x21b/0x260 [zfs]
 dmu_send_impl+0xdd/0xbf0 [zfs]
 ? dnode_rele+0x39/0x50 [zfs]
 ? dmu_buf_rele+0xe/0x20 [zfs]
 ? zap_unlockdir+0x46/0x60 [zfs]
 dmu_send_obj+0x265/0x340 [zfs]
 zfs_ioc_send+0xe8/0x2c0 [zfs]
 ? dump_bytes_cb+0x30/0x30 [zfs]
 zfsdev_ioctl_common+0x683/0x740 [zfs]
 ? __check_object_size.part.0+0x4a/0x150
 ? _copy_from_user+0x2e/0x70
 zfsdev_ioctl+0x57/0xf0 [zfs]
 __x64_sys_ioctl+0x92/0xd0
 do_syscall_64+0x59/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f60a670eaff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007ffe33c53e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000055d027f23bc0 RCX: 00007f60a670eaff
RDX: 00007ffe33c54300 RSI: 0000000000005a1c RDI: 0000000000000003
RBP: 00007ffe33c578f0 R08: 000055d027f27ba0 R09: 0000000000000000
R10: 0000000000000009 R11: 0000000000000246 R12: 000055d027f22660
R13: 00007ffe33c54300 R14: 000055d027f23bd0 R15: 0000000000000000
 </TASK>
Modules linked in: tls nvme_fabrics xt_nat vhost_net vhost vhost_iotlb tap ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntr>
 i2c_algo_bit drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops aesni_intel cec rc_core crypto_simd cryptd ixgbe nvme intel_lpss_pci drm xfrm_algo i2c_i801 a>
CR2: 0000000000000000
---[ end trace 63e074585d96fd8a ]---
RIP: 0010:mzap_open+0x37/0x330 [zfs]
Code: e5 41 57 49 89 f7 31 f6 41 56 41 55 41 54 53 48 89 d3 48 83 ec 10 48 8b 42 18 48 89 7d d0 48 c7 c2 a8 d3 e6 c0 bf 28 01 00 00 <4c> 8b 30 48 8b 40 08 48 89 45 c8 e8 89 1b 6a ff 48 c7 c2 0c 8d f2
RSP: 0018:ffffa6dd8ed937c8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8dc0d6897000 RCX: 00000000000001a1
RDX: ffffffffc0e6d3a8 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffa6dd8ed93800 R08: 0000000000000000 R09: 0020000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8dbfc4cef800
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  00007f60a5e8f7c0(0000) GS:ffff8dc67f640000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000b129fa003 CR4: 00000000003726e0

Four days ago I upgraded from Ubuntu 20.04 with stock ZFS 0.8, which never had any issues in more than 2 years of heavy usage (snapshots of VM zvol devices sent to a backup server every 15 minutes).

After the upgrade, I did a "zpool upgrade -a". Everything seemed OK until this morning, when I found the machine stuck while doing a regular "send".
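
For what it's worth, "zpool upgrade -a" enables every feature flag the new ZFS version supports on all pools; which ones are enabled or active on the affected pool can be listed like this (using the rpool name from the logs below):

# show feature flags and their state (disabled/enabled/active)
zpool get all rpool | grep 'feature@'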

zig commented 2 years ago

Additional information: for two days prior to the crash, syncoid was failing to send snapshots of a particular path; more precisely, the receiver was closing the connection because of an "invalid backup stream" error:

sanoid[1559281]: Sending incremental rpool/ROOT/ubuntu_wrbx68/var/log@autosnap_2022-10-11_01:00:06_hourly ... syncoid_marseille1_XXXovh1_2022-10-11:01:02:04-GMT00:00 (~ 27.7 MB):
sanoid[1562717]: warning: cannot send 'rpool/ROOT/ubuntu_wrbx68/var/log@syncoid_marseille1_XXXovh1_2022-10-11:01:00:08-GMT00:00': Input/output error
zed[1562857]: eid=16005 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=55451:0:0:869
zed[1562860]: eid=16006 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=55451:0:0:12
zed[1562862]: eid=16007 class=authentication pool='rpool' priority=2 err=5 flags=0x80 bookmark=55451:0:0:12
zed[1562864]: eid=16008 class=data pool='rpool' priority=2 err=5 flags=0x80 bookmark=55451:0:0:12
sanoid[1562715]: cannot receive incremental stream: invalid backup stream
sanoid[1562719]: zstd: error 25 : Write error : Broken pipe (cannot write compressed block)
sanoid[1559281]: CRITICAL ERROR:  zfs send  -I 'rpool/ROOT/ubuntu_wrbx68/var/log'@'autosnap_2022-10-11_01:00:06_hourly' 'rpool/ROOT/ubuntu_wrbx68/var/log'@'syncoid_marseille1_XXXovh1_2022-10-11:01:02:04-GMT00:00' | pv -p -t -e -r -b -s 29023416 | zstd -3 | mbuffer  -q -s 128k -m 16M 2>/dev/null | ssh     -S /tmp/syncoid-ovh1backup@marseille1-1665450113 ovh1backup@marseille1 ' mbuffer  -q -s 128k -m 16M 2>/dev/null | zstd -dc | sudo zfs receive   -F '"'"'tank/CRYPT/OVH1/rpool/ROOT/ubuntu_wrbx68/var/log'"'"' 2>&1' failed: 256 at /usr/sbin/syncoid line 817.

Indeed, "zpool status" was reporting one snapshot of this path to be corrupted. I have since then started a scrub and while "zpool status" still reports a permanent error, it's now on an anonymous file and not anymore linked to the snapshot, furthermore syncoid does not fail anymore. Hopefully the interrupted "send" was the trigger of the kernel crash and I won't get it anymore anytime soon.

Derkades commented 1 year ago

Encountered this today during a raw encrypted zfs send/recv using syncoid. File operations still work, but zpool/zfs commands are unresponsive.

Debian bullseye, using ZFS 2.1.9 and Linux kernel 6.0.12 from backports
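
For reference, a raw encrypted replication sends the ciphertext as-is, so neither side needs the encryption key loaded; syncoid does this when passed --sendoptions=w. A minimal sketch with hypothetical dataset and host names:

# -w/--raw sends encrypted records unmodified; -u leaves the
# received dataset unmounted on the backup side
zfs send -w tank/enc@snap | ssh backup 'zfs receive -u backuppool/enc'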

zfs-2.1.9-1~bpo11+1
zfs-kmod-2.1.9-1~bpo11+1

[72358.039249] BUG: kernel NULL pointer dereference, address: 0000000000000000
[72358.039266] #PF: supervisor read access in kernel mode
[72358.039273] #PF: error_code(0x0000) - not-present page
[72358.039280] PGD 0 P4D 0 
[72358.039286] Oops: 0000 [#1] PREEMPT SMP NOPTI
[72358.039293] CPU: 2 PID: 1039602 Comm: zfs Tainted: P           OE      6.0.0-0.deb11.6-amd64 #1  Debian 6.0.12-1~bpo11+1
[72358.039305] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 6042 04/28/2022
[72358.039315] RIP: 0010:zap_lockdir_impl+0x29e/0x750 [zfs]
[72358.039383] Code: 86 d8 00 00 00 00 00 00 00 e8 ce a6 14 db 4d 89 a6 98 00 00 00 e9 48 fe ff ff 48 8b 43 18 b9 a1 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 54 24 20 48 c7 c2 a8 58 69 c1 48 89 44
[72358.039398] RSP: 0018:ffffa970d57e3850 EFLAGS: 00010246
[72358.039406] RAX: 0000000000000000 RBX: ffff96927bb6c780 RCX: 00000000000001a1
[72358.039414] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000128
[72358.039422] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
[72358.039430] R10: 2000000000000000 R11: 0000008000000000 R12: ffff9692a17cb800
[72358.039439] R13: 0000000000000001 R14: 0000000000000000 R15: ffffa970d57e3980
[72358.039448] FS:  00007f218d70a7c0(0000) GS:ffff9697ae480000(0000) knlGS:0000000000000000
[72358.039457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[72358.039464] CR2: 0000000000000000 CR3: 000000084dd04000 CR4: 0000000000750ee0
[72358.039472] PKRU: 55555554
[72358.039477] Call Trace:
[72358.039483]  <TASK>
[72358.039491]  ? zio_wait+0x1fc/0x2b0 [zfs]
[72358.039544]  zap_lockdir+0x90/0xc0 [zfs]
[72358.039584]  ? _raw_spin_unlock+0x15/0x30
[72358.039592]  zap_lookup+0x48/0x100 [zfs]
[72358.039629]  zfs_get_zplprop+0xb3/0x1a0 [zfs]
[72358.039671]  dmu_send_impl+0xe56/0x1450 [zfs]
[72358.039708]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[72358.039716]  ? dsl_dataset_disown+0x90/0x90 [zfs]
[72358.039753]  ? taskq_dispatch_ent+0x101/0x1b0 [spl]
[72358.039763]  ? preempt_count_add+0x70/0xa0
[72358.039771]  ? _raw_spin_lock+0x13/0x40
[72358.039777]  ? _raw_spin_unlock+0x15/0x30
[72358.039783]  ? dbuf_rele_and_unlock+0x20a/0x690 [zfs]
[72358.039818]  dmu_send_obj+0x252/0x340 [zfs]
[72358.039852]  ? ttwu_queue_wakelist+0xbc/0x100
[72358.039862]  zfs_ioc_send+0xe8/0x2d0 [zfs]
[72358.039905]  ? zfs_ioc_send+0x2d0/0x2d0 [zfs]
[72358.039941]  zfsdev_ioctl_common+0x750/0x9b0 [zfs]
[72358.039977]  ? kmalloc_large_node+0x70/0x80
[72358.039985]  ? __kmalloc_node+0x2cc/0x390
[72358.039991]  zfsdev_ioctl+0x53/0xe0 [zfs]
[72358.040032]  __x64_sys_ioctl+0x8b/0xc0
[72358.040039]  do_syscall_64+0x3b/0xc0
[72358.040050]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[72358.040054] RIP: 0033:0x7f218d9f85f7
[72358.040059] Code: 00 00 00 48 8b 05 99 c8 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 c8 0d 00 f7 d8 64 89 01 48
[72358.040074] RSP: 002b:00007ffd2e8f2ba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[72358.040083] RAX: ffffffffffffffda RBX: 0000563e40b1bcf0 RCX: 00007f218d9f85f7
[72358.040091] RDX: 00007ffd2e8f2be0 RSI: 0000000000005a1c RDI: 0000000000000003
[72358.040099] RBP: 00007ffd2e8f65d0 R08: 000000000000000f R09: 0000000000000000
[72358.040407] R10: 0000000000000009 R11: 0000000000000246 R12: 0000563e40ba7910
[72358.040673] R13: 00007ffd2e8f2be0 R14: 0000563e40b1bd00 R15: 0000000000000000
[72358.040917]  </TASK>
[72358.041144] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha sctp ip6_udp_tunnel udp_tunnel veth xt_nat xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter macvtap macvlan vhost_net vhost vhost_iotlb tap tun ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge lz4 lz4_compress zram stp llc overlay zsmalloc intel_rapl_msr intel_rapl_common amdgpu nls_ascii nls_cp437 vfat amd64_edac fat edac_mce_amd snd_usb_audio snd_hda_codec_hdmi gpu_sched drm_buddy snd_hda_intel zfs(POE) snd_usbmidi_lib drm_display_helper snd_intel_dspcfg snd_intel_sdw_acpi snd_rawmidi cec zunicode(POE) zzstd(OE) snd_hda_codec zlua(OE) kvm_amd snd_seq_device zavl(POE) rc_core mc snd_hda_core eeepc_wmi icp(POE) kvm snd_hwdep zcommon(POE) asus_wmi platform_profile battery ccp znvpair(POE) drm_ttm_helper snd_pcm
[72358.041177]  sparse_keymap ledtrig_audio sp5100_tco ttm rapl rfkill spl(OE) joydev wmi_bmof pcspkr mxm_wmi efi_pstore k10temp watchdog snd_timer rng_core drm_kms_helper snd sg soundcore evdev button drm fuse configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_crypt dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear vfio_pci vfio_pci_core irqbypass vfio_virqfd vfio_iommu_type1 vfio hid_generic raid1 usbhid md_mod hid crc32_pclmul crc32c_intel sd_mod ghash_clmulni_intel nvme nvme_core t10_pi xhci_pci ahci libahci xhci_hcd aesni_intel crc64_rocksoft_generic libata crypto_simd e1000e igb cryptd usbcore i2c_algo_bit scsi_mod dca crc64_rocksoft ptp crc_t10dif i2c_piix4 crct10dif_generic crct10dif_pclmul crc64 pps_core crct10dif_common usb_common scsi_common video gpio_amdpt wmi gpio_generic
[72358.043553] CR2: 0000000000000000
[72358.044164] ---[ end trace 0000000000000000 ]---
[72358.741371] RIP: 0010:zap_lockdir_impl+0x29e/0x750 [zfs]
[72358.741371] Code: 86 d8 00 00 00 00 00 00 00 e8 ce a6 14 db 4d 89 a6 98 00 00 00 e9 48 fe ff ff 48 8b 43 18 b9 a1 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 54 24 20 48 c7 c2 a8 58 69 c1 48 89 44
[72358.741371] RSP: 0018:ffffa970d57e3850 EFLAGS: 00010246
[72358.741371] RAX: 0000000000000000 RBX: ffff96927bb6c780 RCX: 00000000000001a1
[72358.741371] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000128
[72358.741371] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
[72358.741371] R10: 2000000000000000 R11: 0000008000000000 R12: ffff9692a17cb800
[72358.741371] R13: 0000000000000001 R14: 0000000000000000 R15: ffffa970d57e3980
[72358.741371] FS:  00007f218d70a7c0(0000) GS:ffff9697ae480000(0000) knlGS:0000000000000000
[72358.741371] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[72358.741371] CR2: 0000000000000000 CR3: 000000084dd04000 CR4: 0000000000750ee0
[72358.741371] PKRU: 55555554
jjakob commented 2 weeks ago

FWIW, I haven't had any more panics since my last report. Currently running zfs v2.1.13-pve1, kernel 6.2.16-19-pve. Still running sanoid taking snapshots every 15 minutes and syncoid once a day.