openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS on Linux null pointer dereference #11679

Closed · segdy closed this issue 1 month ago

segdy commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version 10 buster
Linux Kernel 4.19.0-14 (SMP Debian 4.19.171-2)
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

When I start sending raw ZFS snapshots to a different system, my Linux system (4.19.0-14-amd64) starts to hang. I can ping it, and I can still run a few commands (such as dmesg), but most commands hang (including zfs, zpool, htop, ps, ...). Effectively the entire system hangs.

Dmesg shows the following entries at the time of the occurrence:

[ 2293.134071] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 2293.149707] PGD 0 P4D 0
[ 2293.154752] Oops: 0000 [#1] SMP PTI
[ 2293.161701] CPU: 1 PID: 12576 Comm: receive_writer Tainted: P           OE     4.19.0-14-amd64 #1 Debian 4.19.171-2
[ 2293.182517] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015
[ 2293.196819] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2293.205865] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2293.243325] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2293.253741] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2293.267974] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2293.282205] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2293.296434] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2293.310664] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2293.324900] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2293.341053] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2293.352510] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0
[ 2293.366743] Call Trace:
[ 2293.371704]  abd_borrow_buf+0x12/0x40 [zfs]
[ 2293.380104]  abd_borrow_buf_copy+0x28/0x70 [zfs]
[ 2293.389377]  zio_crypt_copy_dnode_bonus+0x36/0x130 [zfs]
[ 2293.400041]  arc_buf_fill+0x3ff/0xb60 [zfs]
[ 2293.408449]  ? zfs_btree_add_idx+0xd0/0x200 [zfs]
[ 2293.417889]  arc_untransform+0x1c/0x70 [zfs]
[ 2293.426461]  dbuf_read_verify_dnode_crypt+0xec/0x160 [zfs]
[ 2293.437466]  dbuf_read_impl.constprop.29+0x4ad/0x6b0 [zfs]
[ 2293.448423]  ? kmem_cache_alloc+0x167/0x1d0
[ 2293.456776]  ? __cv_init+0x3d/0x60 [spl]
[ 2293.464671]  ? dbuf_cons+0xa7/0xc0 [zfs]
[ 2293.472497]  ? spl_kmem_cache_alloc+0x108/0x7a0 [spl]
[ 2293.482583]  ? _cond_resched+0x15/0x30
[ 2293.490071]  ? _cond_resched+0x15/0x30
[ 2293.497542]  ? mutex_lock+0xe/0x30
[ 2293.504402]  ? aggsum_add+0x17a/0x190 [zfs]
[ 2293.512810]  dbuf_read+0x1b2/0x520 [zfs]
[ 2293.520672]  ? dnode_hold_impl+0x350/0xc20 [zfs]
[ 2293.529904]  dmu_bonus_hold_by_dnode+0x126/0x1a0 [zfs]
[ 2293.540186]  receive_object+0x403/0xc70 [zfs]
[ 2293.548906]  ? receive_freeobjects.isra.10+0x9d/0x120 [zfs]
[ 2293.560049]  receive_writer_thread+0x279/0xa00 [zfs]
[ 2293.569962]  ? set_curr_task_fair+0x26/0x50
[ 2293.578319]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.589793]  ? __thread_exit+0x20/0x20 [spl]
[ 2293.598317]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.608410]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.619882]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.629609]  kthread+0x112/0x130
[ 2293.636053]  ? kthread_bind+0x30/0x30
[ 2293.643351]  ret_from_fork+0x35/0x40
[ 2293.650473] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter veth pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) nf_tables nfnetlink vboxdrv(OE) bridge binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel kvm irqbypass crct10dif_pclmul ib_iser joydev crc32_pclmul rdma_cm ghash_clmulni_intel iw_cm intel_cstate ib_cm intel_uncore ib_core intel_rapl_perf configfs ipmi_si sg ipmi_devintf iTCO_wdt iTCO_vendor_support pcc_cpufreq intel_pch_thermal iscsi_tcp ipmi_msghandler libiscsi_tcp libiscsi evdev scsi_transport_iscsi pcspkr tun nfsd auth_rpcgss nfs_acl lockd grace sunrpc lm85 dme1737 hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack
[ 2293.793008]  nf_defrag_ipv6 nf_defrag_ipv4 fuse loop 8021q garp stp mrp llc ecryptfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 raid10 uas usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic raid1 usbhid hid md_mod sd_mod ast ahci ttm libahci libata drm_kms_helper drm crc32c_intel igb i2c_i801 dca i2c_algo_bit scsi_mod lpc_ich mfd_core e1000e xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common thermal fan video button
[ 2293.895677] CR2: 0000000000000000
[ 2293.902280] ---[ end trace 164c64ca87be80af ]---
[ 2294.020926] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2294.029975] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2294.067433] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2294.077850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2294.092082] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2294.106312] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2294.120542] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2294.134774] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2294.149006] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2294.165144] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2294.176600] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0

Interestingly, the transfer itself continues happily, but everything else on the system hangs. The only way to recover is to reset the machine (since not even reboot works).
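
A hedged aside for anyone hitting the same hang: if SysRq is enabled, the kernel can usually still dump the stacks of blocked tasks even when most userland commands hang, which is often enough to see where receive_writer or a zfs command is stuck. A minimal sketch (the PID placeholder is hypothetical):

# assumes the 'w' trigger is allowed (sysctl -w kernel.sysrq=1)
echo w > /proc/sysrq-trigger          # dump stacks of uninterruptible (blocked) tasks to dmesg
dmesg | tail -n 200                   # the D-state stacks appear here
cat /proc/<pid-of-hung-command>/stack # kernel stack of one specific hung process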

Describe how to reproduce the problem

It's a tough one. It seems to me that the issue might be load related in some sense, since it only occurs when I have two zfs sends (via syncoid) running in parallel that involve encrypted datasets.

Transfer 1

The first one sends datasets from an unencrypted dataset into an encrypted one (I am migrating to encryption).

I use syncoid with the command:

syncoid -r --skip-parent --no-sync-snap zpradix1imain/sys/vz zpradix1imain/sys/vz_enc

This translates into:

zfs send -I 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_hourly-2021-03-02-1917' 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_frequent-2021-03-02-1932' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 16392592 | zfs receive -s -F 'zpradix1imain/sys/vz_enc/main'

Transfer 2

The second one transfers data raw from an encrypted dataset to a secondary server. The syncoid command is:

syncoid -r --skip-parent --no-sync-snap --sendoptions=w --exclude=zfs-auto-snap_hourly --exclude=zfs-auto-snap_frequent zpradix1imain/data root@192.168.200.12:zpzetta/radix/data

This translates into:

zfs send -w 'zpradix1imain/data/home'@'vicari-prev' | pv -s 179222507064 | lzop | mbuffer -q -s 128k -m 16M 2>/dev/null | ssh ...

In summary: the hang occurs while these two syncoid-driven transfers, both involving encrypted datasets, run in parallel.

scratchings commented 1 year ago

Sorry, no improvement at my end - lasted about 4 days:

Feb 4 09:15:44 fs3 kernel: VERIFY0(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Feb 4 09:15:44 fs3 kernel: PANIC at dmu_recv.c:2083:receive_object()
Feb 4 09:15:44 fs3 kernel: Showing stack for process 1438226
Feb 4 09:15:44 fs3 kernel: CPU: 5 PID: 1438226 Comm: receive_writer Kdump: loaded Tainted: P OE --------- --- 5.14.0-162.12.1.el9_1.0.2.x86_64 #1
Feb 4 09:15:44 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Feb 4 09:15:44 fs3 kernel: Call Trace:
Feb 4 09:15:44 fs3 kernel: dump_stack_lvl+0x34/0x48
Feb 4 09:15:44 fs3 kernel: spl_panic+0xd1/0xe9 [spl]
Feb 4 09:15:44 fs3 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Feb 4 09:15:44 fs3 kernel: ? mutex_lock+0xe/0x30
Feb 4 09:15:44 fs3 kernel: ? aggsum_add+0x173/0x190 [zfs]
Feb 4 09:15:44 fs3 kernel: ? mutex_lock+0xe/0x30
Feb 4 09:15:44 fs3 kernel: ? aggsum_add+0x173/0x190 [zfs]
Feb 4 09:15:44 fs3 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Feb 4 09:15:44 fs3 kernel: ? dbuf_rele_and_unlock+0x312/0x4d0 [zfs]
Feb 4 09:15:44 fs3 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Feb 4 09:15:44 fs3 kernel: receive_object+0x8b9/0x970 [zfs]
Feb 4 09:15:44 fs3 kernel: ? receive_writer_thread+0x91/0x1c0 [zfs]
Feb 4 09:15:44 fs3 kernel: receive_process_record+0x13f/0x330 [zfs]
Feb 4 09:15:44 fs3 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Feb 4 09:15:44 fs3 kernel: ? receive_process_record+0x330/0x330 [zfs]
Feb 4 09:15:44 fs3 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Feb 4 09:15:44 fs3 kernel: ? spl_taskq_fini+0x80/0x80 [spl]
Feb 4 09:15:44 fs3 kernel: kthread+0x146/0x170
Feb 4 09:15:44 fs3 kernel: ? set_kthread_struct+0x50/0x50
Feb 4 09:15:44 fs3 kernel: ret_from_fork+0x1f/0x30

rincebrain commented 1 year ago

That backtrace is weird and I'm not sure if I trust it.

That said, I'm tempted to suggest the people reporting issues with dmu_bonus_hold_by_dnode go open a separate bug, since while they're both on receive, the failure seems different. (It still might be the same root cause, I can't say until both are fixed, but looking at it, getting back an IO error and tripping an assert, while it's bad, is definitely a different failure...)

siilike commented 1 year ago

Just happened with zfs-2.1.9-1~bpo11+1:

[ 6303.593120] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 6303.593170] #PF: supervisor read access in kernel mode
[ 6303.593211] #PF: error_code(0x0000) - not-present page
[ 6303.593251] PGD 11653c067 P4D 11653c067 PUD 0 
[ 6303.593296] Oops: 0000 [#1] SMP NOPTI
[ 6303.593338] CPU: 1 PID: 42118 Comm: receive_writer Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[ 6303.593387] Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5603 10/14/2020
[ 6303.593560] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[ 6303.593607] Code: 45 42 11 00 0f 1f 44 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 25 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[ 6303.593669] RSP: 0018:ffffacf280b2b9b0 EFLAGS: 00010246
[ 6303.593713] RAX: 0000000000000000 RBX: ffff9e8bcae59b00 RCX: 0000000000000000
[ 6303.593756] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 6303.593799] RBP: 0000000000000000 R08: 000005bbab6d9501 R09: 0000000000000001
[ 6303.593843] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000010
[ 6303.593885] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[ 6303.593929] FS:  0000000000000000(0000) GS:ffff9e8d1e640000(0000) knlGS:0000000000000000
[ 6303.593977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6303.594021] CR2: 0000000000000000 CR3: 000000012efd4000 CR4: 00000000001506e0
[ 6303.594068] Call Trace:
[ 6303.594247]  zio_crypt_copy_dnode_bonus+0x2e/0x120 [zfs]
[ 6303.594344]  arc_buf_fill+0x3f9/0xce0 [zfs]
[ 6303.594430]  arc_untransform+0x1d/0x80 [zfs]
[ 6303.594515]  dbuf_read_verify_dnode_crypt+0xf2/0x160 [zfs]
[ 6303.594604]  dbuf_read_impl.constprop.0+0x2c4/0x6e0 [zfs]
[ 6303.594650]  ? _cond_resched+0x16/0x50
[ 6303.594743]  ? dbuf_create+0x43c/0x610 [zfs]
[ 6303.594824]  dbuf_read+0xe2/0x5d0 [zfs]
[ 6303.594907]  dmu_tx_check_ioerr+0x64/0xd0 [zfs]
[ 6303.594992]  dmu_tx_hold_free_impl+0x12f/0x250 [zfs]
[ 6303.595074]  dmu_free_long_range+0x242/0x4d0 [zfs]
[ 6303.595159]  dmu_free_long_object+0x22/0xd0 [zfs]
[ 6303.595240]  receive_freeobjects+0x82/0x100 [zfs]
[ 6303.595324]  receive_writer_thread+0x565/0xad0 [zfs]
[ 6303.595376]  ? thread_generic_wrapper+0x62/0x80 [spl]
[ 6303.595418]  ? kfree+0x410/0x490
[ 6303.595504]  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[ 6303.595553]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 6303.595600]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 6303.595648]  ? __thread_exit+0x20/0x20 [spl]
[ 6303.595693]  kthread+0x11b/0x140
[ 6303.595735]  ? __kthread_bind_mask+0x60/0x60
[ 6303.595779]  ret_from_fork+0x22/0x30
[ 6303.595823] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha ipmi_devintf ipmi_msghandler amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio edac_mce_amd snd_hda_codec_hdmi kvm_amd snd_hda_intel ccp snd_intel_dspcfg rng_core soundwire_intel soundwire_generic_allocation snd_soc_core kvm snd_compress soundwire_cadence snd_usb_audio snd_hda_codec eeepc_wmi snd_hda_core irqbypass asus_wmi gpu_sched battery ttm snd_usbmidi_lib sparse_keymap k10temp snd_hwdep rfkill wmi_bmof snd_rawmidi soundwire_bus sp5100_tco pcspkr snd_seq_device mc fam15h_power watchdog snd_pcm drm_kms_helper snd_timer snd cec soundcore i2c_algo_bit acpi_cpufreq button evdev sg nfsd auth_rpcgss nfs_acl lockd grace parport_pc ppdev lp parport sunrpc fuse drm configfs ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_crypt dm_mod
[ 6303.595883]  hid_generic uas usbhid usb_storage hid crc32_pclmul crc32c_intel sd_mod xhci_pci ghash_clmulni_intel ahci nvme libahci r8169 xhci_hcd mpt3sas aesni_intel libata libaes realtek crypto_simd mdio_devres nvme_core cryptd libphy glue_helper raid_class usbcore i2c_piix4 scsi_transport_sas scsi_mod t10_pi crc_t10dif crct10dif_generic usb_common crct10dif_pclmul crct10dif_common wmi gpio_amdpt video gpio_generic
[ 6303.596185] CR2: 0000000000000000
[ 6303.596227] ---[ end trace 93dfc94348774efc ]---
[ 6303.596346] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[ 6303.596392] Code: 45 42 11 00 0f 1f 44 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 25 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[ 6303.596460] RSP: 0018:ffffacf280b2b9b0 EFLAGS: 00010246
[ 6303.596505] RAX: 0000000000000000 RBX: ffff9e8bcae59b00 RCX: 0000000000000000
[ 6303.596549] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 6303.596591] RBP: 0000000000000000 R08: 000005bbab6d9501 R09: 0000000000000001
[ 6303.596632] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000010
[ 6303.596673] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[ 6303.596715] FS:  0000000000000000(0000) GS:ffff9e8d1e640000(0000) knlGS:0000000000000000
[ 6303.596759] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6303.596798] CR2: 0000000000000000 CR3: 000000012efd4000 CR4: 00000000001506e0

Unlikely to be related, but was receiving a dataset affected by #12014.

chenxiaolong commented 1 year ago

I think I might've run into this issue too. In my case, I was migrating between pools on the same machine by doing a recursive raw send on the encryptionroot dataset:

# 1. [Succeeded] Migrate datasets to temporary pool
zfs send -R -L -h -w satapool0/enc@migrate | pv | zfs recv -F -d -u temp0 -x recordsize -x compression -x atime -x secondarycache -x relatime

# 2. [Succeeded] Recreated satapool0 with different topology

# 3. [Failed with null pointer dereference] Migrate datasets back to the recreated satapool0
zfs send -R -L -h -w temp0/enc@migrate | pv | zfs recv -F -d -u satapool0 -x recordsize -x compression -x atime -x secondarycache -x relatime

On the last command, it sent 9.28TiB out of 61TiB before the crash.

Distro: Fedora 37
ZFS: 2.1.9-1.fc37.noarch

Stack trace from dmesg:

[472317.292466] BUG: kernel NULL pointer dereference, address: 0000000000000030
[472317.300401] #PF: supervisor read access in kernel mode
[472317.306279] #PF: error_code(0x0000) - not-present page
[472317.312151] PGD 0 P4D 0
[472317.315095] Oops: 0000 [#1] PREEMPT SMP NOPTI
[472317.320088] CPU: 8 PID: 250216 Comm: zfs Tainted: G        W  O       6.2.8-200.fc37.x86_64 #1
[472317.329862] Hardware name: Supermicro Super Server/X13SAE-F, BIOS 2.0 10/17/2022
[472317.338262] RIP: 0010:dmu_dump_write+0x31a/0x3d0 [zfs]
[472317.344216] Code: 4c 24 14 0f 85 b8 00 00 00 c7 45 54 00 00 00 00 48 8b 4d 00 e9 4a fd ff ff 31 c0 45 39 cd 0f 95 c0 44 09 c0 0f 84 ce fd ff ff <48> 8b 04 25 30 00 00 00 4d 63 e9 45 85 c0 0f 85 38 ff ff ff 48 c1
[472317.365405] RSP: 0018:ffffb43080fcf7f8 EFLAGS: 00010206
[472317.371376] RAX: 0000000001000000 RBX: ffff998038b0fc00 RCX: 0000000000000000
[472317.379491] RDX: 0000000000010080 RSI: 0000000000000000 RDI: ffff998038b0fd38
[472317.387607] RBP: ffffb43080fcf9c0 R08: 0000000001000000 R09: 0000000000020000
[472317.395720] R10: 0000000000000013 R11: 0000000000020000 R12: 0000000000000000
[472317.403833] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000010080
[472317.411946] FS:  00007fe4ad07e8c0(0000) GS:ffff998a7f800000(0000) knlGS:0000000000000000
[472317.421133] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[472317.427685] CR2: 0000000000000030 CR3: 00000005aa50c005 CR4: 0000000000772ee0
[472317.435800] PKRU: 55555554
[472317.438939] Call Trace:
[472317.441785]  <TASK>
[472317.444240]  do_dump+0x637/0x950 [zfs]
[472317.448620]  dmu_send_impl+0xd30/0x1440 [zfs]
[472317.453675]  ? __pfx_dsl_dataset_evict_async+0x10/0x10 [zfs]
[472317.460201]  ? preempt_count_add+0x6a/0xa0
[472317.464918]  ? _raw_spin_lock+0x13/0x40
[472317.469326]  ? dbuf_rele_and_unlock+0xf3/0x770 [zfs]
[472317.475063]  ? percpu_counter_add_batch+0x53/0xc0
[472317.480448]  dmu_send_obj+0x25c/0x350 [zfs]
[472317.485309]  zfs_ioc_send+0xf3/0x300 [zfs]
[472317.490096]  ? __pfx_dump_bytes+0x10/0x10 [zfs]
[472317.495366]  zfsdev_ioctl_common+0x89e/0x9e0 [zfs]
[472317.500928]  ? spl_kmem_zalloc+0xab/0x110 [spl]
[472317.506133]  ? __kmalloc_large_node+0xb5/0x140
[472317.511217]  zfsdev_ioctl+0x4f/0xd0 [zfs]
[472317.515896]  __x64_sys_ioctl+0x8d/0xd0
[472317.520208]  do_syscall_64+0x58/0x80
[472317.524322]  ? syscall_exit_to_user_mode+0x17/0x40
[472317.529802]  ? do_syscall_64+0x67/0x80
[472317.534112]  ? do_syscall_64+0x67/0x80
[472317.538421]  ? user_return_notifier_unregister+0x3c/0x70
[472317.544669]  ? fpregs_restore_userregs+0x56/0xe0
[472317.550140]  ? exit_to_user_mode_prepare+0x18f/0x1f0
[472317.555999]  ? syscall_exit_to_user_mode+0x17/0x40
[472317.561661]  ? do_syscall_64+0x67/0x80
[472317.566149]  ? syscall_exit_to_user_mode+0x17/0x40
[472317.571803]  ? do_syscall_64+0x67/0x80
[472317.576284]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[472317.582228] RIP: 0033:0x7fe4ad414d6f
[472317.586513] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[472317.608050] RSP: 002b:00007fff823b1280 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[472317.616832] RAX: ffffffffffffffda RBX: 0000561fb6a07f90 RCX: 00007fe4ad414d6f
[472317.625125] RDX: 00007fff823b1710 RSI: 0000000000005a1c RDI: 0000000000000003
[472317.633415] RBP: 00007fff823b4d00 R08: 0000561fb694b770 R09: 0000000000000000
[472317.641701] R10: 00007fff823b9bb0 R11: 0000000000000246 R12: 0000561fb683f330
[472317.649982] R13: 00007fff823b1710 R14: 0000561fb6a07fa0 R15: 0000000000000000
[472317.658251]  </TASK>
[472317.660966] Modules linked in: uas usb_storage vhost_net vhost vhost_iotlb tap tun xt_MASQUERADE xt_conntrack xt_CHECKSUM ipt_REJECT nft_compat nf_nat_tftp nf_conntrack_tftp rpcrdma rdma_cm iw_cm ib_cm 8021q garp mrp bridge stp llc rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_hda_codec_hdmi snd_sof snd_sof_utils snd_soc_hdac_hda intel_rapl_msr snd_hda_ext_core intel_rapl_common snd_soc_acpi_intel_match snd_soc_acpi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek soundwire_bus snd_hda_codec_generic snd_soc_core ledtrig_audio coretemp snd_compress iTCO_wdt kvm_intel ac97_bus snd_pcm_dmaengine
[472317.661001]  snd_hda_intel intel_pmc_bxt pmt_telemetry vfat snd_intel_dspcfg ipmi_ssif mei_hdcp mei_wdt mei_pxp iTCO_vendor_support pmt_class kvm zfs(O) fat irqbypass rapl snd_intel_sdw_acpi snd_hda_codec zunicode(O) snd_hda_core intel_cstate mlx5_ib zzstd(O) snd_hwdep snd_seq snd_seq_device zlua(O) snd_pcm ib_uverbs zcommon(O) wmi_bmof znvpair(O) snd_timer intel_uncore zavl(O) snd i2c_i801 soundcore i2c_smbus ib_core icp(O) mei_me acpi_ipmi mei joydev ipmi_si idma64 spl(O) intel_vsec ipmi_devintf ipmi_msghandler acpi_tad acpi_pad nfsd auth_rpcgss nfs_acl lockd grace tcp_bbr sunrpc fuse loop zram i915 rndis_host cdc_ether usbnet crct10dif_pclmul dm_crypt mii mlx5_core drm_buddy crc32_pclmul nvme crc32c_intel polyval_clmulni polyval_generic mlxfw e1000e mpt3sas drm_display_helper nvme_core ghash_clmulni_intel sha512_ssse3 tls ast cec ucsi_acpi typec_ucsi ttm raid_class psample scsi_transport_sas nvme_common typec pci_hyperv_intf video wmi pinctrl_alderlake scsi_dh_rdac scsi_dh_emc
[472317.759484]  scsi_dh_alua dm_multipath
[472317.863404] CR2: 0000000000000030
[472317.867470] ---[ end trace 0000000000000000 ]---
[472317.944036] RIP: 0010:dmu_dump_write+0x31a/0x3d0 [zfs]
[472317.950295] Code: 4c 24 14 0f 85 b8 00 00 00 c7 45 54 00 00 00 00 48 8b 4d 00 e9 4a fd ff ff 31 c0 45 39 cd 0f 95 c0 44 09 c0 0f 84 ce fd ff ff <48> 8b 04 25 30 00 00 00 4d 63 e9 45 85 c0 0f 85 38 ff ff ff 48 c1
[472317.971991] RSP: 0018:ffffb43080fcf7f8 EFLAGS: 00010206
[472317.978220] RAX: 0000000001000000 RBX: ffff998038b0fc00 RCX: 0000000000000000
[472317.986597] RDX: 0000000000010080 RSI: 0000000000000000 RDI: ffff998038b0fd38
[472317.994972] RBP: ffffb43080fcf9c0 R08: 0000000001000000 R09: 0000000000020000
[472318.003349] R10: 0000000000000013 R11: 0000000000020000 R12: 0000000000000000
[472318.011728] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000010080
[472318.020105] FS:  00007fe4ad07e8c0(0000) GS:ffff998a7f800000(0000) knlGS:0000000000000000
[472318.029561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[472318.036383] CR2: 0000000000000030 CR3: 00000005aa50c005 CR4: 0000000000772ee0
[472318.044767] PKRU: 55555554
[472318.048178] note: zfs[250216] exited with irqs disabled

jittygitty commented 1 year ago

Has anyone found a way to "reproduce" this on v0.8.3 to v0.8.5? And has anyone ever triggered it on v0.8.0 to v0.8.2?

amotin commented 6 months ago

I expect this is already fixed by #16104 in master. It is likely too fresh for the upcoming 2.2.4, but it should appear in the following 2.2.5 release.

vaclavskala commented 4 months ago

With ZFS 2.2.4 and patch #16104 applied, I still get kernel panics with this stack trace:

[Mon Jun 24 08:48:45 2024] VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[Mon Jun 24 08:48:45 2024] PANIC at dmu_recv.c:2093:receive_object()
[Mon Jun 24 08:48:45 2024] Showing stack for process 3446320
[Mon Jun 24 08:48:45 2024] CPU: 35 PID: 3446320 Comm: receive_writer Not tainted 6.1.94-vsh0zfs224 #3
[Mon Jun 24 08:48:45 2024] Hardware name: Supermicro X10DRi/X10DRi, BIOS 3.4a 08/16/2021
[Mon Jun 24 08:48:45 2024] Call Trace:
[Mon Jun 24 08:48:45 2024]  <TASK>
[Mon Jun 24 08:48:45 2024]  dump_stack_lvl+0x45/0x5e
[Mon Jun 24 08:48:45 2024]  spl_panic+0xd1/0xe9 [spl]
[Mon Jun 24 08:48:45 2024]  ? spl_kmem_cache_free+0x127/0x1d0 [spl]
[Mon Jun 24 08:48:45 2024]  ? mutex_lock+0xe/0x30
[Mon Jun 24 08:48:45 2024]  ? aggsum_add+0x173/0x190 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dnode_evict_bonus+0x7d/0xa0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dbuf_rele_and_unlock+0x312/0x4d0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
[Mon Jun 24 08:48:45 2024]  receive_object+0xae1/0xce0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? dmu_object_next+0xe7/0x160 [zfs]
[Mon Jun 24 08:48:45 2024]  ? receive_freeobjects+0xa8/0x110 [zfs]
[Mon Jun 24 08:48:45 2024]  receive_writer_thread+0x313/0xb20 [zfs]
[Mon Jun 24 08:48:45 2024]  ? __slab_free+0x9e/0x2b0
[Mon Jun 24 08:48:45 2024]  ? set_next_task_fair+0x2d/0xd0
[Mon Jun 24 08:48:45 2024]  ? receive_process_write_record+0x2d0/0x2d0 [zfs]
[Mon Jun 24 08:48:45 2024]  ? spl_taskq_fini+0x80/0x80 [spl]
[Mon Jun 24 08:48:45 2024]  ? thread_generic_wrapper+0x56/0x70 [spl]
[Mon Jun 24 08:48:45 2024]  thread_generic_wrapper+0x56/0x70 [spl]
[Mon Jun 24 08:48:45 2024]  kthread+0xd6/0x100
[Mon Jun 24 08:48:45 2024]  ? kthread_complete_and_exit+0x20/0x20
[Mon Jun 24 08:48:45 2024]  ret_from_fork+0x1f/0x30
[Mon Jun 24 08:48:45 2024]  </TASK>

amano-kenji commented 1 month ago

Is this fixed with 2.2.6?

scratchings commented 1 month ago

I updated to 2.2.6 a few days ago, so haven't had enough time to be confident about this. I'd been running a patched earlier release with significant success, but had also stopped daisy-chained sending to a third host, so it's not clear to me if this is the main reason for the vastly improved stability of the system. I'm currently starting the process of migrating my backups to a new platform which will be heavily stressing send/receive so we'll see how it goes.

clhedrick commented 1 month ago

If it was fixed, it was presumably in 2.2.5 by https://github.com/openzfs/zfs/pull/16104, which replaced https://github.com/openzfs/zfs/pull/15538.


scratchings commented 1 month ago

Indeed, but my box was running 2.2.4 with the proposed patches, so never tested 2.2.5 as released.

scratchings commented 1 month ago

Sorry to report that the machine running 2.2.6 spontaneously rebooted overnight. I can only assume a kernel panic, but as nothing has been logged I can't be certain that ZFS was the cause.

amano-kenji commented 1 month ago

Usually a kernel panic freezes a system and requires a manual reboot.

rincebrain commented 1 month ago

You can set a tunable to trigger triple fault on panic, though I don't think any distros do by default, to my knowledge.

...well, sorry, that was imprecise and inaccurate.

A panic will generally reboot the machine outright by default on most distros (possibly with a round trip through kexec+kdump). Oops and BUG_ON don't, out of the box.
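
For reference, a hedged sketch of the tunables involved; the values are illustrative, and distro defaults vary:

# escalate an Oops/BUG to a full panic, and reboot 10 seconds after a panic
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.panic=10
# persist across reboots
printf 'kernel.panic_on_oops = 1\nkernel.panic = 10\n' > /etc/sysctl.d/90-panic.conf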

scratchings commented 1 month ago

Indeed. I was surprised to see it stuck at the LUKS (boot drive) passphrase prompt with a clearly faulty network connection as Clevis hadn’t auto-unlocked it. It then needed a further reboot before it came up, so any lingering previous boot info was long gone before I could interact with it.

The system logs just stop with nothing in them.

kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.watchdog = 1
kernel.watchdog_thresh = 10

So I suspect the watchdog kicked in.

rincebrain commented 1 month ago

I'd suggest configuring kdump if you want to figure out more (possibly over the network, if saving a kernel core dump to local disk is fraught for you), though perhaps just capturing the console output from when it broke would be sufficient.
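
A rough sketch of what a network kdump setup could look like on a RHEL-style system; the hostname, user, key path, and collector options are assumptions to check against kdump.conf(5) and kdumpctl on the actual distro:

# send the vmcore to a remote host over ssh instead of local /var/crash
cat >> /etc/kdump.conf <<'EOF'
ssh kdump@dumphost.example.com
sshkey /root/.ssh/kdump_id_rsa
core_collector makedumpfile -F -l --message-level 7 -d 31
EOF
kdumpctl propagate            # install the ssh key on the dump target
systemctl enable --now kdump  # rebuild the kdump initramfs and arm it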

scratchings commented 1 month ago

It definitely panicked a few minutes ago.

Message from syslogd@backup01 at Sep 23 10:00:16 ... kernel:VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@backup01 at Sep 23 10:00:16 ... kernel:PANIC at dmu_recv.c:2093:receive_object()
[duncan@backup01 ~]$ client_loop: send disconnect: Broken pipe

It had managed to internally (old to new pool) send/receive over 6TB of data over the weekend, a process that had completed. Looking at logs there were a couple of remote receives launched at the time of the crash - looks like I need to re-instate my Syncoid concurrency prevention wrapper.

Section from /var/log/messages attached. Although kdump is configured it didn't write anything to /var/crash :-(

zfs_crash.log

scratchings commented 1 month ago

Today's errant behaviour is a stuck 'zfs receive' - 100% CPU utilisation, no obvious progress (dataset size is static) and receive is unkillable.

amano-kenji commented 1 month ago

Until zfs native encrypted backups become stable, you can use restic to make encrypted remote snapshots of ZFS snapshots. Restic snapshots can be encrypted and incremental, and you can delete any restic snapshot without losing data. Restic has its own deduplicated blocks.

You can use https://zfs.rent or https://www.rsync.net/ with restic right now.

For now, LUKS is faster than ZFS native encryption, so LUKS, ZFS, and restic together are the best option right now.
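
As a rough illustration of that approach (the repository URL, dataset path, and snapshot name below are made up): ZFS snapshots are exposed read-only under the dataset's .zfs/snapshot directory, so restic can back up a stable snapshot rather than the live filesystem.

# one-time repository setup on the remote side
restic -r sftp:user@rsync.net:zfs-restic init
# back up the contents of a specific ZFS snapshot
restic -r sftp:user@rsync.net:zfs-restic backup /tank/data/.zfs/snapshot/autosnap_2024-10-03_hourly
# drop old restic snapshots without affecting newer ones
restic -r sftp:user@rsync.net:zfs-restic forget --keep-daily 7 --keep-weekly 4 --prune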

amano-kenji commented 1 month ago

@scratchings If you remove the dust from your computer case, does the issue go away? Sometimes dust in the case causes errors in RAM or the GPU.

My GPU used to freeze until I removed the dust from my computer case.

scratchings commented 1 month ago

My host is server grade with ECC RAM (no reported bit corrections), sits in a clean server room, and is only a few months old, so I very much doubt this.

amotin commented 1 month ago

As I have written above, I believe the original panic in this report and its later reproductions should be fixed in 2.2.5 and up. The following reports seem to be different, so let's close this and open a new issue if needed.

amotin commented 1 month ago

Looking at logs there were a couple of remote receives launched at the time of the crash - looks like I need to re-instate my Syncoid concurrency prevention wrapper.

Section from /var/log/messages attached.

@scratchings The panic definitely happened during receive. What is that "Syncoid concurrency prevention wrapper", what does it prevent, and why was it needed? Is there any evidence that concurrent receives are bad? I suppose it does not try to concurrently receive the same dataset or something like that?

scratchings commented 1 month ago

It's a python script that uses lock files to ensure that only one syncoid can run on the instigating host, and also checks that syncoid isn't running on the sending host (to prevent crashes on that host when it's being used in a daisy-chain fashion).
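
For what it's worth, the same serialization can be sketched in a few lines of shell with flock(1); the dataset names, host, and remote check below are illustrative, not the actual script:

#!/bin/sh
# serialize syncoid runs on this host, and skip if the sending host is already busy
LOCK=/run/lock/syncoid.lock
exec 9>"$LOCK" || exit 1
flock -w 3600 9 || { echo "timed out waiting for local syncoid lock" >&2; exit 1; }

# bail out if a syncoid is already running on the source host (daisy-chain protection)
if ssh backup-source.example.com pgrep -x syncoid >/dev/null; then
    echo "syncoid already running on source host, skipping this run" >&2
    exit 0
fi

syncoid -r --no-sync-snap backup-source.example.com:pool/data localpool/backups/data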

I've reached the point with this particular host (which is replacing the original host that first suffered this issue, and which has always had ZFS-related crashes, no matter which version of ZFS it used) where I'm going to give up and convert it to LUKS + plain ZFS, as I was having to send/receive onto a new JBOD anyway.

As someone mentioned historic pool migration, the pool I'm migrating from is from the 0.8 RC era - old enough to have originally suffered from Errata #4 (but the dataset in question was send/received to a fresh dataset to clear this years ago).

We will still have a few hosts running native encryption, one of which is running 2.2.6, and those that take snapshots regularly experience #12014. At least these hosts have never crashed, I've only ever seen this on the receiving side.

Even post LUKSing I'll still be running encryption for the month+ it will take to transfer the >400TB of data, so if I get further crashes I'll report them.

Perhaps of interest: when I got the first zfs process hang, it was a receive on the old pool (aes-256-ccm, lz4). There was a recursive receive in progress from that pool to the new pool (aes-256-gcm + recompression with zstd), and this continued to operate for several more days (10+TB transferred across multiple datasets) until it too got a stuck zfs process. I guess if this is a race condition, the higher performance of gcm vs ccm encryption may be helping somewhat.

amotin commented 1 month ago

@scratchings You still haven't really answered my questions. While working on https://github.com/openzfs/zfs/pull/16104 I noticed that the ZFS dbuf layer is generally unable to handle both encrypted and non-encrypted data for the same dataset at the same time. It should not happen normally, since a dataset should not be accessible until an encrypted receive is fully completed and the TXG is synced, but if we suppose there are some races, then we may see all kinds of problems, and I want to know about it. That is why I want to know what sort of concurrency creates the problem for you and what sort of serialization fixes it. It would be a huge help if we could reproduce it manually somehow, but we are rarely that lucky.

Speaking of the mentioned "recompression" as part of encrypted dataset replication: that would be possible only if the replication is unencrypted. Is that what we are talking about in general, or was that only this specific case? Unencrypted replication of an encrypted dataset? Because it might be a very different case from the encrypted replication I was thinking about, with very different dragons.

scratchings commented 1 month ago

This is a host that backs up several (ca 10) hosts over both local 10Gbit and remote 1Gbit SSH connections. The most stable environment has been when this host pulls backups from the remote hosts, so that it can ensure only one receive is ever happening across the whole pool. I wrote a Python script to wrap the syncoid calls using lock files, so that the cron tasks can launch on schedule and then wait until nothing else holds a lock on the file. We also pushed backups to a remote location (slower network) and checked for syncoids running there (it does some backups of hosts local to it). It's this latter process that seems to have the biggest impact on stability. Our remote push destination failed so regularly that the common snapshots were lost and sends stopped; at that point the stability of the primary backup location improved markedly.

I've now embarked on a replacement of this primary backup host (new server hardware and a new JBOD). As part of this I briefly switched back to client-instigated receives (based on the optimism that 2.2.5 had fixed the crashes), i.e. no central control of when this might happen: cron tasks on the clients launch syncoid in push mode, with no exclusivity locks. This obviously has benefits for ensuring the client is in a 'good' state with respect to service backups, and it means other backups aren't held up when a large number of blocks needs to be transferred. This made things go bad.

As to the 'recompression', this is done by overriding the compression setting on receive, e.g. for the internal transfer to the new pool:

syncoid -r --no-sync-snap --mbuffer-size=1G --compress=none --no-privilege-elevation --recvoptions="u o compression=zstd" old_pool/folder new_pool/folder

When we see 'corrupt' snapshots on the client end, this is 'discovered' during the send process; syncoid will report:

warning: cannot send 'pool/ds@autosnap_2024-10-03_10:30:01_hourly': Invalid argument
cannot receive incremental stream: most recent snapshot of pool/ds does not match incremental source
mbuffer: error: outputThread: error writing to <stdout> at offset 0x501d1a000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR: ssh -c aes128-gcm@openssh.com -S /tmp/host_sync@host.name ' zfs send -I '"'"'pool/ds'"'"'@'"'"'autosnap_2024-09-01_00:30:01_monthly'"'"' '"'"'pool/ds'"'"'@'"'"'autosnap_2024-10-03_11:30:02_hourly'"'"' | lzop | mbuffer -q -s 128k -m 16M' | mbuffer -q -s 128k -m 16M | lzop -dfc | pv -p -t -e -r -b -s 45331110800 | zfs receive -u -o compression=zstd -s -F 'old_pool/host-name/ds' failed: 256 at /usr/sbin/syncoid line 585.

This is Red Hat Enterprise Linux 9.

Happy to clarify further as required.

vaclavskala commented 1 month ago

This is not a hardware problem. I managed to reproduce it on a VM server. The test setup was: an unencrypted pool1 with postgresql (running pg_bench to generate changes), replicated using snapshots to an encrypted pool2, while pool2 was sending its snapshots (unencrypted, no raw send) to /dev/null.

A few times I got this error message about dmu_recv, but I cannot reproduce it on every run. It looks like it is triggered by sending an unencrypted stream from an encrypted dataset, which leaves unencrypted data in memory, and then receiving a new snapshot.

On the production backup server this same situation happens when the source server sends a snapshot to the backup server shortly after this backup server has sent the same dataset on to another backup server (to balance disk usage across servers).
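
A rough, untested sketch of that reproduction idea on scratch file vdevs; the pool names, sizes, key handling, and churn workload are placeholders (the report above used postgresql with pg_bench):

# two throwaway pools: pool1 unencrypted (source), pool2 encrypted (destination)
truncate -s 4G /var/tmp/vdev1 /var/tmp/vdev2
zpool create pool1 /var/tmp/vdev1
echo 'dummy-passphrase-for-testing' > /var/tmp/pool2.key
zpool create -O encryption=on -O keyformat=passphrase \
    -O keylocation=file:///var/tmp/pool2.key pool2 /var/tmp/vdev2
zfs create pool1/db

prev=""
while sleep 60; do
    # ...generate churn in pool1/db here (pg_bench in the original report)...
    cur=$(date +%s)
    zfs snapshot pool1/db@"$cur"
    if [ -z "$prev" ]; then
        zfs send pool1/db@"$cur" | zfs receive -u pool2/db
    else
        zfs send -I @"$prev" pool1/db@"$cur" | zfs receive -u -F pool2/db
    fi
    # non-raw (unencrypted) send of the encrypted copy, discarded
    zfs send pool2/db@"$cur" > /dev/null
    prev=$cur
done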

scratchings commented 3 weeks ago

In case it's significant: my receiving pool is quite full, at 94% (it has been as high as 98%), with ca 20% fragmentation (it's a 650TB pool, so even at 94% there's ca 40TB of free space).