openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS on Linux null pointer dereference #11679

Open segdy opened 3 years ago

segdy commented 3 years ago

System information

Distribution Name: Debian
Distribution Version: 10 (buster)
Linux Kernel: 4.19.0-14 (SMP Debian 4.19.171-2)
Architecture: amd64
ZFS Version: 2.0.3-1~bpo10+1
SPL Version: 2.0.3-1~bpo10+1

Describe the problem you're observing

When I start sending raw ZFS snapshots to a different system, my Linux system (4.19.0-14-amd64) starts to hang. I can ping it and I can still run a few commands (such as dmesg), but most commands hang (including zfs, zpool, htop, ps, ...). The entire system eventually hangs completely.

Dmesg shows the following entries at the time of the occurrence:

[ 2293.134071] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 2293.149707] PGD 0 P4D 0
[ 2293.154752] Oops: 0000 [#1] SMP PTI
[ 2293.161701] CPU: 1 PID: 12576 Comm: receive_writer Tainted: P           OE     4.19.0-14-amd64 #1 Debian 4.19.171-2
[ 2293.182517] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0a 12/21/2015
[ 2293.196819] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2293.205865] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2293.243325] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2293.253741] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2293.267974] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2293.282205] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2293.296434] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2293.310664] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2293.324900] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2293.341053] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2293.352510] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0
[ 2293.366743] Call Trace:
[ 2293.371704]  abd_borrow_buf+0x12/0x40 [zfs]
[ 2293.380104]  abd_borrow_buf_copy+0x28/0x70 [zfs]
[ 2293.389377]  zio_crypt_copy_dnode_bonus+0x36/0x130 [zfs]
[ 2293.400041]  arc_buf_fill+0x3ff/0xb60 [zfs]
[ 2293.408449]  ? zfs_btree_add_idx+0xd0/0x200 [zfs]
[ 2293.417889]  arc_untransform+0x1c/0x70 [zfs]
[ 2293.426461]  dbuf_read_verify_dnode_crypt+0xec/0x160 [zfs]
[ 2293.437466]  dbuf_read_impl.constprop.29+0x4ad/0x6b0 [zfs]
[ 2293.448423]  ? kmem_cache_alloc+0x167/0x1d0
[ 2293.456776]  ? __cv_init+0x3d/0x60 [spl]
[ 2293.464671]  ? dbuf_cons+0xa7/0xc0 [zfs]
[ 2293.472497]  ? spl_kmem_cache_alloc+0x108/0x7a0 [spl]
[ 2293.482583]  ? _cond_resched+0x15/0x30
[ 2293.490071]  ? _cond_resched+0x15/0x30
[ 2293.497542]  ? mutex_lock+0xe/0x30
[ 2293.504402]  ? aggsum_add+0x17a/0x190 [zfs]
[ 2293.512810]  dbuf_read+0x1b2/0x520 [zfs]
[ 2293.520672]  ? dnode_hold_impl+0x350/0xc20 [zfs]
[ 2293.529904]  dmu_bonus_hold_by_dnode+0x126/0x1a0 [zfs]
[ 2293.540186]  receive_object+0x403/0xc70 [zfs]
[ 2293.548906]  ? receive_freeobjects.isra.10+0x9d/0x120 [zfs]
[ 2293.560049]  receive_writer_thread+0x279/0xa00 [zfs]
[ 2293.569962]  ? set_curr_task_fair+0x26/0x50
[ 2293.578319]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.589793]  ? __thread_exit+0x20/0x20 [spl]
[ 2293.598317]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.608410]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 2293.619882]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 2293.629609]  kthread+0x112/0x130
[ 2293.636053]  ? kthread_bind+0x30/0x30
[ 2293.643351]  ret_from_fork+0x35/0x40
[ 2293.650473] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter veth pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) nf_tables nfnetlink vboxdrv(OE) bridge binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel kvm irqbypass crct10dif_pclmul ib_iser joydev crc32_pclmul rdma_cm ghash_clmulni_intel iw_cm intel_cstate ib_cm intel_uncore ib_core intel_rapl_perf configfs ipmi_si sg ipmi_devintf iTCO_wdt iTCO_vendor_support pcc_cpufreq intel_pch_thermal iscsi_tcp ipmi_msghandler libiscsi_tcp libiscsi evdev scsi_transport_iscsi pcspkr tun nfsd auth_rpcgss nfs_acl lockd grace sunrpc lm85 dme1737 hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack
[ 2293.793008]  nf_defrag_ipv6 nf_defrag_ipv4 fuse loop 8021q garp stp mrp llc ecryptfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 raid10 uas usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear hid_generic raid1 usbhid hid md_mod sd_mod ast ahci ttm libahci libata drm_kms_helper drm crc32c_intel igb i2c_i801 dca i2c_algo_bit scsi_mod lpc_ich mfd_core e1000e xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common thermal fan video button
[ 2293.895677] CR2: 0000000000000000
[ 2293.902280] ---[ end trace 164c64ca87be80af ]---
[ 2294.020926] RIP: 0010:abd_verify+0x5/0x60 [zfs]
[ 2294.029975] Code: 0f 1f 44 00 00 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 66 90 0f 1f 44 00 00 8b 07 c1 e8 06 83 e0 01 c3 66 90 0f 1f 44 00 00 <8b> 07 a8 01 74 01 c3 a8 40 74 43 41 54 4c 8d 67 68 55 53 48 8b 47
[ 2294.067433] RSP: 0018:ffffb12e4b6d7a28 EFLAGS: 00010246
[ 2294.077850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2294.092082] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[ 2294.106312] RBP: 0000000000004000 R08: ffff935ec10b70b0 R09: 0000000000000000
[ 2294.120542] R10: 0000000000007130 R11: ffff935d75f984e0 R12: 0000000000004000
[ 2294.134774] R13: 0000000000000000 R14: ffffffffc0fea550 R15: 0000000000000020
[ 2294.149006] FS:  0000000000000000(0000) GS:ffff935ecfb00000(0000) knlGS:0000000000000000
[ 2294.165144] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2294.176600] CR2: 0000000000000000 CR3: 000000001340a001 CR4: 00000000000606e0

Interestingly, the transfer continues happily, but everything else in the system hangs. The only way to recover is resetting the machine (since not even reboot works).

Describe how to reproduce the problem

It's a tough one. It seems to me that the issue might be load-related in some sense, since it only occurs if I have two zfs sends (via syncoid) running in parallel that involve encrypted datasets.

Transfer 1

The first one sends from an unencrypted dataset into an encrypted one (I am migrating to encryption).

I use syncoid with the command:

syncoid -r --skip-parent --no-sync-snap zpradix1imain/sys/vz zpradix1imain/sys/vz_enc

This translates into:

zfs send -I 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_hourly-2021-03-02-1917' 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_frequent-2021-03-02-1932' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 16392592 | zfs receive -s -F 'zpradix1imain/sys/vz_enc/main'

Transfer 2

I transfer data from an encrypted dataset raw to a secondary server. The syncoid command is:

syncoid -r --skip-parent --no-sync-snap --sendoptions=w --exclude=zfs-auto-snap_hourly --exclude=zfs-auto-snap_frequent zpradix1imain/data root@192.168.200.12:zpzetta/radix/data

This translates into:

zfs send -w 'zpradix1imain/data/home'@'vicari-prev' | pv -s 179222507064 | lzop | mbuffer -q -s 128k -m 16M 2>/dev/null | ssh ...

In summary:

scratchings commented 2 years ago

Interesting observation. It panicked again whilst I was preparing the above logs, so I've just restarted again. Fresh system, running z_upgrade only.

zpool status -x

Shows data errors including

backup_pool/h9.local/fs2/fs8:<0x0>

By hand I ran the recursive syncoid to backup_pool/h9.local/fs2:

sudo /usr/local/bin/syncoid --no-sync-snap --create-bookmark -r --skip-parent syncoid@h9.local:sfs1/sfs2 backup_pool/h9.local/fs2
NEWEST SNAPSHOT: autosnap_2022-01-18_12:30:01_hourly
INFO: no snapshots on source newer than autosnap_2022-01-18_12:30:01_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-01-18_12:30:01_hourly
INFO: no snapshots on source newer than autosnap_2022-01-18_12:30:01_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-01-18_12:30:01_hourly
INFO: no snapshots on source newer than autosnap_2022-01-18_12:30:01_hourly on target. Nothing to do, not syncing.
Resuming interrupted zfs send/receive from sfs1/sfs2/fs8 to backup_pool/h9.local/fs2/fs8 (~ 1.1 MB remaining):
 495kiB 0:00:00 [4.18MiB/s] [=====================================================> ] 45%

Message from syslogd@myhost.local at Jan 18 13:22:21 ... kernel:VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@fs3 at Jan 18 13:22:21 ... kernel:PANIC at dmu_recv.c:1811:receive_object()

Sorry, I hadn't had a chance to set the debug mode for the module.

Could these repeated crashes be due to hangovers from previous crashes, such that multiple scrubs would need to run to completion to clear them? Where possible I've been deleting the 'corrupt' file systems, but in some cases we don't wish to lose the fs history. After a couple of scrubs (historically) we have seen all data 'errors' go away, but as a scrub takes ca. 3 days (so we'd be looking at a week of downtime) we haven't been doing this regularly.

rincebrain commented 2 years ago

You shouldn't need to delete them - I believe the errors are "just" the result of the same codepath that results in it VERIFY failing, e.g. a race that it's losing, and not actual on-disk mangling, so you should be able to scrub twice and it'll shut up about the errors, since they're not the result of actual checksum failures.
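
For reference, a minimal sketch of that scrub-twice approach (pool name taken from the output above; each scrub has to run to completion before the error list is re-evaluated):

zpool scrub backup_pool
zpool status backup_pool        # wait for the first scrub to finish
zpool scrub backup_pool         # a second clean scrub ages out the stale error entries
zpool status -v backup_pool     # the spurious errors should now be gone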

That said, it's quite interesting that you can so reliably reproduce it.

scratchings commented 2 years ago

So I started and stopped a scrub, and all the deleted ones (e.g. <0x0> only) went away, but the two listed for the FS mentioned above remained. This folder isn't particularly critical so I've blown it away, and a clean sync by hand has completed without a crash. I'd seen this behaviour before on a sync coming from a FreeBSD 11 host, i.e. a 100% reproducible crash on trying the sync. That host doesn't support resumes. Its sync has been turned off for some time as it was less important than the other file systems. In the end I destroyed the local copy and started afresh, but that hadn't completed the incrementals from the base snapshot before the first of the cascade of crashes this morning.

rincebrain commented 2 years ago

That's curious and possibly a bug that just starting and stopping a scrub made those go away, I think.

It's not impossible that it could get into a state where you manage that every time, it's just not something I've seen happen before with this bug, and would surprise me at least a little.

How big are the datasets and snapshots you're replicating? I'm particularly wondering if receiving the same thing again in a separate destination would fail the same way...

(In particular, if it would, we had some luck paring down a large dataset to just the objects that caused the error in another bug (#12720), and then they felt comfortable sharing the encrypted send stream of just those objects so it could be reproduced.

...I haven't debugged it yet, because I've been triaging encryption bugs by number of people reporting them, but.)

scratchings commented 2 years ago

The datasets range from a few GB to tens of TB of active data, with snapshot sizes ranging from KB to hundreds of GB (and occasionally TB), so whilst this is possible to test for some, it certainly isn't for others.

rincebrain commented 2 years ago

@scratchings do you have any ability/interest in trying to pare this down to a minimal reproducer you could share the encrypted stream of (assuming it's always blowing up at the same point, at least)? Otherwise I could throw more locking at the problem, but without being able to reproduce it and see where it's going awry (since the point where it panics in dbuf_read is after another thread has burned the house down, I think), it'd be somewhat blind guess-and-check.

scratchings commented 2 years ago

Sorry, the existing datasets here aren't something I can share. It doesn't seem to be tied to a particular dataset either, so I'm not sure of the value of this anyway, other than that it does seem to be more crash-happy if it tries to resume a failed transfer (but not always). Having said that, the receive it crashed on this morning is from a system whose ZFS is too old to support resuming transfers. Late last week I spread out my sends/receives on both ends and was able to keep the system running all weekend, but then it crashed this morning :-( Is there any value in more crash logs?

arussell commented 2 years ago

Hopping over from #8722 which appears to be a less active duplicate of this issue.

Tried to follow the advice of using unmounted datasets before receiving and using zfs recv -u as suggested earlier in this thread (turns out we were already unmounting datasets before, but without the -u flag on zfs recv). Unfortunately this does not mitigate the crashes.
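
For anyone reproducing this via syncoid, the -u flag can be passed through as a receive option; a sketch with placeholder dataset names (the --recvoptions flag also appears in a later comment in this thread):

syncoid --no-sync-snap --recvoptions=u source_pool/dataset backup_pool/dataset

which makes syncoid run zfs receive -u ... underneath.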

It does not seem to be directly reproducible with any specific part of the encrypted streams, as a restart and retry of the exact same data will succeed, until it gets into that race condition again and ends up Oopsing.

dmesg log from most recent crash just a few moments ago:

```
[579652.133898] BUG: kernel NULL pointer dereference, address: 0000000000000000
[579652.133979] #PF: supervisor read access in kernel mode
[579652.134031] #PF: error_code(0x0000) - not-present page
[579652.134079] PGD 0 P4D 0
[579652.134110] Oops: 0000 [#1] SMP PTI
[579652.134150] CPU: 1 PID: 1731906 Comm: receive_writer Tainted: P O 5.4.0-99-generic #112-Ubuntu
[579652.134242] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b 11/16/2012
[579652.134418] RIP: 0010:abd_verify+0xa/0x40 [zfs]
[579652.134469] Code: ff 85 c0 74 12 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 5b 5d c3 e8 04 ff ff ff eb e7 c3 90 55 48 89 e5 41 54 53 48 89 fb <8b> 3f e8 0f ff ff ff 85 c0 75 22 44 8b 63 1c 48 8b 7b 20 4d 85 e4
[579652.134636] RSP: 0000:ffffaa4b69f37998 EFLAGS: 00010282
[579652.134686] RAX: 0000000000004000 RBX: 0000000000000000 RCX: 0000000000000000
[579652.134754] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[579652.134820] RBP: ffffaa4b69f379a8 R08: 000000000000021f R09: 000000000000033d
[579652.134885] R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000004000
[579652.134948] R13: ffff8ce55d35e960 R14: 0000000000004000 R15: 0000000000000000
[579652.135012] FS: 0000000000000000(0000) GS:ffff8ce633640000(0000) knlGS:0000000000000000
[579652.135088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[579652.135143] CR2: 0000000000000000 CR3: 000000032ac1e002 CR4: 00000000000206e0
[579652.135212] Call Trace:
[579652.135334]  abd_borrow_buf+0x19/0x60 [zfs]
[579652.135479]  abd_borrow_buf_copy+0x1a/0x50 [zfs]
[579652.135696]  zio_crypt_copy_dnode_bonus+0x30/0x130 [zfs]
[579652.135850]  arc_buf_untransform_in_place.isra.0+0x2b/0x40 [zfs]
[579652.136033]  arc_buf_fill+0x1f0/0x4a0 [zfs]
[579652.136185]  arc_untransform+0x22/0x90 [zfs]
[579652.136339]  dbuf_read_verify_dnode_crypt+0xed/0x160 [zfs]
[579652.136519]  ? atomic_cmpxchg+0x16/0x30 [zfs]
[579652.136655]  dbuf_read_impl+0x117/0x610 [zfs]
[579652.138249]  ? atomic64_add_return+0x12/0x30 [zfs]
[579652.139868]  dbuf_read+0xcb/0x5f0 [zfs]
[579652.141498]  ? dbuf_hold_impl+0x2f/0x40 [zfs]
[579652.142964]  dmu_tx_check_ioerr+0x70/0xd0 [zfs]
[579652.144196]  dmu_tx_hold_free_impl+0x12c/0x240 [zfs]
[579652.145803]  dmu_tx_hold_free+0x40/0x50 [zfs]
[579652.147067]  dmu_free_long_range_impl+0x124/0x350 [zfs]
[579652.148679]  dmu_free_long_range+0x74/0xc0 [zfs]
[579652.150271]  dmu_free_long_object+0x27/0xc0 [zfs]
[579652.151693]  receive_freeobjects.isra.0+0x7a/0x100 [zfs]
[579652.152955]  receive_process_record+0x89/0x1c0 [zfs]
[579652.154154]  receive_writer_thread+0x9a/0x150 [zfs]
[579652.155391]  ? receive_process_record+0x1c0/0x1c0 [zfs]
[579652.156821]  thread_generic_wrapper+0x83/0xa0 [spl]
[579652.158468]  kthread+0x104/0x140
[579652.159960]  ? clear_bit+0x20/0x20 [spl]
[579652.161287]  ? kthread_park+0x90/0x90
[579652.162557]  ret_from_fork+0x35/0x40
[579652.163795] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid cfg80211 binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) intel_powerclamp coretemp kvm_intel kvm ipmi_si intel_cstate ipmi_devintf ipmi_msghandler serio_raw joydev input_leds ioatdma i7core_edac i5500_temp dca mac_hid sch_fq_codel msr ip_tables x_tables autofs4 btrfs zstd_compress dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 drm_vram_helper i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect aesni_intel gpio_ich mptsas sysimgblt hid_generic crypto_simd fb_sys_fops uas mptscsih usbhid cryptd glue_helper psmouse drm usb_storage hid i2c_i801 pata_acpi lpc_ich mptbase e1000e scsi_transport_sas
[579652.176894] CR2: 0000000000000000
[579652.178309] ---[ end trace 3361ad432b8a694a ]---
[579652.179753] RIP: 0010:abd_verify+0xa/0x40 [zfs]
[579652.181071] Code: ff 85 c0 74 12 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 5b 5d c3 e8 04 ff ff ff eb e7 c3 90 55 48 89 e5 41 54 53 48 89 fb <8b> 3f e8 0f ff ff ff 85 c0 75 22 44 8b 63 1c 48 8b 7b 20 4d 85 e4
[579652.183872] RSP: 0000:ffffaa4b69f37998 EFLAGS: 00010282
[579652.185271] RAX: 0000000000004000 RBX: 0000000000000000 RCX: 0000000000000000
[579652.186662] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[579652.188212] RBP: ffffaa4b69f379a8 R08: 000000000000021f R09: 000000000000033d
[579652.189898] R10: 0000000000100000 R11: 0000000000000000 R12: 0000000000004000
[579652.191924] R13: ffff8ce55d35e960 R14: 0000000000004000 R15: 0000000000000000
[579652.193869] FS: 0000000000000000(0000) GS:ffff8ce633640000(0000) knlGS:0000000000000000
[579652.195718] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[579652.197111] CR2: 0000000000000000 CR3: 000000032ac1e002 CR4: 00000000000206e0
```

Edit to add: At one point we suspected it might be an issue with the ARC, as it was mentioned in the stack traces, so we tried to rule this out by setting primarycache to none on the receiving file system. It didn't help, unfortunately.

rincebrain commented 2 years ago

Yeah, some of the crashes are more deterministic than others.

But I promise you, even with primarycache=none, some things still go through ARC, and that is certainly an ARC crash. I even know what's happening, though not which part of the code is incorrectly configured.

Very simply, there's an object type that's special cased everywhere to never, ever be duplicated, so when the code would normally make a copy of it, it instead just shrugs and gives everyone the same one.

Unfortunately, one of the codepaths involves temporarily clobbering the pointers to data in the structure and then reconstructing them at a later point in the code flow (the codepath is writing things out to disk and then since the buffers are now stale, rereading them anew later).

This is fine if you're the only one using it, less so if a bunch of people are, and if someone tries to read from it at the same time, boom goes the dynamite.

I had a WIP patch that added a big amount of locking around it, as a workaround, but then it needed more locking elsewhere too, so rather than do that I've been trying to teach it to just not leave those buffers NULL, but the code has a lot of places that assume it's NULL between where it drops out and where it reallocates...

scratchings commented 2 years ago

We are currently trying out limiting our receives to a single instance by preventing multiple runs of syncoid. N.B. our transfers aren't being sent in encrypted 'raw' form, but instead are being decrypted, transferred over ssh, and re-encrypted on receipt, just in case that's useful information. Our machine has been up for 4 days now, a new record!
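
One simple way to enforce that single instance is to wrap the scheduled invocation in flock; a sketch with an illustrative lock path and placeholder dataset names:

flock -n /var/lock/syncoid.lock /usr/local/bin/syncoid --no-sync-snap -r source_pool/data backup_pool/data || echo "another syncoid run holds the lock, skipping"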

arussell commented 2 years ago

Can confirm that limiting concurrent receives does help with stability, and the greater the number of concurrent receives, the more frequently crashes happen. We have seen that it can sometimes still fall over even with a single receive, so it isn't a perfect workaround, but it does mean fewer manual interventions to restart servers.

rincebrain commented 2 years ago

You could also set the spl module parameter spl_panic_halt, and then (presuming you don't have it set to kdump on panic instead) set the sysctl "kernel.panic" to some number of seconds, so the machine reboots that long after an actual panic triggers.
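
Roughly, as a sketch (the 30-second value is arbitrary):

echo 1 > /sys/module/spl/parameters/spl_panic_halt   # make SPL VERIFY/PANIC failures trigger a real kernel panic
sysctl -w kernel.panic=30                            # reboot automatically 30 seconds after a panic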

Not my favorite workaround, but beats manual intervention required.

scratchings commented 2 years ago

We've found that limiting receives to a single instance doesn't help massively; we are still seeing panics every day or two. It seems to really affect systems that both send and receive. Our original setup was:

All Sources > A > B

And A crashed all the time, B rarely, and we switched to:

Sources@A > A
Sources@B > B
A > B
B > A (where originated at B)

Arranged such that A and B pull all data from the sources and so can control when a receive takes place. Now we see B crash at least as regularly as A, if not more. A is running 2.0.6 with the initial locking patch; B is running 2.0.7 stock, mainly because my attempts to build either release with the patch against the current CentOS 7.9 kernel, whilst successful, resulted in an unloadable kernel module.

I'm not sure if I've mentioned this before, but having noticed arussell's comment about encrypted streams: our streams are NOT encrypted (i.e. the data is decrypted, sent over SSH and re-encrypted at the target with a different key).

rincebrain commented 2 years ago

The streams being encrypted or not over the wire doesn't matter; it's whether they're being received encrypted (i.e. the destination dataset is encrypted) that matters, I believe.
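
In other words, both of the following end up as encrypted receives, which appears to be the case that matters here (dataset and host names are placeholders):

zfs send -w tank/enc@snap | ssh backuphost zfs receive backup_pool/enc        # raw: blocks arrive still encrypted with the source key
zfs send tank/enc@snap | ssh backuphost zfs receive backup_pool/encrypted/ds  # non-raw: decrypted on send, re-encrypted under the destination's key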

scratchings commented 2 years ago

Hi, re: patch c53dfa6 - is this in addition to your previous patch (the one starting "diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 6ae67c99872..c309204a9f7 100644"), or is the earlier patch no longer necessary?

rincebrain commented 2 years ago

I wouldn't suggest trying anything unless I explicitly come by and suggest it, as it's all haphazard experiments.

Unfortunately, just fixing the NULL dereference doesn't save you from different assertion failures later, so...

scratchings commented 2 years ago

OK, I'll hold fire on trying this out. We're just keen to have a resolution.

rincebrain commented 2 years ago

Oh, I know. I have no more desire for people to keep reporting the same problem and get a shrug back than you do to roll the bones every time you do a zfs recv.

I'm testing another couple of workarounds, having isolated which feature interaction is to blame for this in at least my reproducers, but not what the correct behavior should be instead, so they just disable it, which, by definition, breaks a few things. I'll link a patch once I'm confident, in the next day or so, that it doesn't secretly burn the house down worse, if people really want to try it, but obviously I can't verify that my reproduction setup is breaking for the same reason everyone else's is.

scratchings commented 2 years ago

Great, I'll keep an eye out for it. Thanks

lheckemann commented 2 years ago

@scratchings could you edit your big comment to put the base64 data in a <details>? That will make this issue easier to browse again :)

scratchings commented 2 years ago

Done

rincebrain commented 2 years ago

I haven't forgotten what I said, but trying to fix the assertion failures and test failures that the initial workaround produced only produced even more failures, and I'd rather not break half the use cases while working around this. :)

zrav commented 2 years ago

And a different stack trace for what looks like the same issue to me. Sending a specific encrypted dataset with -w -p immediately and reproducibly throws:

[471437.877758] BUG: kernel NULL pointer dereference, address: 0000000000000030
[471437.878469] #PF: supervisor read access in kernel mode
[471437.879051] #PF: error_code(0x0000) - not-present page
[471437.879586] PGD 0 P4D 0
[471437.880113] Oops: 0000 [#5] SMP NOPTI
[471437.880635] CPU: 4 PID: 3829437 Comm: zfs Tainted: P      D    O      5.15.0-30-generic #31-Ubuntu
[471437.881162] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570D4U, BIOS P1.20 05/19/2021
[471437.881693] RIP: 0010:dmu_dump_write+0x3bf/0x520 [zfs]
[471437.882288] Code: b4 0f 85 a7 00 00 00 41 c7 44 24 54 00 00 00 00 49 8b 0c 24 e9 a7 fc ff ff 31 c0 45 39 ce 0f 95 c0 44 09 c0 0f 84 7d fd ff ff <4c> 8b 3c 25 30 00 00 00 45 85 c0 0f 85 0f fd ff ff 49 c1 ef 20 49
[471437.883399] RSP: 0018:ffffa391a23af6d8 EFLAGS: 00010206
[471437.883947] RAX: 0000000001000000 RBX: ffff958fd3cb9a00 RCX: 0000000000000000
[471437.884498] RDX: 0000000000000009 RSI: 0000000000000013 RDI: ffff958fd3cb9b38
[471437.885055] RBP: ffffa391a23af728 R08: 0000000001000000 R09: 0000000000020000
[471437.885593] R10: 0000000000000009 R11: 0000000000020000 R12: ffffa391a23af868
[471437.886127] R13: 0000000000000000 R14: 0000000000020000 R15: 0000000000000000
[471437.886657] FS:  00007f9558e647c0(0000) GS:ffff95977eb00000(0000) knlGS:0000000000000000
[471437.887193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[471437.887727] CR2: 0000000000000030 CR3: 000000064a1fe000 CR4: 0000000000350ee0
[471437.888271] Call Trace:
[471437.888804]  <TASK>
[471437.889322]  ? wait_woken+0x70/0x70
[471437.889838]  do_dump+0x374/0x510 [zfs]
[471437.890396]  ? issig+0x34/0xc0 [spl]
[471437.890907]  ? spl_kmem_free+0xe/0x10 [spl]
[471437.891410]  ? range_free+0x31/0x130 [zfs]
[471437.891959]  dmu_send_impl+0x6e7/0xbf0 [zfs]
[471437.892493]  ? dbuf_rele+0x39/0x50 [zfs]
[471437.893030]  dmu_send+0x4f8/0x840 [zfs]
[471437.893562]  ? __kmalloc_node+0x166/0x3a0
[471437.894025]  ? i_get_value_size+0x1e/0x1c0 [znvpair]
[471437.894494]  ? nvpair_value_common+0x9a/0x160 [znvpair]
[471437.894958]  zfs_ioc_send_new+0x167/0x1d0 [zfs]
[471437.895486]  ? dump_bytes_cb+0x30/0x30 [zfs]
[471437.896012]  zfsdev_ioctl_common+0x285/0x740 [zfs]
[471437.896533]  ? _copy_from_user+0x2e/0x60
[471437.896987]  zfsdev_ioctl+0x57/0xe0 [zfs]
[471437.897492]  __x64_sys_ioctl+0x91/0xc0
[471437.897932]  do_syscall_64+0x5c/0xc0
[471437.898361]  ? exit_to_user_mode_prepare+0x37/0xb0
[471437.898791]  ? syscall_exit_to_user_mode+0x27/0x50
[471437.899214]  ? __x64_sys_write+0x19/0x20
[471437.899632]  ? do_syscall_64+0x69/0xc0
[471437.900047]  ? do_user_addr_fault+0x1e3/0x670
[471437.900448]  ? syscall_exit_to_user_mode+0x27/0x50
[471437.900852]  ? exit_to_user_mode_prepare+0x37/0xb0
[471437.901250]  ? irqentry_exit_to_user_mode+0x9/0x20
[471437.901642]  ? irqentry_exit+0x19/0x30
[471437.902028]  ? exc_page_fault+0x89/0x160
[471437.902405]  ? asm_exc_page_fault+0x8/0x30
[471437.902785]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[471437.903167] RIP: 0033:0x7f95596e2aff
[471437.903544] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[471437.904332] RSP: 002b:00007ffd7cc5ac90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[471437.904725] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f95596e2aff
[471437.905107] RDX: 00007ffd7cc5ad10 RSI: 0000000000005a40 RDI: 0000000000000005
[471437.905478] RBP: 00007ffd7cc5e300 R08: 0000000000000000 R09: 000055a58adc59a0
[471437.905846] R10: 000055a58adc2a00 R11: 0000000000000246 R12: 0000000000005a40
[471437.906205] R13: 00007ffd7cc5ad10 R14: 0000000000005a40 R15: 000055a58adc59a0
[471437.906563]  </TASK>
[471437.906920] Modules linked in: cpuid vhost_net tap unix_diag binfmt_misc veth tls shiftfs vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink nvme_fabrics bridge ipmi_ssif nls_iso8859_1 zfs(PO) intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd snd_hda_intel zunicode(PO) snd_intel_dspcfg zzstd(O) snd_intel_sdw_acpi zlua(O) kvm_amd snd_hda_codec kvm snd_hda_core snd_hwdep zcommon(PO) znvpair(PO) rapl wmi_bmof efi_pstore snd_pcm zavl(PO) icp(PO) k10temp snd_timer spl(O) snd soundcore ccp joydev input_leds acpi_ipmi ipmi_si mac_hid sch_fq_codel 8021q garp mrp dm_multipath scsi_dh_rdac stp scsi_dh_emc llc scsi_dh_alua nfsd ipmi_devintf ipmi_msghandler auth_rpcgss nfs_acl lockd grace msr sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
[471437.906959]  async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cdc_ether syscopyarea sysfillrect usbnet sysimgblt usbhid fb_sys_fops hid mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec igb mpt3sas rc_core aesni_intel nvme crypto_simd dca ahci raid_class xhci_pci cryptd drm scsi_transport_sas i2c_piix4 libahci nvme_core xhci_pci_renesas i2c_algo_bit wmi [last unloaded: cpuid]
[471437.912347] CR2: 0000000000000030
[471437.912875] ---[ end trace dc4b1daeff6b21e2 ]---
[471438.080797] RIP: 0010:dmu_dump_write+0x3bf/0x520 [zfs]
[471438.081442] Code: b4 0f 85 a7 00 00 00 41 c7 44 24 54 00 00 00 00 49 8b 0c 24 e9 a7 fc ff ff 31 c0 45 39 ce 0f 95 c0 44 09 c0 0f 84 7d fd ff ff <4c> 8b 3c 25 30 00 00 00 45 85 c0 0f 85 0f fd ff ff 49 c1 ef 20 49
[471438.082575] RSP: 0018:ffffa391c9d276e0 EFLAGS: 00010206
[471438.083148] RAX: 0000000001000000 RBX: ffff958c8695f600 RCX: 0000000000000000
[471438.083725] RDX: 0000000000000009 RSI: 0000000000000013 RDI: ffff958c8695f738
[471438.084303] RBP: ffffa391c9d27730 R08: 0000000001000000 R09: 0000000000020000
[471438.084884] R10: 0000000000000009 R11: 0000000000020000 R12: ffffa391c9d27870
[471438.085461] R13: 0000000000000000 R14: 0000000000020000 R15: 0000000000000000
[471438.086043] FS:  00007f9558e647c0(0000) GS:ffff95977eb00000(0000) knlGS:0000000000000000
[471438.086628] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[471438.087215] CR2: 0000000000000030 CR3: 000000064a1fe000 CR4: 0000000000350ee0

rincebrain commented 2 years ago

Okay, here we go.

Please read the entire post before choosing to try this, I have tried to enumerate and avoid possible negative effects, but that does not necessarily mean they can't happen, or that I got them all.

So, it appears the culprit for why the encrypted send/recv code semi-randomly panics for some people on NULL dereferencing when it shouldn't is... ...quota accounting?

No, not the feature flag "userobj_accounting" - this goes all the way back to ZFS filesystem version 4: "4 userquota, groupquota properties".
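
(That line is from the filesystem version table printed by zfs upgrade -v, abbreviated here:)

zfs upgrade -v
 ...
 4   userquota, groupquota properties
 ...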

How does this functionality get implemented? By going through after e.g. send/recv and marking every single object dirty, so it gets evaluated in the dirty list and processed for quota.

Why is that a problem? Good question; if you figure it out before I do, let me know. My speculation at the moment is that something is in a weird state in memory where it's not being treated as dirty but has dirty data, and the dbuf_read et al. in dmu_buf_will_dirty are clobbering it, but I have no good proof yet of any particular cause, just that the problem stops if I make it stop running this segment of code.

In the meantime, I noticed that:

So, this patch just stubs out any call into dmu_objset_space_upgrade to return (ENOTSUP, in the cases where it has a return code) instead of upgrading at all. There's a tunable if you want to test with and without stopping calls into that.

Could I have made it have a different exit status? Yes. Could I have lied and told it the accounting was now complete? Also yes. But either of those seems like it would require more hand-massaging later to recover from.

Now the traditional question - "what broke"?

[Screenshot: "what broke" - test suite failures observed with the 11679 workaround applied]

So, anything to do with user/group/projectquota will fail tests; the encryption errata #4 tests failed assertions sometimes, so I would avoid this if you have any datasets affected by that. I can't tell you whether the failures in the above screenshot that aren't from those features are unrelated, because they didn't happen for me, and GitHub is hanging on trying to download the logfiles from any of multiple machines and I cannot tell you why.

It hasn't panicked my three testbeds in ~12h of running ZTS in loops, between a couple of reasonably fast x86 and the little SPARC reproducer from before, but that doesn't mean it can't. I don't think this can do any worse than just break your accounting and possibly cause an ASSERT/VERIFY failure to hang something, similar to the roll of the dice you're already taking with using native encryption send/recv, but I am not all knowing.

I'm not planning to try opening a PR for this because it's a terrible hack. Hopefully I or someone else figures out pretty soon why this particular code, which long predates native encryption, breaks on interacting with native encryption...

scratchings commented 2 years ago

Hi,

I've been running the no-quota version for a few days now and had my first crash last night, BUT it wasn't the panic; it was a recurrence of an OOM-kill event on zpool that we were experiencing on 0.8 (the occurrence of which encouraged me to try upgrading to 2.0 in the first place). So this is positive, but still not great for me. I need to migrate my data from one system to another; if I do this using the no-quota release and then switch back to the quota-on version, will there be a way to get the system to calculate the quota objects? I note that z_upgrade is running on the freshly rebooted machine - my understanding was that this was recalculating quota usage, is this correct?

rincebrain commented 2 years ago

AIUI, the accounting metadata is not included if you're doing send/recv (which is what I'm assuming you mean by migrating), so it'll happily go and calculate it on recv.

If you mean without send/recv, I believe one could patch it to just act like whenever you mount a dataset, the accounting metadata isn't there, and go make it anew, but I don't have that laying around at the moment.

clhedrick commented 2 years ago

I'm a bit concerned that the discussion in the leadership meeting for June, and possibly the recent explanation here, seem to see this as happening on the receive side. We lost a file system recently on the send side. So it looks like there are problems on both send and receive. For the moment we've backed out of encryption completely. Our systems are key to our department; we can't afford failures.

rincebrain commented 2 years ago

My advice of not using native encryption hasn't changed in a long time, and probably won't for years after all the known issues are fixed, given the lack of work being put into it. This bug in particular was reported before it was even in a stable release, originally (#8722), but here we are 3 years later.

My reproducer on Linux is on the recv side, but the workarounds discussed there are not specific to it. If you have a crash backtrace that isn't from the receive side, I'd love to see it, but I don't immediately see that anyone has already reported one that I've forgotten.

scratchings commented 2 years ago

The 'no quota' patch has not helped with our stability; it may even be worse. We don't see the same zfs panics; it's mostly OOM Killer events which end with the death of zpool and a complete system hang-up, with no obvious growth in RAM usage by anything in a top session running at the times of crashing, but we have also seen a kernel panic after a cascade of OOM kills. Today we had a silent hang-up on the other machine running this version - no console messages, just totally unresponsive. We've also seen a scrub generate 'corrupt' dataset warnings for various snapshots when a client machine (which wasn't sending at the time, the backup host having crashed) was under severe memory shortage (kswapd was in full flow). Once we'd rectified the memory shortage (by reducing ARC and dropping kernel caches), further scrubs detected more corruption. Having removed the snapshots and rebooted, all is now well again on that system. This machine is running stock 2.0.7/CentOS 7.9 but with an earlier pool version (should we need to revert to an earlier release).

rincebrain commented 2 years ago

Interesting.

That might actually be a separate problem, related to why you're hitting this bug when many people don't - I was previously able to reproduce this only on a very memory-limited system, about half the time, right before it panicked from triggering the OOM killer a bunch and then giving up because memory was so fragmented it couldn't satisfy some kernel allocations even after that. But I've had systems running that patch doing send/recvs in a loop for weeks without RAM growing, so I don't think it inherently causes that kind of problem in general...

I'm not clear, are you saying that it's writing incorrect data when it's in severe memory pressure, and you're finding it later, or something else? I'm not immediately aware of any bugs that would do that existing in a long time. (The closest I can think of is #13555, but that's very specialized, and doesn't explain the original source of the errors.)

I could share one of the patches I have that avoids the panicking behavior without disabling quota calculation, but since what you're describing seems like it's not necessarily related to my patch except that your system stays up longer, I'm not sure it would be of benefit to you.

It would be useful if you could share the OOM output, and/or if you logged /proc/buddyinfo /proc/slabinfo and /proc/spl/kmem/slab over time, as well as /proc/spl/kstat/zfs/abdstats, to try and see where all the memory is going...I don't recall seeing any drastic reports of memory leakage...
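
A minimal sketch of the kind of periodic capture meant here (interval and log path are arbitrary):

while true; do
    date >> /var/log/zfs-mem.log
    cat /proc/buddyinfo /proc/slabinfo /proc/spl/kmem/slab /proc/spl/kstat/zfs/abdstats >> /var/log/zfs-mem.log
    sleep 60
done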

scratchings commented 2 years ago

On the second, client machine: it was running its monthly scheduled scrub and ZED sent an email to notify of the corrupt content. Because of my previous experiences I destroyed the snapshots in question and rescrubbed (with concomitant identification of further 'damaged' snapshots), but in hindsight I guess I should have rebooted without removing them and then scrubbed, to see if these disappeared from the corruption list (as has been the case on the more heavily affected machines). When I initially logged in, kswapd was utilising most of a CPU and swap had been used up. It's not a light-memory machine (96GB, with a 32GB arc_max - it was 56GB until recently - and an ~11TB pool), but it does have some fairly high-memory-load processes (e.g. a Java JVM). No matter what I did to reduce memory usage (dropping the ARC max size, dropping file system caches, which got swapping under control), scrubs still returned errors; only a reboot fixed it. I guess that ties in with fragmentation.

As to the main server issue, I'll set up some captures as you suggest, but my logging so far suggests it's a fairly instantaneous issue. The system is apparently too far gone for any OOM Killer messages to make it into the system logs, so all I have is the truncated output on the console. An example is attached.

[Screenshot: console OOM Killer output, 2022-06-15 09:13:54]

This machine is much lighter on the RAM front (64GB RAM, 32GB arc_max, 654TB pool), and other than a few periodic rsyncs and a test minio instance, it runs nothing else.

rincebrain commented 2 years ago

Curious.

That's...a really fast OOM trigger. 45 seconds after boot...I wonder why I've run that patch for extended intervals without any issue and you've not. It'd be interesting to know where your system's RAM is going, in general.

Does it stop OOMing like that with the module parameter toggled to off?

scratchings commented 2 years ago

Indeed. I can't quite remember the scenario that triggered that particular one, but I have seen some situations where attempting to resume a receive that was in progress at the time of a crash results in a repeatable instant crash. Having said that, it takes more than 45s to boot and import the pool on a normal boot, so I don't understand the timings. I have seen other situations where much more than just zpool was killed; it's hard to tell more as the console output disappears off the top.

Do you mean /proc/sys/kernel/hung_task_timeout_secs ?

rincebrain commented 2 years ago

No; bd8a978a5447c5054e562fea0b3e9138dd4c9f7 adds a new module parameter, which I believe shows up as zfs_dmu_objset_no_upgrades - and if it's set to 0 (which is not the default), the code should behave exactly as it did without the patch.
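
Assuming the patched module exposes it in the usual place, toggling it would look roughly like this (the sysfs path is an assumption):

echo 1 > /sys/module/zfs/parameters/zfs_dmu_objset_no_upgrades   # default with the patch: skip the quota-accounting upgrade
echo 0 > /sys/module/zfs/parameters/zfs_dmu_objset_no_upgrades   # behave as if unpatched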

scratchings commented 2 years ago

OK, I don’t have that patch applied to the version I’m currently running.

Yesterday the mirror of the main server (which is also running the quota-support-removed patch) suffered a zfs stuck event requiring a forced reboot. I was able to get a few snippets of information: it certainly hadn't run out of RAM, and 'mount' returned an input/output error on /etc/mtab. Attached is the dmesg output: zfs_softlockup.dmesg.gz

As to the patch, should that be applied to vanilla 2.0.7 or to your fork (which the system is currently running)?

rincebrain commented 2 years ago

The commit linked should be in the quota disable patch already; you shouldn't need to apply it separately.

You actually just replied as I was figuring out how to phrase "here's another workaround I have that I haven't found a way to crash yet, feel free to try it and see if you also OOM with that or not; if you don't want to try a random patch again I won't be offended." 92cc3d2a7503f87c93769d2cf817089a1971213d is the change; it should apply cleanly modulo exact line numbers to basically anything since that function hasn't really churned, and if it doesn't, you could just copy and paste the entire new contents of the function over. That change is independent of the quota disabling; you shouldn't need the quota disabling with that patch.
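
For anyone wanting to test it, one way to apply a single commit like that (both lines assume a zfs source checkout; the .patch URL is GitHub's standard per-commit patch view):

git cherry-pick 92cc3d2a7503f87c93769d2cf817089a1971213d
# or, without the commit fetched locally:
curl -L https://github.com/openzfs/zfs/commit/92cc3d2a7503f87c93769d2cf817089a1971213d.patch | patch -p1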

I'll go look at the log as well.

edit: Neat - at a quick attempted understanding, that seems like it's getting stuck trying to unmount the old filesystem on doing a recv and clone-swap-in. I'm not immediately sure why, since I think the quota disable should just behave the same as if there's no quota enabled at all, unless there's a functionality check I've overlooked... I may go looking for that later, thanks, though I'll probably prioritize trying to get a proper fix done first instead of these hacks...

clhedrick commented 2 years ago

You asked for a backtrace of our send-side failure. Here it is. The next line in kern.log is from a reboot an hour later. There are no further backtraces. At that point our problems were with ZFS at a user level. I don't have a full narrative of everything we saw; I apologize. The system has 512GB of memory, a bunch of disks as RAIDZ1s, a mirror of SSDs as a special vdev for metadata, and a mirror of SSDs as a SLOG. Quotas are in use, but I don't know whether the file system that was being sent used quotas.

Recovery options were limited by the fact that it takes a few days to do a scrub, but downtime of the main file systems on it is a real problem. So I went with a mode of recovery I was confident would work (rebuilding the whole thing from backup, with the most commonly used file systems first, and no encryption). The backup system was also encrypted (that's where the send was going), but the restore was unencrypted. I believe at the time the system was around 500 TB, with 200 TB in use.

I believe the file system was being sent and simultaneously used by NFS.

This is Ubuntu 20, with the ZFS that comes with it.

zfs.bug.txt

I trust patches to fix this won't destabilize unencrypted use. While I'd like encryption to work, I need the unencrypted system we're now using to be reliable. I'd rather not use encryption than have any risk to unencrypted operation.

rincebrain commented 2 years ago

Anything that got merged would not be at the cost of stability in general, no - even if I wanted to, I would be shouted down by any number of people.

Anything I'm personally posting here for people to try, the best guarantee I've got is "I've run it on my testbeds for a bit and couldn't break it beyond any ways I documented."

Interestingly, your log mentions an l2arc thread backtrace, but you did not mention L2ARC in your list of devices in the pool. I don't think I'd expect that thread to be active if there are no L2ARC devices present, but I admit I've never checked.

More generally, my guess would be that you hit another route to the same sort of stupid failure mode as the other panic that I've not seen before, in which you end up with a reference to something which shouldn't be accessible, and it changes under you halfway through. Ew.

clhedrick commented 2 years ago

I don’t have l2arc in the new configuration. I think I tried it before to see if it would help. It didn’t. Our arc hit rate is virtually 100%.


scratchings commented 2 years ago

Hi,

Sorry to report that https://github.com/openzfs/zfs/commit/92cc3d2a7503f87c93769d2cf817089a1971213d hasn't solved our issues. It's running on backupsvr1 and on its mirror backupsvr2. Yesterday, after seeing completely repeatable lockups when receiving (on backupsvr2) into a dataset listed as corrupt on backupsvr2, I destroyed all datasets listed as bad and then started a manual recursive send from backupsvr1 to backupsvr2. This involved the fresh transfer of a 3.5TB dataset which had been removed from svr2. Around 4 am today, after transferring 3.25TB, backupsvr1 (the sender) started displaying stuck zfs process messages, and shortly after the SSH connection was dropped. There are no syslog messages associated with this hung zfs process and no OOM killer events; it just hung. I was able to attempt a login at the local console, but once the password was provided it hung too. All I see is:

Message from syslogd@backupsvr1 at Jul 15 04:24:36 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [zfs:78499]

repeated every 30s for 30 minutes and then:

client_loop: send disconnect: Broken pipe

as the SSH connection fails.

OK, further information: it is almost certainly not send that was at fault. Although I have various locks and process checks to avoid multiple syncoid receives at the same time, I don't prevent a receive from running concurrently with a send. The logs do show that a receive started running at the time the stuck-zfs messages started. The dataset being sent was not the dataset being received at the time.

The restarted machine has now almost immediately crashed... an OOM Killer event, but I hadn't had a chance to start the monitoring of ZFS memory usage as it had only been up a few minutes.

SSH session shows:


Message from syslogd@backupsvr1 at Jul 15 09:32:56 ...
 kernel:VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@backupsvr1 at Jul 15 09:32:56 ...
 kernel:PANIC at dmu_recv.c:1806:receive_object()

Message from syslogd@backupsvr1 at Jul 15 09:32:56 ...
 kernel:Kernel panic - not syncing: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

And this is repeatable if I start the same receive process. I've seen this before; sometimes I can recover the situation by avoiding a resumable receive, sometimes I have to remove the last snapshot on the receiving end...


Resuming interrupted zfs send/receive from vm_pool/dataset to backup_pool/vm_pool/dataset (~ 9.6 MB remaining):
   1MiB 0:00:01 [ 955kiB/s] [=================>                                                                                                                                                                        ] 10% ETA 0:00:08
Message from syslogd@backupsvr1 at Jul 15 09:42:07 ...
 kernel:VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@backupsvr1 at Jul 15 09:42:07 ...
 kernel:PANIC at dmu_recv.c:1806:receive_object()

Message from syslogd@backupsvr1 at Jul 15 09:42:07 ...
 kernel:Kernel panic - not syncing: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
6.22MiB 0:00:01 [4.96MiB/s] [======================================================================================================================>                                                                   ] 64%

And with no-resume...


# /usr/local/bin/syncoid -r --no-sync-snap --create-bookmark --no-resume --recvoptions="u" backupuser@sourceserver:vm_pool/dataset backup_pool/vm_pool/dataset
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
INFO: no snapshots on source newer than autosnap_2022-07-15_08:30:02_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
INFO: no snapshots on source newer than autosnap_2022-07-15_08:30:02_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
INFO: no snapshots on source newer than autosnap_2022-07-15_08:30:02_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
INFO: no snapshots on source newer than autosnap_2022-07-15_08:30:02_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
INFO: no snapshots on source newer than autosnap_2022-07-15_08:30:02_hourly on target. Nothing to do, not syncing.
NEWEST SNAPSHOT: autosnap_2022-07-15_08:30:02_hourly
Sending incremental vm_pool/dataset@autosnap_2022-07-15_01:30:02_hourly ... autosnap_2022-07-15_08:30:02_hourly (~ 60.3 MB):
cannot receive incremental stream: destination backup_pool/vm_pool/dataset contains partially-complete state from "zfs receive -s".
   1MiB 0:00:00 [1.71MiB/s] [>                                                                                                                                                                                         ]  1%            
mbuffer: error: outputThread: error writing to <stdout> at offset 0x84000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
WARN: resetting partially receive state
   1MiB 0:00:01 [ 948kiB/s] [>                                                                                                                                                                                         ]  1% ETA 0:00:59
Message from syslogd@backupsvr1 at Jul 15 09:47:33 ...
 kernel:VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@backupsvr1 at Jul 15 09:47:33 ...
 kernel:PANIC at dmu_recv.c:1806:receive_object()

Message from syslogd@backupsvr1 at Jul 15 09:47:33 ...
 kernel:Kernel panic - not syncing: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Then if I delete the snapshot autosnap_2022-07-15_01:30:02_hourly from backupsvr1 the same syncoid run completes without crashing the server.
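
For reference, the two recovery approaches described above look roughly like this in plain zfs commands (syncoid's "resetting partially receive state" is effectively the first one; dataset and snapshot names are placeholders):

zfs receive -A backup_pool/vm_pool/dataset                    # abandon the partially-received resumable state
zfs destroy backup_pool/vm_pool/dataset@autosnap_...          # or drop the last snapshot on the receiving side and re-sync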

Attached is the memory info I have for the period post 4am today, sadly lacking the OOM event. zfs.log.gz

rincebrain commented 2 years ago

@clhedrick right, I meant to reply asking you to open a new issue about this and forgot - because it might be the same issue and then we just consider it a dupe and move on, but the backtrace doesn't look the same, so I'd personally suggest treating it as independent until proven otherwise.

@scratchings how unfortunate. I'm continuing to try to investigate it more usefully than quick workarounds, but I only have so much spare time to burn at the moment.

xofyarg commented 2 years ago

I'm having a similar issue while applying a workaround for #13709, to get the user accounting MAC recalculated. Basically I was incrementally send|recv'ing datasets with snapshots locally under the same encryption root, based on zfs-2.1.5. I don't remember ever seeing this kind of panic in 2.0.1. I've got it twice already. Not sure if the following stack traces will be helpful; sorry if they aren't.

[21521.201394] BUG: kernel NULL pointer dereference, address: 0000000000000000
[21521.538780] #PF: supervisor read access in kernel mode
[21521.787641] #PF: error_code(0x0000) - not-present page
[21522.036505] PGD 0 P4D 0
[21522.159007] Oops: 0000 [#1] PREEMPT SMP PTI
[21522.361420] CPU: 1 PID: 1166 Comm: receive_writer Tainted: P           OE     5.18.0-2-amd64 #1  Debian 5.18.5-1
[21522.854838] Hardware name: <sanitized>
[21523.196386] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[21523.458231] Code: 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 21 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[21524.369211] RSP: 0018:ffffab91007b79d0 EFLAGS: 00010246
[21524.622153] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[21524.968487] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[21525.314546] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[21525.660241] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
[21526.006488] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[21526.352304] FS:  0000000000000000(0000) GS:ffff908a82e40000(0000) knlGS:0000000000000000
[21526.744655] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21527.023116] CR2: 0000000000000000 CR3: 000000011ddac003 CR4: 00000000001706e0
[21527.368967] Call Trace:
[21527.487014]  <TASK>
[21527.588257]  zio_crypt_copy_dnode_bonus+0x2e/0x120 [zfs]
[21527.845882]  arc_buf_fill+0x3f1/0xcd0 [zfs]
[21528.048431]  arc_untransform+0x1e/0x80 [zfs]
[21528.255190]  dbuf_read_verify_dnode_crypt+0xee/0x160 [zfs]
[21528.520829]  ? __cv_init+0x3d/0x60 [spl]
[21528.710640]  dbuf_read_impl.constprop.0+0x2a8/0x6a0 [zfs]
[21528.972489]  ? _raw_spin_unlock+0x15/0x30
[21529.166362]  ? dbuf_create+0x40c/0x5d0 [zfs]
[21529.373079]  ? dbuf_find+0x81/0x1d0 [zfs]
[21529.567135]  dbuf_read+0xe5/0x5f0 [zfs]
[21529.752780]  dmu_tx_check_ioerr+0x64/0xd0 [zfs]
[21529.972179]  dmu_tx_hold_free_impl+0x12f/0x240 [zfs]
[21530.212740]  dmu_free_long_range+0x24b/0x4d0 [zfs]
[21530.444640]  dmu_free_long_object+0x22/0xc0 [zfs]
[21530.672311]  receive_freeobjects+0x82/0xf0 [zfs]
[21530.896114]  receive_writer_thread+0x571/0xae0 [zfs]
[21531.136677]  ? set_next_task_fair+0x2d/0xd0
[21531.339525]  ? receive_process_write_record+0x190/0x190 [zfs]
[21531.618210]  ? receive_process_write_record+0x190/0x190 [zfs]
[21531.897164]  ? __thread_exit+0x20/0x20 [spl]
[21532.104250]  ? thread_generic_wrapper+0x59/0x60 [spl]
[21532.349137]  thread_generic_wrapper+0x59/0x60 [spl]
[21532.585701]  kthread+0xe8/0x110
[21532.738137]  ? kthread_complete_and_exit+0x20/0x20
[21532.970485]  ret_from_fork+0x22/0x30
[21533.143632]  </TASK>
[20615.365019] BUG: kernel NULL pointer dereference, address: 0000000000000000
[20615.698314] #PF: supervisor read access in kernel mode
[20615.947541] #PF: error_code(0x0000) - not-present page
[20616.196490] PGD 0 P4D 0
[20616.318900] Oops: 0000 [#1] PREEMPT SMP PTI
[20616.521202] CPU: 1 PID: 687 Comm: receive_writer Tainted: P           OE     5.18.0-2-amd64 #1  Debian 5.18.5-1
[20617.010442] Hardware name:  <sanitized>
[20617.352339] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[20617.613832] Code: 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 21 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[20618.525425] RSP: 0018:ffffa0590071b9d0 EFLAGS: 00010246
[20618.778338] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[20619.124214] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[20619.470316] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[20619.816096] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
[20620.161822] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[20620.508041] FS:  0000000000000000(0000) GS:ffff94c142e40000(0000) knlGS:0000000000000000
[20620.900254] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20621.178831] CR2: 0000000000000000 CR3: 000000013a2d6006 CR4: 00000000001706e0
[20621.524758] Call Trace:
[20621.642838]  <TASK>
[20621.744046]  zio_crypt_copy_dnode_bonus+0x2e/0x120 [zfs]
[20622.001424]  arc_buf_fill+0x3f1/0xcd0 [zfs]
[20622.204190]  arc_untransform+0x1e/0x80 [zfs]
[20622.410897]  dbuf_read_verify_dnode_crypt+0xee/0x160 [zfs]
[20622.676641]  ? __cv_init+0x3d/0x60 [spl]
[20622.866394]  dbuf_read_impl.constprop.0+0x2a8/0x6a0 [zfs]
[20623.127813]  ? _raw_spin_unlock+0x15/0x30
[20623.321945]  ? dbuf_create+0x40c/0x5d0 [zfs]
[20623.528657]  ? dbuf_find+0x81/0x1d0 [zfs]
[20623.722696]  dbuf_read+0xe5/0x5f0 [zfs]
[20623.908243]  dmu_tx_check_ioerr+0x64/0xd0 [zfs]
[20624.127626]  dmu_tx_hold_free_impl+0x12f/0x240 [zfs]
[20624.368741]  dmu_free_long_range+0x24b/0x4d0 [zfs]
[20624.600862]  dmu_free_long_object+0x22/0xc0 [zfs]
[20624.828694]  receive_freeobjects+0x82/0xf0 [zfs]
[20625.052169]  receive_writer_thread+0x571/0xae0 [zfs]
[20625.292944]  ? set_next_task_fair+0x2d/0xd0
[20625.495708]  ? receive_process_write_record+0x190/0x190 [zfs]
[20625.774129]  ? receive_process_write_record+0x190/0x190 [zfs]
[20626.052886]  ? __thread_exit+0x20/0x20 [spl]
[20626.260954]  ? thread_generic_wrapper+0x59/0x60 [spl]
[20626.505980]  thread_generic_wrapper+0x59/0x60 [spl]
[20626.742593]  kthread+0xe8/0x110
[20626.894721]  ? kthread_complete_and_exit+0x20/0x20
[20627.126884]  ret_from_fork+0x22/0x30
[20627.300299]  </TASK>
PaulZ-98 commented 2 years ago

Late to the party here, but I do have an observation / question. It seems to me that in dbuf_read_verify_dnode_crypt the variable dnode_abuf, which gets passed down to arc_untransform and the rest, is having its contents set to NULL sometime after being tested for NULL in if (dnode_abuf == NULL || .... Is it safe to reference dn_dbuf without any synchronization? The dbuf we're reading should be fine, but we're constantly poking dn_dbuf on each read, without any guarantees.

Maybe this function needs some locking around reading dn->dn_dbuf - something like rw_enter(&dn->dn_dbuf->db_rwlock, RW_READER) once it's determined that dn_dbuf is not NULL. Does this make sense?
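To make the shape of that suggestion concrete, here is a minimal user-space sketch (plain C with pthreads, not ZFS code; the struct and function names are invented for illustration) of the pattern being proposed: test the pointer and dereference it under the same reader lock, so a concurrent writer cannot tear it down in between. In the actual code the primitives would be rw_enter()/rw_exit() on db_rwlock, as described above.

```c
/*
 * Minimal user-space illustration (not ZFS code) of the proposed pattern:
 * hold a reader lock across both the NULL test and every dereference of a
 * pointer that another thread may tear down.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct buf {                      /* stand-in for the dnode's bonus dbuf */
    int data;
};

struct node {                     /* stand-in for the dnode              */
    pthread_rwlock_t lock;        /* stand-in for db_rwlock              */
    struct buf *dbuf;             /* stand-in for dn_dbuf; may go NULL   */
};

/* Reader: test and use dbuf under the same reader lock. */
static int
read_node(struct node *n)
{
    int val = -1;

    pthread_rwlock_rdlock(&n->lock);
    if (n->dbuf != NULL)
        val = n->dbuf->data;
    pthread_rwlock_unlock(&n->lock);
    return (val);
}

/* Writer: tears the buffer down under the write lock. */
static void
clear_node(struct node *n)
{
    pthread_rwlock_wrlock(&n->lock);
    free(n->dbuf);
    n->dbuf = NULL;
    pthread_rwlock_unlock(&n->lock);
}

int
main(void)
{
    struct node n = { .dbuf = malloc(sizeof (struct buf)) };

    pthread_rwlock_init(&n.lock, NULL);
    n.dbuf->data = 42;
    printf("before clear: %d\n", read_node(&n));  /* 42 */
    clear_node(&n);
    printf("after clear:  %d\n", read_node(&n));  /* -1, no crash */
    pthread_rwlock_destroy(&n.lock);
    return (0);
}
```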

rincebrain commented 2 years ago

@PaulZ-98 I believe the current thinking is that the buffer involved in these crashes is sometimes going NULL because it should not have been discoverable at all: it is refcounted wrong somewhere, so some code path thinks it is unreferenced, and is dead wrong.

You can feel free to try adding locking to solve this, but I tried that a couple of times and couldn't find a big enough hammer... and even if locking does avoid this crash, if incorrect refcounting is the underlying problem, it will blow up elsewhere too.

zrav commented 1 year ago

And a different stack trace for what looks like the same issue to me. Sending a specific encrypted dataset with -w -p immediately and reproducibly throws:

[471437.877758] BUG: kernel NULL pointer dereference, address: 0000000000000030
[471437.878469] #PF: supervisor read access in kernel mode
[471437.879051] #PF: error_code(0x0000) - not-present page
[471437.879586] PGD 0 P4D 0
[471437.880113] Oops: 0000 [#5] SMP NOPTI
[471437.880635] CPU: 4 PID: 3829437 Comm: zfs Tainted: P      D    O      5.15.0-30-generic #31-Ubuntu
[471437.881162] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570D4U, BIOS P1.20 05/19/2021
[471437.881693] RIP: 0010:dmu_dump_write+0x3bf/0x520 [zfs]
[471437.882288] Code: b4 0f 85 a7 00 00 00 41 c7 44 24 54 00 00 00 00 49 8b 0c 24 e9 a7 fc ff ff 31 c0 45 39 ce 0f 95 c0 44 09 c0 0f 84 7d fd ff ff <4c> 8b 3c 25 30 00 00 00 45 85 c0 0f 85 0f fd ff ff 49 c1 ef 20 49
[471437.883399] RSP: 0018:ffffa391a23af6d8 EFLAGS: 00010206
[471437.883947] RAX: 0000000001000000 RBX: ffff958fd3cb9a00 RCX: 0000000000000000
[471437.884498] RDX: 0000000000000009 RSI: 0000000000000013 RDI: ffff958fd3cb9b38
[471437.885055] RBP: ffffa391a23af728 R08: 0000000001000000 R09: 0000000000020000
[471437.885593] R10: 0000000000000009 R11: 0000000000020000 R12: ffffa391a23af868
[471437.886127] R13: 0000000000000000 R14: 0000000000020000 R15: 0000000000000000
[471437.886657] FS:  00007f9558e647c0(0000) GS:ffff95977eb00000(0000) knlGS:0000000000000000
[471437.887193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[471437.887727] CR2: 0000000000000030 CR3: 000000064a1fe000 CR4: 0000000000350ee0
[471437.888271] Call Trace:
[471437.888804]  <TASK>
[471437.889322]  ? wait_woken+0x70/0x70
[471437.889838]  do_dump+0x374/0x510 [zfs]
[471437.890396]  ? issig+0x34/0xc0 [spl]
[471437.890907]  ? spl_kmem_free+0xe/0x10 [spl]
[471437.891410]  ? range_free+0x31/0x130 [zfs]
[471437.891959]  dmu_send_impl+0x6e7/0xbf0 [zfs]
[471437.892493]  ? dbuf_rele+0x39/0x50 [zfs]
[471437.893030]  dmu_send+0x4f8/0x840 [zfs]
[471437.893562]  ? __kmalloc_node+0x166/0x3a0
[471437.894025]  ? i_get_value_size+0x1e/0x1c0 [znvpair]
[471437.894494]  ? nvpair_value_common+0x9a/0x160 [znvpair]
[471437.894958]  zfs_ioc_send_new+0x167/0x1d0 [zfs]
[471437.895486]  ? dump_bytes_cb+0x30/0x30 [zfs]
[471437.896012]  zfsdev_ioctl_common+0x285/0x740 [zfs]
[471437.896533]  ? _copy_from_user+0x2e/0x60
[471437.896987]  zfsdev_ioctl+0x57/0xe0 [zfs]
[471437.897492]  __x64_sys_ioctl+0x91/0xc0
[471437.897932]  do_syscall_64+0x5c/0xc0
[471437.898361]  ? exit_to_user_mode_prepare+0x37/0xb0
[471437.898791]  ? syscall_exit_to_user_mode+0x27/0x50
[471437.899214]  ? __x64_sys_write+0x19/0x20
[471437.899632]  ? do_syscall_64+0x69/0xc0
[471437.900047]  ? do_user_addr_fault+0x1e3/0x670
[471437.900448]  ? syscall_exit_to_user_mode+0x27/0x50
[471437.900852]  ? exit_to_user_mode_prepare+0x37/0xb0
[471437.901250]  ? irqentry_exit_to_user_mode+0x9/0x20
[471437.901642]  ? irqentry_exit+0x19/0x30
[471437.902028]  ? exc_page_fault+0x89/0x160
[471437.902405]  ? asm_exc_page_fault+0x8/0x30
[471437.902785]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[471437.903167] RIP: 0033:0x7f95596e2aff
[471437.903544] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[471437.904332] RSP: 002b:00007ffd7cc5ac90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[471437.904725] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f95596e2aff
[471437.905107] RDX: 00007ffd7cc5ad10 RSI: 0000000000005a40 RDI: 0000000000000005
[471437.905478] RBP: 00007ffd7cc5e300 R08: 0000000000000000 R09: 000055a58adc59a0
[471437.905846] R10: 000055a58adc2a00 R11: 0000000000000246 R12: 0000000000005a40
[471437.906205] R13: 00007ffd7cc5ad10 R14: 0000000000005a40 R15: 000055a58adc59a0
[471437.906563]  </TASK>
[471437.906920] Modules linked in: cpuid vhost_net tap unix_diag binfmt_misc veth tls shiftfs vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink nvme_fabrics bridge ipmi_ssif nls_iso8859_1 zfs(PO) intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd snd_hda_intel zunicode(PO) snd_intel_dspcfg zzstd(O) snd_intel_sdw_acpi zlua(O) kvm_amd snd_hda_codec kvm snd_hda_core snd_hwdep zcommon(PO) znvpair(PO) rapl wmi_bmof efi_pstore snd_pcm zavl(PO) icp(PO) k10temp snd_timer spl(O) snd soundcore ccp joydev input_leds acpi_ipmi ipmi_si mac_hid sch_fq_codel 8021q garp mrp dm_multipath scsi_dh_rdac stp scsi_dh_emc llc scsi_dh_alua nfsd ipmi_devintf ipmi_msghandler auth_rpcgss nfs_acl lockd grace msr sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
[471437.906959]  async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper cdc_ether syscopyarea sysfillrect usbnet sysimgblt usbhid fb_sys_fops hid mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cec igb mpt3sas rc_core aesni_intel nvme crypto_simd dca ahci raid_class xhci_pci cryptd drm scsi_transport_sas i2c_piix4 libahci nvme_core xhci_pci_renesas i2c_algo_bit wmi [last unloaded: cpuid]
[471437.912347] CR2: 0000000000000030
[471437.912875] ---[ end trace dc4b1daeff6b21e2 ]---
[471438.080797] RIP: 0010:dmu_dump_write+0x3bf/0x520 [zfs]
[471438.081442] Code: b4 0f 85 a7 00 00 00 41 c7 44 24 54 00 00 00 00 49 8b 0c 24 e9 a7 fc ff ff 31 c0 45 39 ce 0f 95 c0 44 09 c0 0f 84 7d fd ff ff <4c> 8b 3c 25 30 00 00 00 45 85 c0 0f 85 0f fd ff ff 49 c1 ef 20 49
[471438.082575] RSP: 0018:ffffa391c9d276e0 EFLAGS: 00010206
[471438.083148] RAX: 0000000001000000 RBX: ffff958c8695f600 RCX: 0000000000000000
[471438.083725] RDX: 0000000000000009 RSI: 0000000000000013 RDI: ffff958c8695f738
[471438.084303] RBP: ffffa391c9d27730 R08: 0000000001000000 R09: 0000000000020000
[471438.084884] R10: 0000000000000009 R11: 0000000000020000 R12: ffffa391c9d27870
[471438.085461] R13: 0000000000000000 R14: 0000000000020000 R15: 0000000000000000
[471438.086043] FS:  00007f9558e647c0(0000) GS:ffff95977eb00000(0000) knlGS:0000000000000000
[471438.086628] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[471438.087215] CR2: 0000000000000030 CR3: 000000064a1fe000 CR4: 0000000000350ee0

ryao commented 1 year ago

@zrav I think I understand what went wrong here. The exact mechanism differs slightly depending on whether ZFS was built with assertions enabled.

If ZFS was built with assertions enabled, this NULL pointer dereference should come from ASSERT(!BP_IS_EMBEDDED(bp)). BP_IS_EMBEDDED(bp) reads (bp)->blk_prop, so if bp is NULL the read touches address 0x0000000000000030, which is exactly the faulting address in your backtrace.

If ZFS was not built with assertions enabled, the NULL pointer dereference should come from BP_SHOULD_BYTESWAP(bp), which reads (bp)->blk_prop just as BP_IS_EMBEDDED(bp) does. :/
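To see why the faulting address is exactly 0x30: blkptr_t begins with three 16-byte DVAs, so blk_prop lives at byte offset 3 * 16 = 48 = 0x30. A small user-space sketch (a trimmed-down stand-in that only reproduces the leading fields, not the full on-disk structure) makes the arithmetic concrete:

```c
/*
 * Sketch of why a NULL bp faults at address 0x30: blk_prop follows the
 * three 128-bit DVAs, so it sits at offset 0x30 within blkptr_t.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct dva {
    uint64_t dva_word[2];         /* 16 bytes per DVA                    */
} dva_t;

typedef struct blkptr {
    dva_t    blk_dva[3];          /* offset 0x00: 48 bytes of DVAs       */
    uint64_t blk_prop;            /* offset 0x30: read by BP_IS_EMBEDDED */
    /* remaining blkptr_t fields omitted; they do not affect the offset */
} blkptr_t;

int
main(void)
{
    /*
     * Prints 0x30.  With bp == NULL, reading bp->blk_prop (which both
     * BP_IS_EMBEDDED() and BP_SHOULD_BYTESWAP() do) touches address
     * 0x30, matching "kernel NULL pointer dereference, address: ...30".
     */
    printf("offsetof(blkptr_t, blk_prop) = 0x%zx\n",
        offsetof(blkptr_t, blk_prop));
    return (0);
}
```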

The issue is that dmu_dump_write() is allowed to be given a NULL bp (under circumstances that I do not currently understand). If assertions are enabled and the stream is either raw or compressed when this happens, boom. If assertions are not enabled, then things only go wrong when the stream is raw.

Clang's static analyzer reports this issue in dmu_dump_write(). I found your backtrace while searching to see whether anyone had reported hitting it, and by happy coincidence it seems to explain what happened on your machine. That said, your issue is different from the one originally reported here. It looks like your crash can be fixed by working on this some more, but I do not expect that fix to apply to the original report.

zrav commented 1 year ago

@ryao We are using the Ubuntu packages and I'm not sure whether they are compiled with or without assertions. Also we have (unexpectedly) not hit this again so I wonder if some update removed a precondition for this issue, as Ubuntu doesn't seem to have submitted a workaround for the issue either: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940083

ryao commented 1 year ago

> @ryao We are using the Ubuntu packages and I'm not sure whether they are compiled with or without assertions. Also we have (unexpectedly) not hit this again so I wonder if some update removed a precondition for this issue, as Ubuntu doesn't seem to have submitted a workaround for the issue either: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940083

In that case, you do not have a build with assertions enabled.

rincebrain commented 1 year ago

For anyone who's not tired of experimental workarounds from me yet, you could try this branch, specifically with 654d10a25e910a8595cc22bea88207e2f6d0fcab. (The tunable just turns the workaround off in case you want to experiment or it somehow breaks you; it defaults to applying.)

As with other things, I haven't been able to deliberately trigger this on my testbed with the patch applied, while with the workaround turned off it breaks pretty reliably. So feel free to give it a shot, if you like.

siilike commented 1 year ago

I potentially just ran into the same issue under zfs-2.1.7-1~bpo11+1:

[3116075.380853] BUG: kernel NULL pointer dereference, address: 0000000000000000
[3116075.380911] #PF: supervisor read access in kernel mode
[3116075.380956] #PF: error_code(0x0000) - not-present page
[3116075.381002] PGD 24a782067 P4D 24a782067 PUD 1543c6067 PMD 0 
[3116075.381052] Oops: 0000 [#1] SMP NOPTI
[3116075.381099] CPU: 0 PID: 2322703 Comm: receive_writer Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[3116075.381154] Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5603 10/14/2020
[3116075.381299] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[3116075.381351] Code: 45 42 11 00 0f 1f 44 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 25 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[3116075.381417] RSP: 0018:ffffab6c00eff9b0 EFLAGS: 00010246
[3116075.381464] RAX: 0000000000000000 RBX: ffff9b8d00588c00 RCX: 0000000000000000
[3116075.381513] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[3116075.381561] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[3116075.381609] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010
[3116075.381661] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[3116075.381714] FS:  0000000000000000(0000) GS:ffff9b8d9e600000(0000) knlGS:0000000000000000
[3116075.381761] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3116075.381802] CR2: 0000000000000000 CR3: 00000002b843a000 CR4: 00000000001506f0
[3116075.381846] Call Trace:
[3116075.381987]  zio_crypt_copy_dnode_bonus+0x2e/0x120 [zfs]
[3116075.382075]  arc_buf_fill+0x3f9/0xce0 [zfs]
[3116075.382155]  arc_untransform+0x1d/0x80 [zfs]
[3116075.382235]  dbuf_read_verify_dnode_crypt+0xf2/0x160 [zfs]
[3116075.382328]  dbuf_read_impl.constprop.0+0x2c4/0x6e0 [zfs]
[3116075.382379]  ? _cond_resched+0x16/0x50
[3116075.382482]  ? dbuf_create+0x43c/0x610 [zfs]
[3116075.382566]  dbuf_read+0xe2/0x5d0 [zfs]
[3116075.382652]  dmu_tx_check_ioerr+0x64/0xd0 [zfs]
[3116075.382735]  dmu_tx_hold_free_impl+0x12f/0x250 [zfs]
[3116075.382827]  dmu_free_long_range+0x242/0x4d0 [zfs]
[3116075.382904]  dmu_free_long_object+0x22/0xd0 [zfs]
[3116075.382979]  receive_freeobjects+0x82/0x100 [zfs]
[3116075.383066]  receive_writer_thread+0x565/0xad0 [zfs]
[3116075.383117]  ? thread_generic_wrapper+0x62/0x80 [spl]
[3116075.383161]  ? kfree+0x410/0x490
[3116075.383251]  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[3116075.383298]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[3116075.383340]  thread_generic_wrapper+0x6f/0x80 [spl]
[3116075.383383]  ? __thread_exit+0x20/0x20 [spl]
[3116075.383426]  kthread+0x11b/0x140
[3116075.383467]  ? __kthread_bind_mask+0x60/0x60
[3116075.383509]  ret_from_fork+0x22/0x30
[3116075.383547] Modules linked in: nft_masq nft_fib_ipv4 nft_fib nft_chain_nat nf_nat nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_limit nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables libcrc32c crc32c_generic nfnetlink wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha usblp rpcsec_gss_krb5 binfmt_misc ipmi_devintf ipmi_msghandler amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi edac_mce_amd kvm_amd ccp rng_core kvm snd_hda_intel snd_intel_dspcfg snd_usb_audio soundwire_intel soundwire_generic_allocation snd_soc_core snd_compress soundwire_cadence snd_hda_codec snd_usbmidi_lib snd_rawmidi snd_hda_core eeepc_wmi snd_seq_device snd_hwdep irqbypass gpu_sched asus_wmi ttm soundwire_bus mc sp5100_tco battery sparse_keymap wmi_bmof rfkill k10temp pcspkr fam15h_power watchdog snd_pcm drm_kms_helper snd_timer snd cec soundcore i2c_algo_bit
[3116075.383599]  evdev button acpi_cpufreq sg nfsd parport_pc ppdev lp auth_rpcgss parport drm nfs_acl lockd grace fuse configfs sunrpc ip_tables x_tables autofs4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_crypt dm_mod crc32_pclmul crc32c_intel hid_generic sd_mod uas usbhid ghash_clmulni_intel hid usb_storage ahci libahci libata mpt3sas aesni_intel nvme raid_class r8169 realtek scsi_transport_sas libaes crypto_simd mdio_devres cryptd glue_helper scsi_mod i2c_piix4 nvme_core t10_pi crc_t10dif crct10dif_generic xhci_pci xhci_hcd crct10dif_pclmul crct10dif_common libphy usbcore usb_common wmi video gpio_amdpt gpio_generic
[3116075.383936] CR2: 0000000000000000
[3116075.383976] ---[ end trace d861dc1b362f1101 ]---
[3116075.384094] RIP: 0010:abd_borrow_buf_copy+0x21/0x90 [zfs]
[3116075.384136] Code: 45 42 11 00 0f 1f 44 00 00 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <f6> 07 01 74 25 4c 8b 6f 48 48 8b 44 24 08 65 48 2b 04 25 28 00 00
[3116075.384198] RSP: 0018:ffffab6c00eff9b0 EFLAGS: 00010246
[3116075.384238] RAX: 0000000000000000 RBX: ffff9b8d00588c00 RCX: 0000000000000000
[3116075.384282] RDX: 0000000000004000 RSI: 0000000000004000 RDI: 0000000000000000
[3116075.384329] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[3116075.384373] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010
[3116075.384417] R13: 0000000000004000 R14: 0000000000000000 R15: 0000000000000020
[3116075.384463] FS:  0000000000000000(0000) GS:ffff9b8d9e600000(0000) knlGS:0000000000000000
[3116075.384508] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3116075.384551] CR2: 0000000000000000 CR3: 00000002b843a000 CR4: 00000000001506f0
rincebrain commented 1 year ago

Looks like it, yeah. At the moment my best suggestion would be to try the patch I linked in the previous comment - at worst it should never be harmful; it just might, in some case I haven't spotted, fail to help.