permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload

aerusso commented 3 years ago

System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	testing (bullseye)
Linux Kernel	5.10.19
Architecture	amd64
ZFS Version	2.0.3-1

Describe the problem you're observing

After upgrading to zfs 2.0.3 and Linux 5.10.19 (from 0.8.6), a well-tested syncoid workload causes "Permanent errors have been detected in the following files:" reports for a pool/dataset@snapshot:<0x0> (no file given).

Removing the snapshot, and running a scrub causes the error to go away.

This is on a single-disk nvme SSD---never experiencing any problems before upgrading---and has happened twice, once after each reboot/re-running of the syncoid workload. I have since booted back into 0.8.6, ran the same workload, and have not experienced the error report.

Describe how to reproduce the problem

Beyond the above, I do not have a mechanism to reproduce this. I'd rather not blindly do it again!

Include any warning/errors/backtraces from the system logs

See also @jstenback's report of very similar symptoms: 1 and 2, which appear distinct from the symptoms of the bug report they are in. Additionally, compare to @rbrewer123's reports 1 and 2, which comes with a kernel panic---I do not experience this.

My setup is very similar: I run a snapshot workload periodically, and transfer the snapshots every day to another machine. I also transfer snapshots much more frequently to another pool on the same machine.

If valuable, I have zpool history output that I can provide. Roughly, the workload looks like many snapshot, send -I, destroy (on one pool) and receive (on the same machine, but another pool ).

aerusso commented 3 years ago

@behlendorf I'm trying to dig into this a little bit further. I want to rule out in-flight corruption of snapshot data, so I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error---and then actually read out the corrupted data. ~~Is there a better way than just putting a printk into spa_log_error (after avl_inser(tree, new, where); that should get called exactly once per affected block, right?).~~ I see that ZED will print this out. My only reproducer is my main desktop, so I want to be very careful.

EDIT: Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

Also, is this approach reasonable? I would think it would be helpful to know if the affected block is the meta_dnode, or a root block, etc., right? Or am I embarking on a wild goose chase?

ahrens commented 3 years ago

I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error

zpool events -v should also have this

Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

No, zb_blkid is the logical block ID, which is unrelated to the location on disk. You need the DVA's offset, which is also included in the zpool events -v output.

jstenback commented 3 years ago

For the record, I'm still seeing the same behavior that I reported in the issue @aerusso linked above. And I just for the first time since this started (right after updating to 2.0) saw a kernel panic that left my zfs filesystems unresponsive. Here's what I found in dmesg at that time:

[380231.914245] #PF: supervisor write access in kernel mode
[380231.914310] #PF: error_code(0x0002) - not-present page
[380231.914373] PGD 0 P4D 0 
[380231.914414] Oops: 0002 [#1] SMP PTI
[380231.914462] CPU: 3 PID: 1741442 Comm: zpool Tainted: P           OE     5.10.23-1-lts #1
[380231.914569] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 2.0 12/18/2015
[380231.914664] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.914687] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.914762] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.914786] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.914818] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.914848] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.914879] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.914910] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.914942] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.914976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.915002] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.915032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.915063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[380231.915094] Call Trace:
[380231.915161]  zfs_sa_setup.constprop.0+0x6a/0x90 [zfs]
[380231.915233]  zfs_obj_to_path+0x50/0xe0 [zfs]
[380231.915291]  ? dmu_objset_hold_flags+0x95/0xe0 [zfs]
[380231.915361]  zfs_ioc_obj_to_path+0x86/0xf0 [zfs]
[380231.915387]  ? strlcpy+0x2d/0x40
[380231.915452]  zfsdev_ioctl_common+0x71c/0x880 [zfs]
[380231.915525]  zfsdev_ioctl+0x53/0xe0 [zfs]
[380231.915550]  __x64_sys_ioctl+0x83/0xb0
[380231.915567]  do_syscall_64+0x33/0x40
[380231.915581]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[380231.915598] RIP: 0033:0x7fcadd568e6b
[380231.915612] Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
[380231.915660] RSP: 002b:00007ffcdb0e43a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[380231.915682] RAX: ffffffffffffffda RBX: 000055e10285a560 RCX: 00007fcadd568e6b
[380231.915702] RDX: 00007ffcdb0e43e0 RSI: 0000000000005a25 RDI: 0000000000000003
[380231.915723] RBP: 00007ffcdb0e7ad0 R08: 0000000000000000 R09: 0000000000000000
[380231.915742] R10: 00007fcadd5e6ac0 R11: 0000000000000246 R12: 00007ffcdb0e7990
[380231.915762] R13: 00007ffcdb0e43e0 R14: 000055e102873440 R15: 0000000000002000
[380231.915783] Modules linked in: rpcsec_gss_krb5 xt_nat veth xt_MASQUERADE br_netfilter bridge stp llc target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod xt_recent ipt_REJECT nf_reject_ipv4 xt_multiport xt_comment xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c vboxnetflt(OE) vboxnetadp(OE) iptable_filter vboxdrv(OE) joydev mousedev usbhid intel_rapl_msr ipmi_ssif intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp
[380231.915824]  coretemp iTCO_wdt kvm_intel intel_pmc_bxt iTCO_vendor_support kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate intel_uncore pcspkr ixgbe igb mdio_devres i2c_i801 libphy mei_me i2c_smbus ast mdio mei i2c_algo_bit ioatdma lpc_ich dca wmi mac_hid acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) vboxvideo drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec syscopyarea sysfillrect sysimgblt fb_sys_fops vboxsf vboxguest crypto_user fuse drm agpgart nfsd auth_rpcgss nfs_acl lockd grace sunrpc nfs_ssc bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 mpt3sas crc32c_intel raid_class scsi_transport_sas xhci_pci xhci_pci_renesas [last unloaded: vboxdrv]
[380231.924269] CR2: 0000000000000028
[380231.925010] ---[ end trace 67d499cf5193f033 ]---
[380231.984744] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.986002] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.988429] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.989326] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.990116] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.990890] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.991657] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.992419] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.993182] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.993959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.994731] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.995509] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.996285] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Unfortunately I'm not running a debug build here, so that stack is only of so much value, but wanted to share nonetheless in case it provides any insight into this issue.

jstenback commented 3 years ago

For the record, I just got another kernel crash dump with the same exact behavior and stack trace in dmesg as reported in my previous comment. The dmesg this time (maybe last time too, though that didn't make it into my earlier report) states that it's a kernel NULL pointer dereference.

ahrens commented 3 years ago

@jstenback I don't see how the problem you're describing is related to this issue. @aerusso is experiencing unexpected checksum errors, and you have a null pointer dereference. Unless I'm missing something, please file a separate issue report for this.

jstenback commented 3 years ago

@ahrens I'm experiencing the exact same symptoms that @aerusso is experiencing, in addition to the two null pointer dereferences I've seen over the past week or so. I figured there's a chance they're related, but that may of course not be the case at all.

aerusso commented 3 years ago

For reference, I experienced the corruption in this report after minutes of running 2.0.3. The total time I used 2.0.3 was probably less than 2 hours. I'm guessing that @jstenback has been running the kernel for hundreds of hours. It might be that I just had not yet experienced that symptom. (Of course, it's also possible it's an unrelated bug).

jstenback commented 3 years ago

That is correct, my uptime during both of the crashes I mentioned were on the order of a hundred hours. And I typically start seeing the corruption after about the same amount of uptime.

IvanVolosyuk commented 3 years ago

@aerusso Can you also mention: what was the last 'good' version of ZFS you didn't experience the issue? Can be helpful to narrow down the search.

aerusso commented 3 years ago

@IvanVolosyuk Unfortunately, my last known good configuration is ZFS 0.8.6 and Linux 5.9.15 (and it's stable as a rock back here). I was also unsuccessful in reproducing the bug in a VM (using a byte-for-byte copy of the whole 1 TB nvme).

My current plan (once I can find a free weekend) is to try to bisect on the actual workstation exhibiting the bug. To complicate things, I'll have to do the bisection back here with Linux 5.9.15, since support for 5.10 wasn't added until very late in the release cycle.

glueckself commented 3 years ago

As I've noted in #12014, I'm running 2.0.2 (with Ubuntu 21.04, Linux 5.11.0) since 30th April and I haven't experienced any issues yet.

On my server with Debian Buster, Linux 5.10 and ZFS 2.0.3 (from backports), I've experienced the issue on 4 datasets, ~but not at the same time~, EDIT: two of them at the same time:

            zvm/secure/locker@autosnap_2021-05-05_15:00:30_frequently:<0x0>
            zvm/secure/plappermaul@autosnap_2021-04-30_20:00:30_hourly:<0x0>
            zvm/secure/sunshine-db@autosnap_2021-05-08_00:45:30_frequently:<0x0>
            zvm/secure/dovecote-root@autosnap_2021-05-08_00:45:30_frequently:<0x0>

What I've also noted in the other issue, is that after reverting back to 0.8.4, everything seemed ok. I've also managed to destroy the affected snapshots and multiple scrubs didn't detect any issues.

aerusso commented 3 years ago

I added 3f81aba76 on top of Debian's 2.0.3-8, and am tentatively reporting that I cannot reproduce this bug. I've been running for about 45 minutes now, without the permanent error (I used to experience this bug immediately upon running my sanoid workload, which at this point has run three times).

I would suggest that anyone already running 2.x consider applying that patch.

aerusso commented 3 years ago

Unfortunately my optimism was premature. After another two and a half hours, I did indeed experience another corrupted snapshot.

aerusso commented 3 years ago

After about 3.5 hours of uptime under Linux 5.9.15-1 (making sure this can be reproduced on a kernel supporting the known-good 0.8.6) with ZFS 3c1a2a945 (candidate 2.0.5 with another suspect patch reverted):

zpool events -v reveals

class = "ereport.fs.zfs.authentication"
ena = 0xb91dbe5624302801
detector = (embedded nvlist)
        version = 0x0
        scheme = "zfs"
        pool = 0xe9128a59c39360a
(end detector)
pool = "REDACTED"
pool_guid = 0xe9128a59c39360a
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
zio_objset = 0x83c4
zio_object = 0x0
zio_level = 0x0
zio_blkid = 0x0
time = 0x60c52c8d 0x7a6bfee 
eid = 0x116e

I failed to capture this information in my previous reports. I can reproduce this by trying to send the offending snapshot.

This dataset has encryption set to aes-256-gcm.

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum? My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written? Should I try changing any encryption settings?

# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] avx generic pclmulqdq

aerusso commented 3 years ago

More fun: I can zfs send the "corrupt" snapshot after rebooting the machine (both in 0.8.6 and the 2.0.5-ish that caused the problem; hopefully it doesn't matter, but I tested in that order). Is it possible that some cache is somehow being corrupted?

AttilaFueloep commented 3 years ago

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum?

Yes, both AES-CCM and AES-GCM are authenticated encryption algorithms, which protect against tampering with the cipher text. Encrypting a block creates a MAC. The same MAC is created while decrypting. If they don't match decryption fails, generating an I/O error. Half of the block pointer checksum of an encrypted block is this MAC and the other half the checksum of the encrypted block.

My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written?

There were some problems with metadata(<0x0>) MACs related to ZFS user/group/project used on datasets created with certain master ZFS versions, but I'd need to lookup the details.

Should I try changing any encryption settings?

You could try to change icp_gcm_impl to generic to make sure the avx implementation didn't break something. Since it went into 0.8.4 I don't think this is likely though.

glueckself commented 3 years ago

@aerusso, to answer your questions from r/zfs (hope this is the right issue):

Are your affected datasets encrypted?

Yes. All of them are direct children of an encrypted parent and inherit its encryption.

Does this error coincide with a failed zfs send? Is there also a bunch of snapshots being taken at the same time?

Sanoid is taking snapshots of all datasets every 5 minutes. I can't find any log entry about sanoid failing to send it, however, manually running zfs send zvm/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly > /dev/null says cannot open 'pool/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly': I/O error. Also, trying syncoid with the entire dataset results in cannot iterate filesystems: I/O error

(Here's a real hopeful question) Do you have a specific workload that consistently reproduces the bug? (I have to wait ~hours to reproduce it, which will make bisecting this tragically hard).

Not really. I've changed the syncoid cronjob yesterday to 5 minutes and then it happened. Have you tried reproducing it with really frequent snapshots and sending, something like while true; do uuid > /mnt/pool/dataset/testfile; zfs snapshot pool/dataset@$(uuid); done & while true; do sanoid ...; done? (the testfile write so that the snapshots have some content)

Can you grab a zpool events -v output relatively quickly after any such permanent errors occur? (you can also enable the all-debug.sh zedlet)? Any such output might be very useful.

If possible, please try rebooting, without deleting the snapshot, and accessing it (i.e., zfs send). Does the snapshot send without issue? If so, try scrubbing the pool twice (the pool doesn't forget the errors until there are two clean scrubs). This worked for me, by the way. This makes me suspect that there may be no on-disk corruption at all.

~~To answer the last two: no, after rebooting I get invalid argument when I try to zfs send one of the datasets.~~ Too bad I've read too late that this seems to be an issue with zfs writing to /dev/null (see #11445), so send would've worked probably. Scrubbing it twice resolved the corruptions, thanks!

~~I'm moving everything away from that server right now, afterwards I'll reboot, test that, try to clean it up, then start a zpool events -vf > somelogfile and see if my really-frequent-snapshots/sends-method can reproduce it. ~~ ~~I'm setting up a VM to play around, I don't want to break my hosts pool... ~~ ~~I'll let you know when I have something more. ~~

Ok, so I've had following stuff running for the last few hours: while true; do for i in $(seq 1 10); do zfs snapshot pool1/secure/test${i}@$(uuid); sleep 1; done; done while true; do for i in $(seq 1 10); do dd if=/dev/urandom of=/test/${i}/blafile bs=1M count=5 & done; sleep 1; done while true; do syncoid -r pool1/secure root@192.168.2.73:pool2/secure/t1-test-2; sleep 1; done without any corruption so far. This was running inside a VM, which used a SATA SSD, instead of the NVMe SSDs my normal VMs use. I can waste that SSD, so if you have any suggestions to stress test snapshots & sending, let me know.

glueckself commented 3 years ago

I have one more occurrence, however this time no sends were involved. When I woke up today, my backup server was offline and every I/O seemed to hang (i.e., after I type root + password it would hung and my services were offline). After a hard reboot, every active dataset had snapshot corruptions, which went away after scrubbing twice.

Also, in the dmesg I've got the following warning:

Aug 08 00:29:01 host.example.com kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Aug 08 00:29:01 host.example.com kernel: PANIC at arc.c:3790:arc_buf_destroy()
Aug 08 00:29:01 host.example.com kernel: Showing stack for process 482
Aug 08 00:29:01 host.example.com kernel: CPU: 0 PID: 482 Comm: z_rd_int Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:29:01 host.example.com kernel: Hardware name: <MANUFACTURER> <PRODUCT>, BIOS <VERSION> <DATE>
Aug 08 00:29:01 host.example.com kernel: Call Trace:
Aug 08 00:29:01 host.example.com kernel:  show_stack+0x52/0x58
Aug 08 00:29:01 host.example.com kernel:  dump_stack+0x70/0x8b
Aug 08 00:29:01 host.example.com kernel:  spl_dumpstack+0x29/0x2b [spl]
Aug 08 00:29:01 host.example.com kernel:  spl_panic+0xd4/0xfc [spl]
Aug 08 00:29:01 host.example.com kernel:  ? do_raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? __raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? zfs_zevent_post+0x183/0x1c0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? cityhash4+0x8d/0xa0 [zcommon]
Aug 08 00:29:01 host.example.com kernel:  ? abd_verify+0x15/0x70 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? abd_to_buf+0x12/0x20 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_buf_destroy+0xe8/0xf0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_read_done+0x213/0x4a0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_done+0x39d/0xdc0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_execute+0x92/0xe0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  taskq_thread+0x236/0x420 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? wake_up_q+0xa0/0xa0
Aug 08 00:29:01 host.example.com kernel:  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  kthread+0x12f/0x150
Aug 08 00:29:01 host.example.com kernel:  ? param_set_taskq_kick+0xf0/0xf0 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? __kthread_bind_mask+0x70/0x70
Aug 08 00:29:01 host.example.com kernel:  ret_from_fork+0x22/0x30

One minute afterwards the snapshoting starts, and all the ZFS related tasks start hanging:

Aug 08 00:31:54 host.example.com kernel: INFO: task z_rd_int:482 blocked for more than 120 seconds.
Aug 08 00:31:54 host.example.com kernel:       Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:31:54 host.example.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

(with the call stack below, if you think it's relevant let me know and I can post it)

Affected tasks were also dp_sync_taskq and txg_sync which explains why the I/O was hanging (if I may guess z_rd_int is the read interrupt handler, txg_sync writes the transaction group to the disks).

I don't have the pool events, sorry for that.

EDIT: two more things to note. The system average load is about 0.5 (an old laptop running ~10 VMs) and it probably gets high memory pressure on the ARC. I had 8GB huge pages reserved for the VMs and 4GB zfs_arc_max, with 12GB RAM total - so ARC is going to have to fight with the host system (which is not much, the Linux kernel, libvirt and SSH server - I'd guess 100-200MB). I've now reduced the VM huge pages to 7GB, which should reduce the memory pressure.

aerusso commented 3 years ago

I have never had one of these kernel panics, so it may be better to put this in #12014 (which I see you've already posted in -- I'm also subscribed to that bug). The output of zpool status -v and zpool events immediately after a failed zfs send should give us a lot of diagnostic information.

It's reassuring that merely rebooting and scrubbing makes the invalid reads go away, but you may want to set up the ZED, enable all_debug.sh, and set the output directory to something that is permanent. Then, in the event of a crash, you can sift through the old zpool events.

aerusso commented 3 years ago

I can still reproduce this while running #12346.

aerusso commented 3 years ago

I'm going to try to bisect this bug over the course of the next ~months, hopefully. @behlendorf, are there any particularly dangerous commits/bugs I should be aware of lurking between e9353bc2e and 78fac8d92? Any suggestions on doing this monster bisect (each failed test is going to take about ~3 hours, each successful test probably needs ~6 hours to make sure I'm not just "getting lucky" and not hitting the bug)?

wohali commented 3 years ago

Hey there, we're seeing this as well - not sure if I should put it here, or in #12014 .

System specs

TrueNAS 12.0-U5 (FreeBSD 12.2 + OpenZFS 2.0.5)
AMD Epyc 7F52 in an AsRockRack ROMED8-2T motherboard
512GB RAM ECC
Chelsio T580 LP-CR with 2x 40G
4x LSI 9305-16i controllers
60x SAS-12G SSD; all drives enterprise 8TB and 16TB (56 in use)
SuperMicro 4U, 72-bay chassis with redundant PS & 3x BPN-SAS3-216A-N4 backplanes
- No NVMe devices populated

History

Pool was rebuilt in late July/early August from backup as 9 6-disk RAIDZ2 vdevs with encryption, AES-128-GCM, with AMD hardware acceleration.
Data was replicated from the backup server using zfs send/recv.
Server ran for ~3 weeks without issue before we moved the server into production (acting primarily as NFS server; no VM or DB loads.)
Once it became active, we turn on our backup setup which does regular snapshots every 10min using pyznap, which are then zfs send/recv to the backup server.
Within 24 hours, we started seeing the corruption described in this ticket, causing whole machine crashes.
- No kernel panics were observed, but that could be an artefact of the BMC not displaying the crashes.
- The BMC shows no hardware failures, no memory errors, etc.
- The disks report no errors via SMART
- The zpool status command shows no checksum errors
- We did see these errors on zpool status -v:
```
errors: Permanent errors have been detected in the following files:
<0x21c49>:<0x0>
Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x1>
Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x20>
<0x50a5>:<0x0>
<0xa7d5>:<0x0>
<0xa4de>:<0x0>
<0x191e5>:<0x0>
<0x191e5>:<0x3>
```
- A scrub, interrupted by reboots, eventually reported this without resolving the errors: scan: scrub repaired 0B in 08:15:27 with 0 errors on Tue Aug 24 08:31:40 2021
- We weren't able to confirm at that time that this was a ereport.fs.zfs.authentication failure because we were too busy with DR protocols

Once we removed the load from the system:

We mounted the pool under Ubuntu 21.04 and saw the same output

Back under TrueNAS, zpool events -v confirmed:

Aug 31 2021 20:14:44.609825904 ereport.fs.zfs.authentication
class = "ereport.fs.zfs.authentication"
ena = 0xca1cc9e0cba06c01
detector = (embedded nvlist)
        version = 0x0
        scheme = "zfs"
        pool = 0x6909b6729e67dcf9
(end detector)
pool = "Speedy"
pool_guid = 0x6909b6729e67dcf9
pool_state = 0x0
pool_context = 0x0
pool_failmode = "continue"
zio_objset = 0x65e7
zio_object = 0x0
zio_level = 0xffffffffffffffff
zio_blkid = 0x0
time = 0x612ec5f4 0x24593470
eid = 0x136

After reading this ticket, we were then able to successfully complete a zpool scrub without a crash. The scrub / reboot / scrub process successfully eliminated the errors.
- This didn't work earlier because crashes occurred so often, we could not complete a scrub process without a reboot in the middle.
As a result of reading this ticket we decided to rebuild the pool without encryption. Load has not yet been restored to the system in question.

aerusso commented 3 years ago

@wohali I'm assuming I just don't leave my machine running long enough to experience the crash. I have a some questions:

Can you describe the workload more precisely? I.e., approximately what kind of read/write/file creates are going on? Was the access was primarily/exclusively NFS server? Can you describe what the clients were doing?
When not "in production" was the server primarily just idle?
Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

I think it's very curious that this only seems to happen for people with flash memory. I wonder if it's some race in the encryption code that just doesn't really happen on slower media? Or, is it possible that this is just being caught because of the authentication (encryption), and is otherwise silently passing through for unencrypted datasets?

wohali commented 3 years ago

Can you describe the workload more precisely? I.e., approximately what kind of read/write/file creates are going on? Was the access was primarily/exclusively NFS server? Can you describe what the clients were doing?

Yes, exclusively an NFS server. It's /home directories, so access is varied. Under load it sees a constant 40Gbps of traffic.

2, When not "in production" was the server primarily just idle?

Yes

Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

No, sorry.

dani commented 3 years ago

I also face this issue. My env is:

Debian Bullseye (Proxmox Backup Server)
Kernel 5.11.22-1-pve
ZFS 2.0.5
The pool is a single, simple vdev, built on top of hardware RAID10 10 SATA HDD (I know it's not the recommended setup, but I have reasons to do it this way)
Encryption is aes-256-gcm

The server is running Proxmox Backup Server and BackupPC to handle backups. The load vary (can be very busy during the night, a bit less during the day), but is mainly random read access. Sanoid is used to manage snapshots, and syncoid to replicate the data every hours to a remote location. The corrupted snap can be either those from sanoid, or from syncoid. Never had any crash though, errors disapear after two scrubs (but as scrub takes almost 3 days, most of the time, new corrupted snap appears during the previous scrub)

isopix commented 3 years ago

Is it dangerous to create (and use) new encrypted pool/datasets using OpenZFS v2.1.1? Or is this bug related only to migrated pools from old (pre 2.x.x) pools?

Blackclaws commented 2 years ago

We're seeing similar issues with zfs snapshots on encrypted datasets that seem to run into permanent errors during syncoid runs with zfs authentication via other users.

For us there is no kernel panic involved, its simply the sends that break and permanent errors appear only in snapshots, never on the actual file system itself. Destroying the snapshots in question allows normal sync operations to resume and is otherwise unproblematic (aside from the fact that we've just lost an hourly snapshot). This is on Almalinux with zfs 2.0.7-1.

Issue appears every few hours on different datasets seemingly at random. Might also be caused by the send operations themselves not really sure how to debug this.

atj commented 2 years ago

For us there is no kernel panic involved, its simply the sends that break and permanent errors appear only in snapshots, never on the actual file system itself. Destroying the snapshots in question allows normal sync operations to resume and is otherwise unproblematic (aside from the fact that we've just lost an hourly snapshot). This is on Almalinux with zfs 2.0.7-1.

Issue appears every few hours on different datasets seemingly at random. Might also be caused by the send operations themselves not really sure how to debug this.

I'm experiencing exactly the same symptoms (i.e. no kernel panics just permanent errors detected in snapshots).

Upgrading from ZFS 2.0.X to 2.1.2 seems to have significantly reduced the frequency of the errors. This is on a Debian Bullseye (formerly Buster) system using znapzend to take automated snapshots of multiple encrypted datasets every 15-30 minutes. Deleting the snapshots and performing 2 scrubs clears the errors.

Blackclaws commented 2 years ago

Upgrading to zfs 2.1.2 and scrubbing has also cleared existing errors that would not clear before. When is zfs 2.1.2 going to be released as stable for RHEL based distros? Right now its only available as testing.

Maltz42 commented 2 years ago

I wonder if this issue is about to impact a lot more systems as admins upgrade over the next few months from Ubuntu LTS 20.04 (OpenZFS 0.8.3) to LTS 22.04 (OpenZFS v2.1.2, as of this writing). This issue is the reason I'm going to hold off on that move. I believe Debian 10 also had an unaffected version, I'm not sure what 11 comes with.

cdluminate commented 2 years ago

The next Debian stable release comes with the latest stable release of zfs before the soft freeze stage (expected to be Feb. 2022).

atj commented 2 years ago

v2.1.4 hit Debian Bullseye backports last week. Upgrading from v2.1.2 to v2.1.4 has completely resolved this issue for me.

With v2.1.2 I was getting 100+ snapshot errors every day and having to continually delete the snapshots and then scrub the pool twice to clear them.

Jeroen0494 commented 2 years ago

I wonder if this issue is about to impact a lot more systems as admins upgrade over the next few months from Ubuntu LTS 20.04 (OpenZFS 0.8.3) to LTS 22.04 (OpenZFS v2.1.2, as of this writing). This issue is the reason I'm going to hold off on that move. I believe Debian 10 also had an unaffected version, I'm not sure what 11 comes with.

Yes it is. Migrated from CentOS 7 to Ubuntu 22.04 and immediately ran into this bug.

Interestingly enough, I'm only experiencing data corruption errors on one of my disk in 2 disk mirror:

# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sat May  7 16:35:31 2022
        3.65T scanned at 244M/s, 2.36T issued at 157M/s, 7.97T total
        8M repaired, 29.60% done, 10:23:27 to go
config:

        NAME                                    STATE     READ WRITE CKSUM
        data                                    DEGRADED     0     0     0
          mirror-0                              DEGRADED     0     0     0
            ata-WDC_WD100EFAX-68LHPN0_XXXXX  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_XXXXX  DEGRADED     8     0     0  too many errors  (repairing)

errors: No known data errors

Which is probably why scrubbing the pool immediately fixes the issue for me, at least temporarily. I've decided to scrub the pool from a live Arch Linux image, which has ZFS 2.1.4 unlike Ubuntu which as of writing 2.1.2. But I fear for what will happen if I book my server up again.

I've had checksum errors, but they were fixed in a previous scrub.

Jeroen0494 commented 2 years ago

And know I have issues with both disk:

# zpool status -v
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sat May  7 16:35:31 2022
        7.82T scanned at 301M/s, 3.77T issued at 145M/s, 7.97T total
        8.50M repaired, 47.26% done, 08:26:39 to go
config:

        NAME                                    STATE     READ WRITE CKSUM
        data                                    ONLINE       0     0     0
          mirror-0                              ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_JEHD8AVN  ONLINE       0    15     0
            ata-WDC_WD100EFAX-68LHPN0_JEHUZR6N  ONLINE      27     0    42  (repairing)

errors: No known data errors

Retrodynen commented 2 years ago

@Jeroen0494 I'm not sure if you're encountering the same bug when I experience it the read/write/csum counters never go above 0, zpool status would report a few dozen data errors after removing the snapshots and rerunning scrub then the errors are gone

I think your disks are legitimately failing :(

mheubach commented 2 years ago

i can confirm that bug on Proxmox 7.2.1 (Debian Bullseye / ZFS 2.1.4). We do unencrypted sends of encrypted volumes on this machine. on similar machines with either no encrpytion or raw sends we don't see that problem so far. we "introduced" the bug by upgrading to proxmox 7.2. i can confirm the temporary solution with the following steps: reboot scrub reboot scrub at this point the machine is clean for some hours before new snapshots with errors occur.

mheubach commented 2 years ago

In addition to my post two days ago: this also happens to zvols that are not beeing replicated. It also seems it only happens to ZVOLs. Datasets don't suffer that problem. And it happens to ZVOLs more often, that are beeing replicated.

mheubach commented 2 years ago

Another update:

I did some tests on the machine which shows these errors and all in all it looks like some sort of corruption within the in memory list of available snapshots. Scrub has no effect on this error at all. The output of zpool status -v is just misleading.

In order to get rid of the error for at least some hours it is sufficient to export the pool and import the pool again. zfs list -o name -t snapshot -r runs without errors. A reboot of the host includes the process of exporting the pool. So it is the exporting of the pool which fixes the situation and not the reboot.

After some hours the errors come back. I checked the system for memory leaks or anything that changes the system load within the "error free" time frame. There is nothing out of the ordinary.

I set primarycache=none of one zvol in the hope that "no caching of metadata" also means no caching of available snapshots. But this didn't prevent the snapshot list from becoming currupted again after a while.

Running a scrub while having faulty snapshots results in the following error in the kernel log:

kernel:[1570716.285107] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)

From that point anwards several messages about blocked processes z_rd_int_4... occur. Scrub then stalls. Shutting down VMs with zvols result in stuck kvm processes (the guest os shuts down correctly, but the kvm process never gets terminated (even with KILL signal)).

A reboot of the host blocks when the system tries to unmount zfs datasets. -> hardware reset needed

An important information about the command zpool status -v in this context (how I understand it):

Information about faulty in memory entries in the list of snapshot is reported into the zpool information about data errors.
Running a scrub after reboot removes entries about errors that occured before the last scrub not those before the now running scrub.
Running scrub a second time removes the remaining entries about errors.

So the only thing a scrub actually does in the context of this bug is removing information about former errors.

This is really a nasty bug as it repeatedly stops replication form working and always needs a downtime of all affected VMs.

best regards Manfred

Blackclaws commented 2 years ago

@mheubach

Thanks a lot for the detailed investigation. I can confirm that exporting and importing fixes it as I just tried with a pool that showed the issue. This is nevertheless a big issue as exporting / importing a pool means a halt of operations. While not as problematic as a full reboot it is nevertheless unacceptable if you're going for long uptimes and use replication as a backup strategy.

We see this corruption not just on ZVOLs but also on any other dataset. It seems to scale with the amount of r/w access at least that is an observation we made.

Edit: I think we've also never had this happen to datasets that aren't getting periodic snapshots. What I mean is that those datasets have static snapshots, which are synced as well, but those never seem to corrupt. Maybe it has to do with the sync process itself and snapshotting that takes place while a sync is running?

mheubach commented 2 years ago

@Blackclaws: Your edit is a good point - I immediately checked this as we write tons of logs when replicating.

At for e.g. 2022/07/03 21:35 we have one zvol which is getting a snapshot which is replicated with intermediate snapshots. Replication runs fine, but at the next replication run, accessing this last snapshot causes an I/O error and zed logs "class=authentication" for this snapshot.

At about the same point in time (sadly my logs don't show an excact point in time when the send process finishes) an other zvol receives a snapshot which gets the I/O error immediately on creating it:

(daemon.log:Jul 3 21:35:31 pve1 zed: eid=173504 class=authentication pool='pool-ssd1' bookmark=488645:0:0:0)

So what happens here is that a snapshot is being taken already in a corrupt state (this state is only corrupt in memory - exporting and importing the pool fixes the problem). If this really can be a race condition - i don't know. The only thing i can assume for the moment is that at the same time when this snapshot is taken for this specific zvol a send process of another specific zvol is at a point where it is about to finish.

The result are two disfunctional snapshots for both zvols. At this point I cannot say if both snapshots will be existent after exporting and importing the pool. To check this I have to get another downtime from my customer.

By the way: It is no problem to replicate zvols when omitting intermediate snapshots (switch -i instead of capital I)

My next step is to reduce parallel replication jobs to only one job at the same time. I suppose there will still be problems as regular snapshots are being taken every 15 minutes. But maybe it will reduce the number of occurences.

mheubach commented 2 years ago

Another Update: Limiting parallel send processes to only one, and trying to prevent send processes while snapshots are beeing taken reduces the number of occurences of the error. But sooner or later every zvol has a faulty snapshot again.

Blackclaws commented 2 years ago

As @mheubach has noted, this bug can lead to stuck ZFS processes. This has now brought down one of cluster nodes twice within a couple of months. We're seriously considering turning off zfs encryption at this point as this is really problematic for us.

mheubach commented 2 years ago

@Blackclaws: Can you post the output of zpool get all and zfs get all I'd like to compare this to our system Please make sure not to disclose any sensitive information when posting the data (sometimes even pathnames reveal a lot to potential attackers)

mattchrist commented 2 years ago

I think I'm also experiencing this issue. I also use sanoid to take snapshots of encrypted datasets and syncoid to send them to another host. I'm running NixOS, kernel version 5.15.49, openzfs 2.1.5.

My workaround for the issue without rebooting (or export/import) is to delete the snapshots that had I/O errors and rerun syncoid.

Please let me know if there's anything I can do to help troubleshoot or investigate this.

Edit: my issue may be a bit different, this comment says it's only happening on zvols, not datasets: https://github.com/openzfs/zfs/issues/11688#issuecomment-1145659437

I don't have any zvols, and for me this issue only occurs on datasets.

mheubach commented 2 years ago

@mattchrist: I tried to verify this and picked one of the faulty snapshots. Trying to list the snapshot leads to an I/O error:

root@pve1:/var/log/mhrepl# zfs list pool-ssd1/temp/encrypt/vm-108-disk-0@mhrepl-2nd-now
cannot open 'pool-ssd1/temp/encrypt/vm-108-disk-0@mhrepl-2nd-now': I/O error

Trying to delete the snapshot leads to this output:

root@pve1:/var/log/mhrepl# zfs destroy pool-ssd1/temp/encrypt/vm-108-disk-0@mhrepl-2nd-now
could not find any snapshots to destroy; check snapshot names.

For both actions zed logs the following lines:

Jul 12 09:58:16 pve1 zed: eid=765133 class=authentication pool='pool-ssd1' bookmark=998359:0:0:0
Jul 12 09:58:19 pve1 zed: eid=765142 class=authentication pool='pool-ssd1' bookmark=998359:0:0:0

mheubach commented 2 years ago

Today we experienced the first occurance of blocked I/O. The pool became inaccessible after the error. A second pool on this machine continues to work without interruption. A hardware reset was needed to reboot the host.

Kernellog:

[Do Jul 14 06:54:41 2022] BUG: kernel NULL pointer dereference, address: 0000000000000000
[Do Jul 14 06:54:41 2022] #PF: supervisor read access in kernel mode
[Do Jul 14 06:54:41 2022] #PF: error_code(0x0000) - not-present page
[Do Jul 14 06:54:41 2022] PGD 0 P4D 0 
[Do Jul 14 06:54:41 2022] Oops: 0000 [#1] SMP NOPTI
[Do Jul 14 06:54:41 2022] CPU: 60 PID: 1664770 Comm: zfs Tainted: P        W IO      5.15.35-1-pve #1
[Do Jul 14 06:54:41 2022] Hardware name: Wortmann_AG TERRA_SERVER/S2600WFT, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
[Do Jul 14 06:54:41 2022] RIP: 0010:zap_lockdir_impl+0x2b2/0x7d0 [zfs]
[Do Jul 14 06:54:41 2022] Code: 86 d8 00 00 00 00 00 00 00 e8 9a 8c a0 c4 4d 89 ae 98 00 00 00 e9 3b fe ff ff 48 8b 43 18 b9 a1 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 95 58 ff ff ff 48 c7 c2 a8 b3 a0 c0 48
[Do Jul 14 06:54:41 2022] RSP: 0018:ffffb8e907f07b28 EFLAGS: 00010246
[Do Jul 14 06:54:41 2022] RAX: 0000000000000000 RBX: ffff8fb072fea700 RCX: 00000000000001a1
[Do Jul 14 06:54:41 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000128
[Do Jul 14 06:54:41 2022] RBP: ffffb8e907f07be8 R08: 0000008000000000 R09: 0000000000000000
[Do Jul 14 06:54:41 2022] R10: 0000000000000000 R11: ffff8faf1ad079e0 R12: 0000000000000002
[Do Jul 14 06:54:41 2022] R13: ffff8f81c43ed800 R14: 0000000000000000 R15: ffffb8e907f07c60
[Do Jul 14 06:54:42 2022] FS:  00007fa93dc627c0(0000) GS:ffff8fbbffd00000(0000) knlGS:0000000000000000
[Do Jul 14 06:54:42 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 14 06:54:42 2022] CR2: 0000000000000000 CR3: 0000003002db8006 CR4: 00000000007726e0
[Do Jul 14 06:54:42 2022] PKRU: 55555554
[Do Jul 14 06:54:42 2022] Call Trace:
[Do Jul 14 06:54:42 2022]  <TASK>
[Do Jul 14 06:54:42 2022]  ? dbuf_read+0x2af/0x5c0 [zfs]
[Do Jul 14 06:54:42 2022]  zap_lockdir+0x8c/0xb0 [zfs]
[Do Jul 14 06:54:42 2022]  zap_lookup+0x50/0x100 [zfs]
[Do Jul 14 06:54:42 2022]  zvol_get_stats+0x4a/0x120 [zfs]
[Do Jul 14 06:54:42 2022]  zfs_ioc_objset_stats_impl.part.0+0xa6/0xe0 [zfs]
[Do Jul 14 06:54:42 2022]  zfs_ioc_snapshot_list_next+0x35b/0x400 [zfs]
[Do Jul 14 06:54:42 2022]  zfsdev_ioctl_common+0x760/0x9e0 [zfs]
[Do Jul 14 06:54:42 2022]  ? _copy_from_user+0x2e/0x60
[Do Jul 14 06:54:42 2022]  zfsdev_ioctl+0x57/0xe0 [zfs]
[Do Jul 14 06:54:42 2022]  __x64_sys_ioctl+0x8e/0xc0
[Do Jul 14 06:54:42 2022]  do_syscall_64+0x59/0xc0
[Do Jul 14 06:54:42 2022]  ? exit_to_user_mode_prepare+0x37/0x1b0
[Do Jul 14 06:54:42 2022]  ? irqentry_exit_to_user_mode+0x9/0x20
[Do Jul 14 06:54:42 2022]  ? irqentry_exit+0x19/0x30
[Do Jul 14 06:54:42 2022]  ? exc_page_fault+0x89/0x160
[Do Jul 14 06:54:42 2022]  ? asm_exc_page_fault+0x8/0x30
[Do Jul 14 06:54:42 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Do Jul 14 06:54:42 2022] RIP: 0033:0x7fa93e247cc7
[Do Jul 14 06:54:42 2022] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[Do Jul 14 06:54:42 2022] RSP: 002b:00007ffebbd7bc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Do Jul 14 06:54:42 2022] RAX: ffffffffffffffda RBX: 000055d64c0325a0 RCX: 00007fa93e247cc7
[Do Jul 14 06:54:42 2022] RDX: 00007ffebbd7bcf0 RSI: 0000000000005a15 RDI: 0000000000000003
[Do Jul 14 06:54:42 2022] RBP: 00007ffebbd7bcd0 R08: 0000000000000000 R09: 00007fa93e37dd80
[Do Jul 14 06:54:42 2022] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffebbd7bcf0
[Do Jul 14 06:54:42 2022] R13: 0000000000005a15 R14: 000055d64c0325b0 R15: 00007ffebbd7f310
[Do Jul 14 06:54:42 2022]  </TASK>
[Do Jul 14 06:54:42 2022] Modules linked in: veth tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iscsi_target_mod target_core_mod iptable_filter bpfilter nfnetlink_cttimeout mpt3sas raid_class mptctl bonding mptbase tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast drm_vram_helper drm_ttm_helper ttm irqbypass irdma drm_kms_helper rapl cec ice rc_core i2c_algo_bit fb_sys_fops mei_me syscopyarea intel_cstate pcspkr efi_pstore ib_uverbs ioatdma acpi_ipmi sysfillrect joydev input_leds sysimgblt mei intel_pch_thermal dca ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4
[Do Jul 14 06:54:42 2022]  zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb hid_generic ses usbkbd usbmouse enclosure scsi_transport_sas usbhid uas hid usb_storage crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd i40e xhci_pci xhci_pci_renesas i2c_i801 megaraid_sas lpc_ich i2c_smbus ahci xhci_hcd libahci wmi
[Do Jul 14 06:54:42 2022] CR2: 0000000000000000
[Do Jul 14 06:54:42 2022] ---[ end trace 9d60617b84301ff1 ]---
[Do Jul 14 06:54:42 2022] RIP: 0010:zap_lockdir_impl+0x2b2/0x7d0 [zfs]
[Do Jul 14 06:54:42 2022] Code: 86 d8 00 00 00 00 00 00 00 e8 9a 8c a0 c4 4d 89 ae 98 00 00 00 e9 3b fe ff ff 48 8b 43 18 b9 a1 01 00 00 31 f6 bf 28 01 00 00 <48> 8b 10 48 8b 40 08 48 89 95 58 ff ff ff 48 c7 c2 a8 b3 a0 c0 48
[Do Jul 14 06:54:42 2022] RSP: 0018:ffffb8e907f07b28 EFLAGS: 00010246
[Do Jul 14 06:54:42 2022] RAX: 0000000000000000 RBX: ffff8fb072fea700 RCX: 00000000000001a1
[Do Jul 14 06:54:42 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000128
[Do Jul 14 06:54:42 2022] RBP: ffffb8e907f07be8 R08: 0000008000000000 R09: 0000000000000000
[Do Jul 14 06:54:42 2022] R10: 0000000000000000 R11: ffff8faf1ad079e0 R12: 0000000000000002
[Do Jul 14 06:54:42 2022] R13: ffff8f81c43ed800 R14: 0000000000000000 R15: ffffb8e907f07c60
[Do Jul 14 06:54:42 2022] FS:  00007fa93dc627c0(0000) GS:ffff8fbbffd00000(0000) knlGS:0000000000000000
[Do Jul 14 06:54:42 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 14 06:54:42 2022] CR2: 0000000000000000 CR3: 0000003002db8006 CR4: 00000000007726e0
[Do Jul 14 06:54:42 2022] PKRU: 55555554

mheubach commented 2 years ago

Another weird thing is - zdb is not working on this pool. Can anybody struggling with this bug confirm that zdb is not working on the affected pool?

root@pve1:~# zdb pool-ssd1
zdb: can't open 'pool-ssd1': No such file or directory

ZFS_DBGMSG(zdb) START:
ZFS_DBGMSG(zdb) END

edit: zdb works using the -e switch

mheubach commented 2 years ago

Can we do something like a small survey with the people affected by this bug? Do have encrypted zvols / datasets which you replicate decrypted or do you use raw encrypted zfs sends (switch "-w")?

Maltz42 commented 2 years ago

Raises hand - kind of...

I have encrypted datasets with a nightly send -w / receive to an offsite backup pool. I'm running Ubuntu 20.04 LTS / zfs 0.8.3 and am holding off upgrading to anything newer to avoid this bug. But this is going to have a huge impact on me if it's not fixed in time to make it into Ubuntu 24.04 LTS. It really defeats the entire reason I use ZFS encryption at all - to be able to send/receive without decrypting at the other end.

mattchrist commented 2 years ago

Can we do something like a small survey with the people affected by this bug? Do have encrypted zvols / datasets which you replicate decrypted or do you use raw encrypted zfs sends (switch "-w")?

Encrypted datasets, sending unencrypted (without -w).

openzfs / zfs