Closed: t-m-w closed this issue 2 years ago
I tried to copy your setup to see if I could identify the line of code that's crashing, but my addresses don't seem to match up to yours; I must have failed to replicate your setup in some way. Could you send me either your zfs kernel module or the contents of objdump -D of it?
The kernel module is from https://launchpad.net/~jonathonf/+archive/ubuntu/zfs -- here is a zip of everything in /lib/modules/5.11.0-18-generic/kernel/zfs: zfs.zip
Actually, I think I shared the wrong modules and you need the stuff from the updates/dkms folder...?
I took the opportunity to try to build the zfs package with --enable-debug. I'm not sure if it worked. All I did was run apt source zfs-linux to get the source from the PPA, then edit debian/rules to add --enable-debug after every ./configure line (2 of them) and after a dh_auto_configure line, then run dpkg-buildpackage -us -uc and install the resulting packages, except zfs-dracut. Also, I'm on kernel 5.11.0-22-generic now, so I repeated the process.
Here's those modules again (hopefully this is what you need): zfs.zip
Here's the full output of /dev/kmsg, including after running echo t > /proc/sysrq-trigger: zfs-error-kmsg.log
OK, this'll be my last of several posts in a row in a short time, just to say that I've updated the issue with new information and with a link to download the affected zpool as a sparse image, with unrelated datasets removed, which will hopefully be helpful in solving this (and allow me to remove the zpool/dataset/snapshot from my system if needed). I recommend disregarding my previous two comments.
There seem to be multiple problems here. dmu_dump_write() can be called with bp=NULL when the stream cannot send large_blocks and the buffer size is more than 128 KiB (so the data is split into chunks). That is what causes the panic in your case: when sending raw data, the NULL bp is dereferenced.
Now if we circumvent that and instrument dmu_send.c using this patch:
diff --git a/module/zfs/dmu_send.c b/module/zfs/dmu_send.c
index d65438223..4f2e7c581 100644
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -487,7 +487,7 @@ dmu_dump_write(dmu_send_cookie_t *dscp, dmu_object_type_t type, uint64_t object,
drrw->drr_logical_size = lsize;
/* only set the compression fields if the buf is compressed or raw */
- if (raw || lsize != psize) {
+ if (lsize != psize || raw) {
ASSERT(raw || dscp->dsc_featureflags &
DMU_BACKUP_FEATURE_COMPRESSED);
ASSERT(!BP_IS_EMBEDDED(bp));
@@ -1007,12 +1007,14 @@ do_dump(dmu_send_cookie_t *dscp, struct send_range *range)
* don't allow us to send large blocks, we split the data from
* the arc buf into chunks.
*/
+ cmn_err(CE_NOTE, "datablksz=%lu, datasz=%lu", srdp->datablksz, srdp->datasz);
if (srdp->datablksz > SPA_OLD_MAXBLOCKSIZE &&
!(dscp->dsc_featureflags &
DMU_BACKUP_FEATURE_LARGE_BLOCKS)) {
while (srdp->datablksz > 0 && err == 0) {
int n = MIN(srdp->datablksz,
SPA_OLD_MAXBLOCKSIZE);
+ cmn_err(CE_NOTE, "n=%d", n);
err = dmu_dump_write(dscp, srdp->obj_type,
range->object, offset, n, n, NULL, data);
offset += n;
then fletcher_native() panics:
[ 491.511987] NOTICE: datablksz=1048576, datasz=393216
[ 491.511989] NOTICE: n=131072
[ 491.512112] NOTICE: n=131072
[ 491.512347] NOTICE: n=131072
[ 491.512503] NOTICE: n=131072
[ 491.512512] BUG: unable to handle page fault for address: ffffaefa898be000
[ 491.513186] #PF: supervisor read access in kernel mode
[ 491.513624] #PF: error_code(0x0000) - not-present page
[ 491.514120] PGD 100000067 P4D 100000067 PUD 1001b1067 PMD 1203a4067 PTE 0
[ 491.514651] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 491.515001] CPU: 8 PID: 3210 Comm: lt-zfs Tainted: P OE 5.12.15-arch1-1 #1
[ 491.515713] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ELITE/Z390 AORUS ELITE-CF, BIOS F10g 09/16/2020
[ 491.516677] RIP: 0010:fletcher_4_avx2_native+0x38/0x80 [zcommon]
[ 491.517225] Code: 16 53 48 89 f3 e8 28 ff ff ff c4 c1 7e 6f 04 24 c4 c1 7e 6f 4c 24 20 c4 c1 7e 6f 54 24 40 c4 c1 7e 6f 5c 24 60 48 39 eb 73 1e <c4> e2 7d 35 23 c5 fd d4 c4 c5 f5 d4 c8 c5 ed d4 d1 c5 e5 d4 da 48
[ 491.519221] RSP: 0018:ffffaefa893a3520 EFLAGS: 00010006
[ 491.519706] RAX: 0000000000000000 RBX: ffffaefa898be000 RCX: 0000000000000000
[ 491.520364] RDX: 00000000ffffffff RSI: ffffaefa898bd000 RDI: ffff98288f295000
[ 491.521008] RBP: ffffaefa898dd000 R08: 0000000000020000 R09: ffff982882bdb020
[ 491.521618] R10: 0000000000000002 R11: ffff98288a42c710 R12: ffffaefa893a3540
[ 491.522179] R13: ffffaefa898bd000 R14: 0000000000020000 R15: ffffaefa893a3960
[ 491.522640] FS: 00007fd9df5327c0(0000) GS:ffff9828fbc00000(0000) knlGS:0000000000000000
[ 491.523203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 491.523593] CR2: ffffaefa898be000 CR3: 000000011b43e005 CR4: 0000000000370ee0
[ 491.524080] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 491.524567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 491.525049] Call Trace:
[ 491.525229] fletcher_4_native_impl+0x62/0xa0 [zcommon]
[ 491.525580] ? __kernel_write+0x2c3/0x310
[ 491.525907] fletcher_4_native+0xd0/0xf0 [zcommon]
[ 491.526375] fletcher_4_incremental_impl+0x57/0xc0 [zcommon]
[ 491.526789] fletcher_4_incremental_native+0x2f/0x40 [zcommon]
[ 491.527157] dump_record+0xbb/0x200 [zfs]
Notice the:
[ 491.511987] NOTICE: datablksz=1048576, datasz=393216
Is it possible to split into chunks when raw sending an encrypted dataset? It seems that if we split into chunks based on datasz instead of datablksz (in do_dump()) in the case of a raw encrypted send, then the send succeeds.
After more careful consideration of the current codebase it seems it's not possible to raw send in small chunks.
@t-m-w if this issue is still relevant to you, could you try a normal (non-raw) zfs send of rpool_borked/CORRUPTED_media@zfs-auto-snap_weekly-2021-05-09-1348 with the keys loaded? That should probably work.
@gamanakis Yes, I can confirm that a non-raw send works just fine and does not cause this problem. If you'd like me to privately share any of the decrypted data for testing or troubleshooting, let me know, or if there's anything else you need me to try, I'd be happy to do so.
I have the same problem affecting 3 (out of 81) filesystems on my FreeBSD system - see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259093
I've been doing some digging through both my crashdumps and using dtrace on my live system. The immediate trigger path is: dmu_send_obj() calls dsl_dataset_hold_obj_flags(), which calls dsl_dataset_hold_obj(), which iterates through load_zfeature() for each per-dataset feature (org.open-zfs:large_blocks, org.zfsonlinux:large_dnode, org.illumos:sha512, org.illumos:skein, org.illumos:edonr, org.zfsonlinux:userobj_accounting, com.datto:encryption, org.zfsonlinux:project_quota, org.freebsd:zstd_compress and com.datto:ivset_guid) to initialise dspp.to_ds->ds_feature. dmu_send_obj() then calls dmu_send_impl(), which calls setup_featureflags() to translate the to_ds.ds_feature array into flag bits in featureflags. For the affected filesystem, org.open-zfs:large_blocks is incorrectly false on the dataset (though it's correctly "active" on the pool), and therefore SPA_FEATURE_LARGE_BLOCKS is not set in the to_ds.ds_feature array; hence DMU_BACKUP_FEATURE_LARGE_BLOCKS is not set in featureflags even though large blocks exist in the dataset.
Working the other way, the only place where SPA_FEATURE_LARGE_BLOCKS is set is in dsl_dataset_block_born(), which marks it as needing activation if (BP_GET_LSIZE(bp) > SPA_OLD_MAXBLOCKSIZE). The "activation" is then converted to "active" in dsl_dataset_sync() if it's not already marked active. I haven't yet traced the actual path to see where that code is failing.
If that's the case, the bug here isn't in the send logic at all; it would appear to be in the per-dataset feature logic or the large block logic somewhere. I don't know as much about those codepaths... @ahrens or @don-brady might know more?
I have spent some time with dtrace and, whilst I can see ds_feature_activation being set in dsl_dataset_block_born(), I can't find the matching dsl_dataset_sync(). Possibly this is just lack of dtrace-foo, but it looks very much like there are code paths where dsl_dataset_sync() is missing.
@peterjeremy Nice analysis of where the problem is occurring. Are you saying that after ds_feature_activation is set, dsl_dataset_sync() is not called on this dataset? Or that it's called but it doesn't call dsl_dataset_activate_feature()?
If dsl_dataset_block_born() is called, the dataset should already be "dirty" (i.e. on dp_dirty_datasets), and therefore dsl_pool_sync() should call dsl_dataset_sync() on it. If that doesn't seem to be the case, then we could try adding an assertion to dsl_dataset_block_born() asserting that the dataset is dirty. There might be some corner cases where that isn't true, some of which might be fine but it's possible that there's a bug here.
@gamanakis
After more careful consideration of the current codebase it seems it's not possible to raw send in small chunks
That's right. We should either be failing that as an error, or (perhaps better) saying that raw send implies --large-blocks (whether encrypted or not).
There is the following codepath: dsl_dataset_sync() -> dmu_objset_sync() -> zio_nowait(). In the zio_nowait() of dmu_objset_sync(), the callback function is dmu_objset_write_done() -> dsl_dataset_block_born(). My question is: is it possible here that dsl_dataset_sync() completes before zio_nowait() calls dsl_dataset_block_born()? There seems to be a race between dsl_dataset_sync() and the callback dmu_objset_write_done() -> dsl_dataset_block_born().
System information
UPDATE: The version of zfs is the official Ubuntu version. I initially posted this when I was using a ppa; I've since returned to the official Ubuntu zfs and experience the same problem, but I've provided new logs and info in this post after changing versions. I've also provided a download for an affected zpool.
Describe the problem you're observing
Every time I run a zfs send that involves a particular encrypted snapshot, I receive a NULL pointer dereference in the kernel. If also using zfs receive, it stops all future-executed zfs commands from responding and prevents the computer from rebooting cleanly. In some cases, the involved send process may be 'Killed'. The pool remains busy and cannot be exported, and further instability/hung behavior may result.
Describe how to reproduce the problem
Download the compressed sparse image of my affected zpool here: https://drive.google.com/file/d/1SFL-OMob8DRTJJIsq2O77SkLy3knptaJ/view?usp=sharing sha256sum: afbc980346d75808daf8cc7a3d85a27290c09c35010b39670103d91c43afcfc3
sudo tar -C /root -xvf rpool_borked.tar.xz rpool_borked.img
sudo zpool import -N rpool_borked -o readonly=on -d /root/rpool_borked.img
sudo dmesg --follow
sudo zfs send -v -w rpool_borked/CORRUPTED_media@zfs-auto-snap_weekly-2021-05-09-1348 > send.bin
zpool export rpool_borked
Further notes
feature@large_blocks is currently actually just enabled, not active, for this affected zpool, in case that is relevant. But it was active at one point, before other datasets and snapshots were destroyed. Further down in this issue, you can see the status of other features/properties.
Include any warning/errors/backtraces from the system logs
Kernel output, original server, from a zfs send performed from within dropbear-initramfs (minimal environment):
On a different, fully-booted system, I also experienced this problem using the provided image as described in the reproduction steps, but piped to zstreamdump:
zfs send -v -w rpool_borked/CORRUPTED_media@zfs-auto-snap_weekly-2021-05-09-1348 | zstreamdump -vvvvv
Output of that command is here:
Its dmesg output is below:
zpool properties for rpool_borked:
dataset properties for rpool_borked/CORRUPTED_media:
snapshot properties for rpool_borked/CORRUPTED_media@zfs-auto-snap_weekly-2021-05-09-1348:
Full output of /dev/kmsg in the test from initramfs, including echo t > /proc/sysrq-trigger: kmsg-full.log
Sorry if more of this should have been attachment rather than inline; I just figured inline would be more convenient.