openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Unable to receive encrypted raw send #12720

Closed · putnam closed this issue 2 years ago

putnam commented 2 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version bookworm/testing
Kernel Version 5.14.0-2
Architecture amd64
OpenZFS Version 2.0.6-1

Describe the problem you're observing

I am trying to transfer an encrypted dataset to another pool using a raw send. Partway through the send, the receive side fails with an unclear error message.

The receiving system is also on 2.0.6, running Debian Bullseye on kernel 5.10.0-9.

Here are the commands that reproduce the issue reliably:

source# zfs send -Lcw -v tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 | mbuffer -O 10.200.200.2:13337

dest# mbuffer -I 13337 | zfs receive -uv archive/dwight/tank/private

The dataset is about 1.3TB; at the 360GB mark, the receive reliably fails with the following error:

cannot receive new filesystem stream: invalid backup stream

Oddly, if you provide the -s flag to the recv (to enable resume support), the error changes to this:

cannot receive new filesystem stream: kernel modules must be upgraded to receive this stream.

I tried sending the stream through zstreamdump, which did not complain. I also maximized debug logging, generated a resume token and tried resuming (right at whatever transaction caused the recv to fail), which generated these messages (gist). It's noisy; see line 1786 for when receive_object failed.
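
For reference, the zstreamdump check looked roughly like this (same snapshot as in the commands above; zstreamdump reads the stream on stdin and prints a summary):

# pipe the stream through zstreamdump instead of the network; it exited cleanly with no complaints
zfs send -Lcw tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 | zstreamdump > /dev/null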

I don't know the exact steps to reproduce this problem with another dataset. Originally this dataset was on another pool in an unencrypted state. It was sent into the encrypted pool just fine, then I wanted to raw-send the newly-encrypted dataset offsite to another pool. It is at this point that I ran into this show-stopper.

See also the gist below which is the output of zstreamdump -vvv on the resumed send.

https://gist.github.com/putnam/3a467fdab6fe96ca048264c1bcd8a4b0

UPDATE: In the discussion below, I was able to shrink the affected dataset down to a ~70MB repro. See this comment for an attachment.

putnam commented 2 years ago

I'm actually not sure what to do here. I can't send a raw backup, full stop. And if I send it non-raw, I can't switch the incrementals to be raw because the IVs won't be the same. What am I meant to do in this situation? I will try running 2.1.0 in a VM or something on the target host to see if it makes any difference.

rincebrain commented 2 years ago

In theory, that message isn't supposed to happen unless you're sending from a newer version than being received on, and somehow it didn't notice a missing feature required until later, at which point the "fix" is pretty obvious.

In practice, that message comes out for all sorts of edge cases, and could probably stand to be refined, though I'd personally just suggest changing it to say something like "Either you need to upgrade your receiver to a newer version or, if they're not different (or the receiver is newer), you've hit an edge case we aren't handling, please report a bug."

Would you possibly be willing to share a full (e.g. not solely the resume portion, though that might be helpful in addition to the whole) raw (e.g. encrypted data blocks, though not some metadata) send stream with someone to debug? I'm probably not an ideal candidate, not being familiar with the over-the-wire format of send/recv or all the places encrypted recv touches, but it'd probably be very helpful if someone could repro it locally and iterate on whatever's going awry rather than having to round trip to you for testing each time. (It's perfectly okay to say no if you're not comfortable; it would just be useful to know.)

I suspect it'll just say "lol it's encrypted go pound sand", from recollection, but what do zdb -dbdbdbdbdbdb tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 269839 and [...] 269838 have to say?

You said the other system is on bullseye with 2.0.6, is that from backports? Could you give us the full output of zfs version on both hosts, just for completeness?

Also just for completeness, could you try receiving the stream locally somewhere on the sender to see if it errors the same way?

One final remark - while you're correct that you couldn't go from non-raw to raw send, you could still do non-raw send and enable encryption on the receiver like with any non-raw send being received. This of course doesn't work if you don't trust the receiver to have access to the encryption key, but that may or may not be an issue in your environment.
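
Roughly, something like this on the receiving side (assuming archive/enc there is itself an encrypted dataset; a non-raw stream received under it inherits its encryption, so the data gets re-encrypted with the receiver's key):

# sketch only - dataset names here are illustrative, not the ones from your exact setup
zfs send -Lc tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 | zfs recv -u archive/enc/private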

putnam commented 2 years ago

Thanks for your reply, @rincebrain.

I'll try to address everything in order.

Sharing a full send is going to be hard. It's 1.3TB, and it's a dataset I'd prefer not to directly share anyway. But I realize that it may only repro on this exact snapshot on this exact dataset. I don't mind the back-and-forth, though.

Attached is the output of zdb for each of those objects as requested: zdb_269838.txt zdb_269839.txt

The target system is on 2.0.6 via backports.

Source zfs version:

zfs-2.0.6-1
zfs-kmod-2.0.6-1

Target zfs version:

zfs-2.0.6-1~bpo11+1
zfs-kmod-2.0.6-1~bpo11+1

I just tried sending the stream locally on the source machine to a new dataset and hit the same error at the same position:

# zfs send -Lcw tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 | mbuffer | zfs recv -u tank/privatetest
in @  128 MiB/s, out @  0.0 kiB/s,  360 GiB total, buffer  10% full
cannot receive new filesystem stream: invalid backup stream

As a side note, the performance of zfs send here has been relatively poor. Before I switched to encryption I could zfs send at a rate of about 1-1.5 GiB/second. Unlike sequential scrubs, zfs send seems to be IOP bound. I'm not sure what makes the encrypted version so much slower. The pool topology on the source machine is 5x11 raidz2. Target is 3x12 raidz2. All are spinning disks between 12T-18T in size.

Regarding sending the backup unencrypted: I specifically want raw sends to work, since the streams will be going to an untrusted offsite system.

rincebrain commented 2 years ago

Sharing a full send is going to be hard. It's 1.3TB, and it's a dataset I'd prefer not to directly share anyway. But I realize that it may only repro on this exact snapshot on this exact dataset. I don't mind the back-and-forth, though.

Okay, I just figured it'd be worth asking (though it'd also be interesting to see if you could create a clone with only the minimal data necessary to reproduce - possibly by doing a zfs clone of the snapshot, finding just the specific files those IDs I asked you for the zdb data for map to, and nuking everything but them and their containing dirs...you could find the specific files with find /path/to/mounted/snapshot/or/clone -type f -inum [the number in question], though this may take a bit. It's also likely to be error prone, so mayhaps clone, snapshot the clone, verify the clone snapshot still errors on send in the same place, then nuke everything but the entire unmodified directory and parents of those objects, then snapshot that and make sure the snapshot still reproduces, and so on...)

I don't know of any performance bottlenecks specific to only raw send - I'd be curious to check out a flamegraph of send -Lec versus send -w on the same dataset over, say, sleep 120 - it might be IO bound, but I'd like to see if there's any visibly different callstacks that turn up...
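
Something along these lines would do it, assuming perf and the FlameGraph scripts (stackcollapse-perf.pl / flamegraph.pl) are available on the sender - paths and output names are illustrative:

# profile ~120s of an in-flight raw send, then repeat with `zfs send -Lec` and compare the SVGs
zfs send -Lcw tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 > /dev/null &
perf record -F 99 -a -g -- sleep 120
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > send_raw.svg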

RAIDZs aren't particularly made of IOPs in general - what makes you think it's IOP bound and not CPU or other bound? (If it's "it's in iowait a bunch", that could mean many things.)

Yeah, I'm not surprised that unencrypted doesn't work for you, just thought I'd ask. :/

putnam commented 2 years ago

Good idea on the clone strategy. Worked like a charm; it now repros with a 73M dataset.

Here are my notes as I did this, just for reference:

# Created clone
zfs clone tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 tank/privatetest

# Initial clone snapshot (no modifications)
zfs snapshot tank/privatetest@clonesnap

# Sending full clone triggers same error at same position
zfs send -Lcw tank/privatetest@clonesnap | mbuffer | zfs recv -u tank/privatetest_receive
in @  128 MiB/s, out @  0.0 kiB/s,  360 GiB total, buffer  21% full
cannot receive new filesystem stream: invalid backup stream

# Whittling down filesystem to only affected inodes
find /tank/privatetest -type f -inum 269838 -or -inum 269839

# deleted all other folders, size is now 285M
zfs snapshot tank/privatetest@afterdeletion

# tried send again, failed again (faster this time)
zfs send -Lcw tank/privatetest@afterdeletion | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream

# deleted sibling files to minimize size, snapped, tested again (73M)
zfs snapshot tank/privatetest@afterdeletion2
zfs send -Lcw tank/privatetest@afterdeletion2 | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream

Here is a dropbox link of the zfs send: https://www.dropbox.com/s/p7yykx47syl7jm5/zfs_send_culled_private.bin.zip?dl=0

You're right that I sort of assumed it was IOP-bound simply because of frequent iowait on zfs send. CPU usage is only 15-20% of one core. I will definitely dig into this separately once I get past the send issue.

putnam commented 2 years ago

To add a little more context if it matters, here is this dataset's zfs send/recv "provenance". Originally it was unencrypted on a single host running normally. When I prepared the offsite, I wanted to also take this opportunity to transfer the data back-and-forth to get encryption on both hosts and also rebalance data across vdevs.

The unencrypted dataset was sent to the backup machine into an encrypted dataset, then sent back into the primary pool so it would be encrypted -- I did not use a raw send for that step, though. For raw sends to the offsite to work, I needed to once again wipe the pool on the offsite and re-send the dataset with the raw flag.

It would have been much easier to use a raw send from the start, but there is a long-standing bug that directly affected my workflow here. Once I sent the dataset to the offsite, I needed to briefly wipe the source pool and send everything back, but I kept the source system online by mounting the offsite over NFS. This resulted in the dataset changing and needing to be sent back, which is exactly the case affected by that bug. I confirmed it with a dry run before I started this journey and didn't want to take the risk. So I decided to do an extra transfer once everything was situated on the primary host, so the raw sends could start clean and never risk being modified on the offsite and sent back.

tl;dr:

  1. Dataset began unencrypted on SOURCE (2.0.6)
  2. Dataset was sent into a new, encrypted dataset on TARGET (2.0.3)
  3. Source pool destroyed and recreated. Dataset was then sent back from TARGET to SOURCE, with a new encryption key.
  4. Finally, attempted to replicate SOURCE to TARGET again using a raw send, which led to this ticket. In this process I updated TARGET to 2.0.6 just to see if it affected anything, which it did not.

The most recent zfs send that worked fine was step 3. The affected files are not new at all; they're old stuff.

rincebrain commented 2 years ago

Just to be clear, did you send just this dataset, or was it part of a hierarchy of datasets being sent around?

You could always try cherrypicking/un-reverting d1d47691c256cd7e15af7819ea33024e6693e9c5 and seeing if it helps your life.
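
Roughly (the tag and build steps below are the usual OpenZFS ones; whether the commit still applies cleanly to your release is another question):

git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.0.6                                # match the running release
git cherry-pick d1d47691c256cd7e15af7819ea33024e6693e9c5
sh autogen.sh && ./configure && make -s -j$(nproc)    # then install/load the rebuilt modules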

putnam commented 2 years ago

Only this dataset in this case. Not a hierarchy.


putnam commented 2 years ago

I'm a little surprised -- is nobody working on bugs on raw sends? I can reliably reproduce this and provided an easy repro case. I'd definitely like to dig in myself but it's a lot to spin up. Maybe I hit some edge case as I'm surprised nobody else seems to have run into this. As it stands I'm just waiting until 2.1.1 lands in Debian (the packager finally uploaded to experimental a week ago) to try to test further.

rincebrain commented 2 years ago

Not quite nobody, no. Just almost.

I've failed to convince anybody else so far to work on these, but I'm trying to knock them out one at a time, unfortunately slowly.

I believe the root cause(s) for some of the bugs that behave similarly to this one are understood; there's just no fix written and committed yet.

As far as how this situation arose, it seems it works well for the few big companies that use it, and there haven't been many people in the intersection of {willing to try fixing, uses native encryption} - even I don't meet those criteria; I don't use it normally.

I'm displeased by this, but I'm already doing all I can about it.


bghira commented 2 years ago

I'm a little surprised -- is nobody working on bugs on raw sends?

For what it's worth, this feature has been broken since it landed and it's received no love from the dev team. Datto contributed it but doesn't wish to contribute to its maintenance. This is one reason I'm skeptical of all the new features (dRAID, object storage) - they're running the BTRFS playbook, adding feature after feature without validating the existing functionality. All of that, on top of a major reorganisation of code as platforms consolidate on OpenZFS.

I honestly don't blame anyone who doesn't want to upgrade from 0.8.x.

putnam commented 2 years ago

I have updated to 2.1.1 after it landed in Debian unstable.

The send stream generated by 2.1.1 is identical to the send stream generated by 2.0.6. The operation fails in the same way:

$ cat /sys/module/zfs/version 
2.1.1-3

$ zfs send -Lcw tank/privatetest@afterdeletion2 > zfs_send_culled_private_2.1.1.bin

$ diff -s zfs_send_culled_private.bin zfs_send_culled_private_2.1.1.bin 
Files zfs_send_culled_private.bin and zfs_send_culled_private_2.1.1.bin are identical

# (this is the equivalent of downloading my repro and catting it to zfs recv)
$ zfs send -Lcw tank/privatetest@afterdeletion2 | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream
gamanakis commented 2 years ago

@putnam do you have a reproducer for this one?

putnam commented 2 years ago

Yes, please see this comment above: https://github.com/openzfs/zfs/issues/12720#issuecomment-964677615

You can grab the ~70MB repro via the Dropbox link at the bottom. Try receiving that into a new dataset.
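
i.e., roughly (pool/dataset name is just an example):

unzip zfs_send_culled_private.bin.zip
cat zfs_send_culled_private.bin | zfs recv -u testpool/repro
# fails with: cannot receive new filesystem stream: invalid backup stream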

gamanakis commented 2 years ago
case EINVAL:
        if (flags->resumable) {
                zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
                    "kernel modules must be upgraded to "
                    "receive this stream."));
        } else if (embedded && !raw) {
                zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
                    "incompatible embedded data stream "
                    "feature with encrypted receive."));
        }
        (void) zfs_error(hdl, EZFS_BADSTREAM, errbuf);
        break;

It's the above part of zfs_receive_one() that is producing the error. The EINVAL comes from receive_object():

if (rwa->raw) {
        /*
         * We should have received a DRR_OBJECT_RANGE record
         * containing this block and stored it in rwa.
         */
        if (drro->drr_object < rwa->or_firstobj ||
            drro->drr_object >= rwa->or_firstobj + rwa->or_numslots ||
            drro->drr_raw_bonuslen < drro->drr_bonuslen ||
            drro->drr_indblkshift > SPA_MAXBLOCKSHIFT ||
            drro->drr_nlevels > DN_MAX_LEVELS ||
            drro->drr_nblkptr > DN_MAX_NBLKPTR ||
            DN_SLOTS_TO_BONUSLEN(dn_slots) <
            drro->drr_raw_bonuslen) {
                return (SET_ERROR(EINVAL));
Now there are a bunch of possibilities here; let's see which one is failing.

gamanakis commented 2 years ago
drro->drr_raw_bonuslen < drro->drr_bonuslen

is failing, leading to the error. Let me see if I can figure out what is going on.

The drr_raw_bonuslen is set in dump_dnode():

/*
 * Since we encrypt the entire bonus area, the (raw) part
 * beyond the bonuslen is actually nonzero, so we need
 * to send it.
 */
if (bonuslen != 0) {
        drro->drr_raw_bonuslen = DN_MAX_BONUS_LEN(dnp);
        bonuslen = drro->drr_raw_bonuslen;
}

These parts of dmu_recv.c and dmu_send.c, all related to raw_bonuslen, are part of commit b525630342.

gamanakis commented 2 years ago

Inspecting the raw dataset from @putnam with cat ~/zfs_send_culled_private.bin | zstreamdump -vvv | grep bonuslen | less gives:

OBJECT object = 1 type = 21 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 2 type = 45 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 3 type = 22 bonustype = 0 blksz = 1536 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 4 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 5 type = 46 bonustype = 0 blksz = 1536 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 6 type = 47 bonustype = 0 blksz = 16384 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 7 type = 20 bonustype = 44 blksz = 16384 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 269838 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 281 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 327985 type = 20 bonustype = 44 blksz = 16384 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 331405 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 389480 type = 20 bonustype = 44 blksz = 11776 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 464994 type = 20 bonustype = 44 blksz = 8192 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 464996 type = 20 bonustype = 44 blksz = 1536 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1

Which shows the problematic one: OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1

The drr_raw_bonuslen is calculated with DN_MAX_BONUS_LEN in dnode.h:

#define DN_MAX_BONUS_LEN(dnp) \
        ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) ? \
        (uint8_t *)DN_SPILL_BLKPTR(dnp) - (uint8_t *)DN_BONUS(dnp) : \
        (uint8_t *)(dnp + (dnp->dn_extra_slots + 1)) - (uint8_t *)DN_BONUS(dnp))

and since we know that the flag SPILL_BLKPTR is set (see zdb dump in https://github.com/openzfs/zfs/issues/12720#issuecomment-963927644), we have:

(uint8_t *)DN_SPILL_BLKPTR(dnp) - (uint8_t *)DN_BONUS(dnp)
gamanakis commented 2 years ago

Right now, I cannot see how drro->drr_raw_bonuslen < drro->drr_bonuslen could happen, since the raw bonuslen is always set to DN_MAX_BONUS_LEN.

@putnam If you would absolutely like to try to retrieve the data using a raw receive, you could try the following patch (with every precaution; with it applied, I was able to raw-receive that dataset into a new pool):

diff --git a/module/zfs/dmu_recv.c b/module/zfs/dmu_recv.c
index f132219c9..5349e51d5 100644
--- a/module/zfs/dmu_recv.c
+++ b/module/zfs/dmu_recv.c
@@ -1592,7 +1592,6 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
                 */
                if (drro->drr_object < rwa->or_firstobj ||
                    drro->drr_object >= rwa->or_firstobj + rwa->or_numslots ||
-                   drro->drr_raw_bonuslen < drro->drr_bonuslen ||
                    drro->drr_indblkshift > SPA_MAXBLOCKSHIFT ||
                    drro->drr_nlevels > DN_MAX_LEVELS ||
                    drro->drr_nblkptr > DN_MAX_NBLKPTR ||
gamanakis commented 2 years ago

So, we have a dnode:

 OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1

with 1 slot, meaning it is 512 bytes, leaving 448 bytes (512 - 64) after the dnode core. It also has 1 blkptr = 128 bytes and, because the SPILL_BLKPTR flag is set, a spill pointer = 128 bytes. Which means the bonuslen should be at most 192, not the 288 stored in the zfs send stream.

A bonuslen of 288 means 288 (bonus) + 64 (core) + 128 (1 blkptr) = 480 bytes, leaving only 512 - 480 = 32 bytes for a spill pointer whose size is 128 bytes, which is impossible.

I think this means that something probably went wrong when setting/clearing DNODE_FLAG_SPILL_BLKPTR or when setting the bonuslen of the dnode. In a 512-byte dnode we cannot have a bonuslen of 288 bytes when the SPILL_BLKPTR flag is set.
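
The same arithmetic, spelled out (the constants are the standard on-disk sizes: 64-byte dnode core, 128-byte block pointer, 128-byte spill pointer, 512-byte legacy dnode):

# max bonuslen for a 1-slot dnode with nblkptr = 1
echo $(( 512 - 64 - 128 ))         # 320: no spill pointer
echo $(( 512 - 64 - 128 - 128 ))   # 192: with a spill pointer at the end of the dnode
# a bonuslen of 288 only fits in the no-spill layout, yet the stream has the spill flag (flags = 4) set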

gamanakis commented 2 years ago

I just came across commit caf9dd209fdcfccabc2f32b3f23c5386ccfb896c which may be highly relevant. @putnam what is your zfs parameter "zfs_send_unmodified_spill_blocks" set to?
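
For reference, the live value can be read straight from the module parameters:

cat /sys/module/zfs/parameters/zfs_send_unmodified_spill_blocks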

EDIT: this commit and zfs parameter don't apply to raw sends.

putnam commented 2 years ago

Thank you for digging into this!

The param is set to "1" for me. AFAIK this is the default(?)


gamanakis commented 2 years ago

Yes, that is the default but doesn't affect raw send...

Do you have xattr=sa for that dataset? Also does that particular file have anything peculiar in terms of system attributes? Like symlinks, dates or any other attributes?

gamanakis commented 2 years ago

The current ZTS has a test, tests/zfs-tests/tests/functional/rsend/send_spill_block.ksh, which adds xattrs to files to fill spill blocks, sends a snapshot, removes the xattrs, sends an incremental snapshot, and verifies everything arrives intact. I modified that test to use encryption and raw sends, and it still works as expected.
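
The modification amounts to something like this on top of the existing test (a sketch, not the exact diff; ZTS variable names are approximate):

# create an encrypted source fs instead of a plain one, then run the same spill-block steps with -w
log_must eval "echo 'password' | zfs create -o encryption=on -o keyformat=passphrase $TESTPOOL/fs"
log_must eval "zfs send -w $TESTPOOL/fs@snap1 | zfs recv $TESTPOOL/recv"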

I will probably submit a PR with that raw-send variant of the test, just in case something breaks in the future.

However, the problem in the current issue is that somehow dnp->dn_bonuslen and dnp->dn_flags, stored on-disk, contradict each other.

putnam commented 2 years ago

To answer your questions on the dataset, here is zfs get all on the set in question. xattr=sa is indeed set.

NAME              PROPERTY              VALUE                  SOURCE
tank/enc/private  type                  filesystem             -
tank/enc/private  creation              Thu Oct 28  5:45 2021  -
tank/enc/private  used                  1.43T                  -
tank/enc/private  available             250T                   -
tank/enc/private  referenced            1.23T                  -
tank/enc/private  compressratio         1.02x                  -
tank/enc/private  mounted               yes                    -
tank/enc/private  quota                 none                   default
tank/enc/private  reservation           none                   default
tank/enc/private  recordsize            1M                     inherited from tank
tank/enc/private  mountpoint            /tank/private          local
tank/enc/private  sharenfs              off                    default
tank/enc/private  checksum              on                     default
tank/enc/private  compression           lz4                    inherited from tank
tank/enc/private  atime                 off                    inherited from tank
tank/enc/private  devices               on                     default
tank/enc/private  exec                  on                     default
tank/enc/private  setuid                on                     default
tank/enc/private  readonly              off                    local
tank/enc/private  zoned                 off                    default
tank/enc/private  snapdir               hidden                 default
tank/enc/private  aclmode               discard                default
tank/enc/private  aclinherit            restricted             default
tank/enc/private  createtxg             2734                   -
tank/enc/private  canmount              on                     default
tank/enc/private  xattr                 sa                     inherited from tank
tank/enc/private  copies                1                      default
tank/enc/private  version               5                      -
tank/enc/private  utf8only              off                    -
tank/enc/private  normalization         none                   -
tank/enc/private  casesensitivity       sensitive              -
tank/enc/private  vscan                 off                    default
tank/enc/private  nbmand                off                    default
tank/enc/private  sharesmb              off                    default
tank/enc/private  refquota              none                   default
tank/enc/private  refreservation        none                   default
tank/enc/private  guid                  16023833233617790934   -
tank/enc/private  primarycache          all                    default
tank/enc/private  secondarycache        all                    default
tank/enc/private  usedbysnapshots       203G                   -
tank/enc/private  usedbydataset         1.23T                  -
tank/enc/private  usedbychildren        0B                     -
tank/enc/private  usedbyrefreservation  0B                     -
tank/enc/private  logbias               latency                default
tank/enc/private  objsetid              2362                   -
tank/enc/private  dedup                 off                    default
tank/enc/private  mlslabel              none                   default
tank/enc/private  sync                  standard               default
tank/enc/private  dnodesize             legacy                 default
tank/enc/private  refcompressratio      1.01x                  -
tank/enc/private  written               0                      -
tank/enc/private  logicalused           1.48T                  -
tank/enc/private  logicalreferenced     1.25T                  -
tank/enc/private  volmode               default                default
tank/enc/private  filesystem_limit      none                   default
tank/enc/private  snapshot_limit        none                   default
tank/enc/private  filesystem_count      none                   default
tank/enc/private  snapshot_count        none                   default
tank/enc/private  snapdev               hidden                 default
tank/enc/private  acltype               off                    default
tank/enc/private  context               none                   default
tank/enc/private  fscontext             none                   default
tank/enc/private  defcontext            none                   default
tank/enc/private  rootcontext           none                   default
tank/enc/private  relatime              off                    default
tank/enc/private  redundant_metadata    all                    default
tank/enc/private  overlay               on                     default
tank/enc/private  encryption            aes-256-gcm            -
tank/enc/private  keylocation           none                   default
tank/enc/private  keyformat             passphrase             -
tank/enc/private  pbkdf2iters           350000                 -
tank/enc/private  encryptionroot        tank/enc               -
tank/enc/private  keystatus             available              -
tank/enc/private  special_small_blocks  0                      default

As for the file itself, uhh, not sure how I didn't see this sooner but I get an unusual result from ls:

# ls -lah MAH01447.MP4
ls: MAH01447.MP4: Bad address
-rw-r--r-- 1 putnam putnam 38M Sep  9  2014 MAH01447.MP4

Bad address?

Then trying to get its extended attributes:

# getfattr -d -m - MAH01447.MP4 
getfattr: MAH01447.MP4: Bad address

OK, so clearly the xattr is the issue:

# cp --preserve=all MAH01447.MP4 /tmp/
cp: listing attributes of 'MAH01447.MP4': Bad address

The file itself is not corrupt. It's a video of a porcupine on my fence. It plays fine off a network share. I can also see its contents with xxd.

Stat is OK:

$ stat MAH01447.MP4 
  File: MAH01447.MP4
  Size: 38809950    Blocks: 75941      IO Block: 131072 regular file
Device: 34h/52d Inode: 269839      Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  putnam)   Gid: ( 1000/  putnam)
Access: 2015-01-25 06:21:31.505818265 -0600
Modify: 2014-09-09 17:49:33.000000000 -0500
Change: 2015-03-08 13:09:01.539228814 -0500
 Birth: 2015-01-25 06:21:31.505818265 -0600

As you can see, it hasn't been touched in a long time. It is plausible that an extended attribute got set via Samba or Finder, though.

putnam commented 2 years ago

There is a second file in the same folder with the same error:

ls: DSC01446.ARW: Bad address
ls: MAH01447.MP4: Bad address
total 616299
# ls -li DSC01446.ARW 
ls: DSC01446.ARW: Bad address
457084 -rw-rw---- 1 putnam putnam 37322752 Sep  9  2014 DSC01446.ARW

So object 457084 would presumably also have failed during the send, if it had gotten that far.

putnam commented 2 years ago

I'm not sure here, but could this be related to #2700 somehow? The parent directory does have the setgid bit set (g+s). I use this all over the place without any issue, typically.

Also, note that I was able to zfs send this dataset multiple times -- once from the original unencrypted pool to my offsite, which was encrypted, then again back to the original, also encrypted (but not a raw send).

gamanakis commented 2 years ago

Thanks for the input! That file has corrupt xattrs, which are stored in spill blocks when the zfs property dnodesize=legacy (the default) is set (I think the manual recommends setting this to auto, but it's not the default right now). The corruption here manifests as dnp->dn_bonuslen and dnp->dn_flags contradicting each other.

I am not totally surprised this was not reported for non-raw sends: looking through the code, I didn't spot any assertions/checks validating the relationship between dnp->dn_bonuslen and dnp->dn_flags (I might have overlooked some parts, though).

Regarding #2700, it was resolved by https://github.com/openzfs/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd. Initially I thought the test case described in that commit was the cause of this. @putnam Have that pool and those files been around since before zfs-0.6.4, the first release that includes https://github.com/openzfs/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd? Also, what do zpool scrub and zpool status report?

gamanakis commented 2 years ago

@behlendorf would you mind having a look at this?

This issue may manifest as silent corruption of xattrs when dnodesize=legacy, because dnp->dn_bonuslen and dnp->dn_flags contradict each other (https://github.com/openzfs/zfs/issues/12720#issuecomment-1019353017). I was thinking about adding an assert at the end of dnode_sync(). It only becomes apparent when doing an encrypted raw send.

putnam commented 2 years ago

The files are at least as old as their birth/mtime, and I've been using ZFS for longer. With 0.6.4 being a 2015 release, it would make sense that these files hit the pool earlier than that. I have generally always used zfs send, not rsync, for migrations. Looking at my eternal bash history, I see these folders existing and being manipulated in 2014 on ZFS. So yeah, the ZFS "provenance" would go back earlier than 0.6.4.

Out of curiosity, I kicked off a regular (non-raw) send and, as you suggested just now, it did not hang up on those files.

Odd that this check is only in the raw flow, since this kind of corruption wouldn't only affect raw sends, right? Makes me wonder whether that code should be duplicated the way it is rather than using a shared check function. I'm not well versed in the internal architecture, so I don't want to make dumb assumptions, though.

Scrub does not report any issues. I scrub monthly and act on anything it finds; zpool status is clean as well.

putnam commented 2 years ago

Since the xattrs are really not important on these files, and it sounds like this was a bug fixed way back in 0.6.4, I think simply copying the files to rewrite them may be the solution here.
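
Something like this, probably (a plain cp deliberately leaves the broken xattrs behind; touch -r carries the mtime over):

cp MAH01447.MP4 MAH01447.MP4.new
touch -r MAH01447.MP4 MAH01447.MP4.new
mv MAH01447.MP4.new MAH01447.MP4
# and the same for DSC01446.ARW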

I used this to try to track down any straggler files and it really is just these two:

$ find . -type f -print0 | xargs -0 ls -l > /dev/null
ls: ./Camera/2014/2014.7-2014.12/Unsorted/DSC01446.ARW: Bad address
ls: ./Camera/2014/2014.7-2014.12/Unsorted/MAH01447.MP4: Bad address

No telling what event triggered the SA corruption, but it probably was something Finder or Samba did. Adjacent files have only a DOSATTRIB xattr, which Samba would have written at some point or another, though that value is decently long...

So, I will go ahead and rewrite these files to wipe the xattr corruption away, unless you want me to keep this repro alive for a bit longer. It does sound useful to add those checks to scrub/non-raw send though!

As a side note: what is the penalty for not setting dnodesize=auto? If it is left on legacy, is there some limitation? I don't anticipate needing to import this pool on something that can't handle auto, but just want to better understand the utility. The man page hints that the benefit is performance-related but doesn't go into detail.

gamanakis commented 2 years ago

The man page hints that the benefit is performance-related but doesn't go into detail.

Enabling large dnodes essentially obviates the need for spill blocks. That means metadata can be read in fewer requests, instead of issuing additional read requests for the spill blocks.
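
If you do switch it, keep in mind the property only takes effect for newly created files, e.g.:

zfs set dnodesize=auto tank/enc/private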

I was thinking about adding an assert in dnode_sync() to prevent new dnodes with inconsistent dn_bonuslen and dn_flags from being written, and also a check in dump_dnode() that would error out if this occurs during either a raw or a non-raw send.

putnam commented 2 years ago

One issue: wouldn't these objects taint snapshots, too? So I will lose my snapshot history. Is there some way to extricate the corrupt files from snapshots? Or any suggestion you may have that might preserve snapshots while fixing the issue?

Edit: One thought -- replace the files, snapshot it, clone that to a new fs, and begin using that one going forward. Keep the old fs readonly until the snaps are no longer needed. I think this solution will avoid the duplication issue of making a new fs and rsync'ing into it. There is a possible advantage to migrating via rsync in that setting dnodesize=auto will apply to all the new writes and may result in a performance improvement.
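
Roughly (names illustrative):

# after rewriting the two affected files in tank/enc/private
zfs snapshot tank/enc/private@post-fix
zfs clone tank/enc/private@post-fix tank/enc/private_new
zfs set readonly=on tank/enc/private   # keep the old fs and its snapshots until they age out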

gamanakis commented 2 years ago

@putnam I agree with you. We cannot really fix the snapshots/filesystems affected by this, since we don't know which of the two, dn_bonuslen or dn_flags, is correct at this point.

As a side-note: there may be a bug in the dnode byteswapping code, I filed #13002 for that.

putnam commented 2 years ago

Great work, @gamanakis! Thank you for checking into this so quickly and thoroughly!