putnam closed this issue 2 years ago.
I'm actually not sure what to do here. I can't send a raw backup, full stop. And if I send it non-raw, I can't switch the incrementals to be raw because the IVs won't be the same. What am I meant to do in this situation? I will try running 2.1.0 in a VM or something on the target host to see if it makes any difference.
In theory, that message isn't supposed to happen unless you're sending from a newer version than being received on, and somehow it didn't notice a missing feature required until later, at which point the "fix" is pretty obvious.
In practice, that message comes out for all sorts of edge cases, and could probably stand to be refined, though I'd personally just suggest changing it to say something like "Either you need to upgrade your receiver to a newer version or, if they're not different (or the receiver is newer), you've hit an edge case we aren't handling, please report a bug."
Would you possibly be willing to share a full (i.e. not solely the resume portion, though that might be helpful in addition to the whole) raw (i.e. encrypted data blocks, though not some metadata) send stream with someone to debug? I'm probably not an ideal candidate, not being familiar with the over-the-wire format of send/recv or all the places encrypted recv touches, but it'd probably be very helpful if someone could repro it locally and iterate on whatever's going awry rather than having to round-trip to you for testing each time. (It's perfectly okay to say no if you're not comfortable; it would just be useful to know.)
I suspect it'll just say "lol it's encrypted go pound sand", from recollection, but what do zdb -dbdbdbdbdbdb tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 269839 and [...] 269838 have to say?
You said the other system is on bullseye with 2.0.6; is that from backports? Could you give us the full output of zfs version on both hosts, just for completeness?
Also just for completeness, could you try receiving the stream locally somewhere on the sender to see if it errors the same way?
One final remark - while you're correct that you couldn't go from non-raw to raw send, you could still do non-raw send and enable encryption on the receiver like with any non-raw send being received. This of course doesn't work if you don't trust the receiver to have access to the encryption key, but that may or may not be an issue in your environment.
Thanks for your reply, @rincebrain
I'll try to address everything in order.
Sharing a full send is going to be hard. It's 1.3TB, and it's a dataset I'd prefer not to directly share anyway. But I realize that it may only repro on this exact snapshot on this exact dataset. I don't mind the back-and-forth, though.
Attached is the output of zdb for each of those objects as requested: zdb_269838.txt zdb_269839.txt
The target system is on 2.0.6 via backports.
Source zfs version:
zfs-2.0.6-1
zfs-kmod-2.0.6-1
Target zfs version:
zfs-2.0.6-1~bpo11+1
zfs-kmod-2.0.6-1~bpo11+1
I just tried sending the stream locally from source to a new dataset with the same error at the same position:
# zfs send -Lcw tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 | mbuffer | zfs recv -u tank/privatetest
in @ 128 MiB/s, out @ 0.0 kiB/s, 360 GiB total, buffer 10% full
cannot receive new filesystem stream: invalid backup stream
As a side note, the performance of zfs send here has been relatively poor. Before I switched to encryption I could zfs send at a rate of about 1-1.5 GiB/second. Unlike sequential scrubs, zfs send seems to be IOP bound. I'm not sure what makes the encrypted version so much slower. The pool topology on the source machine is 5x11 raidz2. Target is 3x12 raidz2. All are spinning disks between 12T-18T in size.
Regarding the ability to still send a backup unencrypted, it is indeed the case that I want raw sends to work since they'll be going to an untrusted offsite system.
Okay, I just figured it'd be worth asking. (Though it'd also be interesting to see if you could create a clone with only the minimal data necessary to reproduce, possibly by doing a zfs clone of the snapshot, finding the specific files that the object IDs I asked you for zdb data on map to, and nuking everything but them and their containing dirs. You could find the specific files with find /path/to/mounted/snapshot/or/clone -type f -inum [the number in question], though this may take a bit. It's also likely to be error-prone, so mayhaps: clone, snapshot the clone, verify the clone snapshot still errors on send in the same place, then nuke everything but the unmodified directories containing those objects, snapshot that and make sure the snapshot still reproduces, and so on...)
I don't know of any performance bottlenecks specific to only raw send - I'd be curious to check out a flamegraph of send -Lec versus send -w on the same dataset over, say, sleep 120 - it might be IO bound, but I'd like to see if there's any visibly different callstacks that turn up...
RAIDZs aren't particularly made of IOPs in general - what makes you think it's IOP bound and not CPU or other bound? (If it's "it's in iowait a bunch", that could mean many things.)
Yeah, I'm not surprised that unencrypted doesn't work for you, just thought I'd ask. :/
Good idea on the clone strategy. Worked a charm, repros now with a 73M dataset.
Here are my notes as I did this, just for reference:
# Created clone
zfs clone tank/enc/private@zfs-auto-snap_monthly-2020-07-01-1152 tank/privatetest
# Initial clone snapshot (no modifications)
zfs snapshot tank/privatetest@clonesnap
# Sending full clone triggers same error at same position
zfs send -Lcw tank/privatetest@clonesnap |mbuffer|zfs recv -u tank/privatetest_receive
in @ 128 MiB/s, out @ 0.0 kiB/s, 360 GiB total, buffer 21% full
cannot receive new filesystem stream: invalid backup stream
# Whittling down filesystem to only affected inodes
find /tank/privatetest -type f -inum 269838 -or -inum 269839
# deleted all other folders, size is now 285M
zfs snapshot tank/privatetest@afterdeletion
# tried send again, failed again (faster this time)
zfs send -Lcw tank/privatetest@afterdeletion | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream
# deleted sibling files to minimize size, snapped, tested again (73M)
zfs snapshot tank/privatetest@afterdeletion2
zfs send -Lcw tank/privatetest@afterdeletion2 | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream
Here is a dropbox link of the zfs send: https://www.dropbox.com/s/p7yykx47syl7jm5/zfs_send_culled_private.bin.zip?dl=0
You're right that I sort of assumed it was IOP-bound simply because of frequent iowait during zfs send. CPU usage is only 15-20% of one core. I will definitely dig into this separately once I get past the send issue.
To add a little more context if it matters, here is this dataset's zfs send/recv "provenance". Originally it was unencrypted on a single host running normally. When I prepared the offsite, I wanted to also take this opportunity to transfer the data back-and-forth to get encryption on both hosts and also rebalance data across vdevs.
The unencrypted dataset was sent to the backup machine into an encrypted dataset, then sent back into the primary pool so it would be encrypted. I did not use a raw send here, though, so for raw sends to work on the offsite I needed to once again wipe the offsite pool and re-send the dataset with the raw flag.
It would have been much easier to use a raw send, but there is a long-standing bug that directly affected my workflow here. Once I sent the dataset to the offsite, I needed to briefly wipe the source pool and send everything back, but I kept the source system online by mounting the offsite over NFS. This resulted in the dataset changing and needing to be sent back, which is affected by that bug. I confirmed it with a dry run before I started this journey and didn't want to take a risk. So I decided to do an extra transfer once everything was situated on the primary host, so the raw sends could start clean and never risk being modified by the offsite and sent back.
tl;dr:
The most recent zfs send to work fine was #3. The affected files are not new at all. Old stuff.
Just to be clear, did you send just this dataset, or was it part of a hierarchy of datasets being sent around?
You could always try cherrypicking/un-reverting d1d47691c256cd7e15af7819ea33024e6693e9c5 and seeing if it helps your life.
Only this dataset in this case. Not a hierarchy.
I'm a little surprised -- is nobody working on bugs on raw sends? I can reliably reproduce this and provided an easy repro case. I'd definitely like to dig in myself but it's a lot to spin up. Maybe I hit some edge case as I'm surprised nobody else seems to have run into this. As it stands I'm just waiting until 2.1.1 lands in Debian (the packager finally uploaded to experimental a week ago) to try to test further.
Not quite nobody, no. Just almost.
I've failed to convince anybody else so far to work on these, but I'm trying to knock them out one at a time, unfortunately slowly.
I believe the root cause(s) for some of the bugs that work similarly to this one are understood, just no fix written and remaining committed yet.
As far as how this situation arose: it seems it works well for the few big companies that use it, and there haven't been many people in the intersection of {willing to try fixing, uses native encryption} - even I don't meet those criteria; I don't use it normally.
I'm displeased by this, but I'm already doing all I can about it.
I'm a little surprised -- is nobody working on bugs on raw sends?
For what it's worth, this feature has been broken since it landed and has received no love from the dev team. Datto contributed it but doesn't wish to contribute to its maintenance. This is one reason I'm skeptical of all the new features (dRAID, object storage): they're running the BTRFS playbook, adding feature after feature without validating the existing functionality, all on top of a major reorganisation of code as platforms consolidate on OpenZFS.
I honestly don't blame anyone who doesn't want to upgrade from 0.8.x.
I have updated to 2.1.1 after it landed in Debian unstable.
The send stream generated by 2.1.1 is identical to the send stream generated by 2.0.6. The operation fails in the same way:
$ cat /sys/module/zfs/version
2.1.1-3
$ zfs send -Lcw tank/privatetest@afterdeletion2 > zfs_send_culled_private_2.1.1.bin
$ diff -s zfs_send_culled_private.bin zfs_send_culled_private_2.1.1.bin
Files zfs_send_culled_private.bin and zfs_send_culled_private_2.1.1.bin are identical
# (this is the equivalent of downloading my repro and catting it to zfs recv)
$ zfs send -Lcw tank/privatetest@afterdeletion2 | zfs recv -u tank/privatetest_receive
cannot receive new filesystem stream: invalid backup stream
@putnam do you have a reproducer for this one?
Yes, please see this comment above: https://github.com/openzfs/zfs/issues/12720#issuecomment-964677615
You can grab the ~70MB repro via the Dropbox link at the bottom. Try receiving that into a new dataset.
case EINVAL:
	if (flags->resumable) {
		zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
		    "kernel modules must be upgraded to "
		    "receive this stream."));
	} else if (embedded && !raw) {
		zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
		    "incompatible embedded data stream "
		    "feature with encrypted receive."));
	}
	(void) zfs_error(hdl, EZFS_BADSTREAM, errbuf);
	break;
It's the above part of zfs_receive_one() that is producing the error. The EINVAL comes from receive_object():
if (rwa->raw) {
	/*
	 * We should have received a DRR_OBJECT_RANGE record
	 * containing this block and stored it in rwa.
	 */
	if (drro->drr_object < rwa->or_firstobj ||
	    drro->drr_object >= rwa->or_firstobj + rwa->or_numslots ||
	    drro->drr_raw_bonuslen < drro->drr_bonuslen ||
	    drro->drr_indblkshift > SPA_MAXBLOCKSHIFT ||
	    drro->drr_nlevels > DN_MAX_LEVELS ||
	    drro->drr_nblkptr > DN_MAX_NBLKPTR ||
	    DN_SLOTS_TO_BONUSLEN(dn_slots) <
	    drro->drr_raw_bonuslen) {
		return (SET_ERROR(EINVAL));
Now there are a bunch of possibilities here, let's see which one is failing.
drro->drr_raw_bonuslen < drro->drr_bonuslen is failing, leading to the error. Let me see if I can figure out what is going on.
The drr_raw_bonuslen is set in dump_dnode():
/*
 * Since we encrypt the entire bonus area, the (raw) part
 * beyond the bonuslen is actually nonzero, so we need
 * to send it.
 */
if (bonuslen != 0) {
	drro->drr_raw_bonuslen = DN_MAX_BONUS_LEN(dnp);
	bonuslen = drro->drr_raw_bonuslen;
}
These parts of dmu_recv.c and dmu_send.c, all related to raw_bonuslen, are part of commit b525630342.
Inspecting the raw dataset from @putnam with cat ~/zfs_send_culled_private.bin | zstreamdump -vvv | grep bonuslen | less gives:
OBJECT object = 1 type = 21 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 2 type = 45 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 3 type = 22 bonustype = 0 blksz = 1536 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 4 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 5 type = 46 bonustype = 0 blksz = 1536 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 6 type = 47 bonustype = 0 blksz = 16384 bonuslen = 0 dn_slots = 1 raw_bonuslen = 0 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 1 nblkptr = 3
OBJECT object = 7 type = 20 bonustype = 44 blksz = 16384 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 269838 type = 19 bonustype = 44 blksz = 131072 bonuslen = 192 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 281 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 327985 type = 20 bonustype = 44 blksz = 16384 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 1 indblkshift = 17 nlevels = 2 nblkptr = 1
OBJECT object = 331405 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 389480 type = 20 bonustype = 44 blksz = 11776 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 464994 type = 20 bonustype = 44 blksz = 8192 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 464996 type = 20 bonustype = 44 blksz = 1536 bonuslen = 288 dn_slots = 1 raw_bonuslen = 320 flags = 0 maxblkid = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
Which shows the problematic one:
OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1
The drr_raw_bonuslen is calculated in dnode.h:
#define DN_MAX_BONUS_LEN(dnp) \
	((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) ? \
	(uint8_t *)DN_SPILL_BLKPTR(dnp) - (uint8_t *)DN_BONUS(dnp) : \
	(uint8_t *)(dnp + (dnp->dn_extra_slots + 1)) - (uint8_t *)DN_BONUS(dnp))
and since we know that the flag SPILL_BLKPTR is set (see the zdb dump in https://github.com/openzfs/zfs/issues/12720#issuecomment-963927644), we have:
(uint8_t *)DN_SPILL_BLKPTR(dnp) - (uint8_t *)DN_BONUS(dnp)
Right now, I cannot see how drro->drr_raw_bonuslen < drro->drr_bonuslen could happen, since the raw bonuslen is always set to DN_MAX_BONUS_LEN.
@putnam If you would absolutely like to try to retrieve the data using raw receiving, you could try the following patch (with every precaution; with it, I was able to raw receive that dataset into a new pool):
diff --git a/module/zfs/dmu_recv.c b/module/zfs/dmu_recv.c
index f132219c9..5349e51d5 100644
--- a/module/zfs/dmu_recv.c
+++ b/module/zfs/dmu_recv.c
@@ -1592,7 +1592,6 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
*/
if (drro->drr_object < rwa->or_firstobj ||
drro->drr_object >= rwa->or_firstobj + rwa->or_numslots ||
- drro->drr_raw_bonuslen < drro->drr_bonuslen ||
drro->drr_indblkshift > SPA_MAXBLOCKSHIFT ||
drro->drr_nlevels > DN_MAX_LEVELS ||
drro->drr_nblkptr > DN_MAX_NBLKPTR ||
So, we have a dnode:
OBJECT object = 269839 type = 19 bonustype = 44 blksz = 131072 bonuslen = 288 dn_slots = 1 raw_bonuslen = 192 flags = 4 maxblkid = 296 indblkshift = 17 nlevels = 2 nblkptr = 1
with 1 slot, meaning that it is 512 bytes, leaving 448 bytes of space after the 64-byte core (512-64). It also has 1 blkptr = 128 bytes and, because it has SPILL_BLKPTR set, a spill pointer = 128 bytes. Which means that the bonuslen should be 192, not 288 as stored in the zfs send stream.
A bonuslen of 288 would mean: 288 bonus + 64 core + 128 (1 blkptr) = 480 bytes, leaving only 512-480 = 32 bytes for a spill pointer whose size is 128 bytes, making this impossible.
I think this means that probably something went wrong with the setting/clearing of DNODE_FLAG_SPILL_BLKPTR or with the setting of the dnode's bonuslen. In a 512-byte dnode we cannot have a bonuslen of 288 bytes when the SPILL_BLKPTR flag is set.
I just came across commit caf9dd209fdcfccabc2f32b3f23c5386ccfb896c which may be highly relevant. @putnam what is your zfs parameter "zfs_send_unmodified_spill_blocks" set to?
EDIT: this commit and zfs parameter don't apply to raw sends.
Thank you for digging into this!
The param is set to "1" for me. AFAIK this is the default(?)
Yes, that is the default but doesn't affect raw send...
Do you have xattr=sa set for that dataset? Also, does that particular file have anything peculiar in terms of system attributes, like symlinks, dates or any other attributes?
The current ZTS has a test, tests/zfs-tests/tests/functional/rsend/send_spill_block.ksh, which adds xattrs to files to fill spill blocks, sends a snapshot, removes the xattrs, sends an incremental snapshot and verifies all went well. I modified that test to use encryption and raw sends, and it still works as predicted.
I will probably submit a PR with the modified test using raw sends, in case something breaks in the future.
However, the problem in the current issue is that somehow dnp->dn_bonuslen and dnp->dn_flags, stored on disk, contradict each other.
To answer your questions about the dataset, here is zfs get all on the set in question. xattr=sa is indeed set.
NAME PROPERTY VALUE SOURCE
tank/enc/private type filesystem -
tank/enc/private creation Thu Oct 28 5:45 2021 -
tank/enc/private used 1.43T -
tank/enc/private available 250T -
tank/enc/private referenced 1.23T -
tank/enc/private compressratio 1.02x -
tank/enc/private mounted yes -
tank/enc/private quota none default
tank/enc/private reservation none default
tank/enc/private recordsize 1M inherited from tank
tank/enc/private mountpoint /tank/private local
tank/enc/private sharenfs off default
tank/enc/private checksum on default
tank/enc/private compression lz4 inherited from tank
tank/enc/private atime off inherited from tank
tank/enc/private devices on default
tank/enc/private exec on default
tank/enc/private setuid on default
tank/enc/private readonly off local
tank/enc/private zoned off default
tank/enc/private snapdir hidden default
tank/enc/private aclmode discard default
tank/enc/private aclinherit restricted default
tank/enc/private createtxg 2734 -
tank/enc/private canmount on default
tank/enc/private xattr sa inherited from tank
tank/enc/private copies 1 default
tank/enc/private version 5 -
tank/enc/private utf8only off -
tank/enc/private normalization none -
tank/enc/private casesensitivity sensitive -
tank/enc/private vscan off default
tank/enc/private nbmand off default
tank/enc/private sharesmb off default
tank/enc/private refquota none default
tank/enc/private refreservation none default
tank/enc/private guid 16023833233617790934 -
tank/enc/private primarycache all default
tank/enc/private secondarycache all default
tank/enc/private usedbysnapshots 203G -
tank/enc/private usedbydataset 1.23T -
tank/enc/private usedbychildren 0B -
tank/enc/private usedbyrefreservation 0B -
tank/enc/private logbias latency default
tank/enc/private objsetid 2362 -
tank/enc/private dedup off default
tank/enc/private mlslabel none default
tank/enc/private sync standard default
tank/enc/private dnodesize legacy default
tank/enc/private refcompressratio 1.01x -
tank/enc/private written 0 -
tank/enc/private logicalused 1.48T -
tank/enc/private logicalreferenced 1.25T -
tank/enc/private volmode default default
tank/enc/private filesystem_limit none default
tank/enc/private snapshot_limit none default
tank/enc/private filesystem_count none default
tank/enc/private snapshot_count none default
tank/enc/private snapdev hidden default
tank/enc/private acltype off default
tank/enc/private context none default
tank/enc/private fscontext none default
tank/enc/private defcontext none default
tank/enc/private rootcontext none default
tank/enc/private relatime off default
tank/enc/private redundant_metadata all default
tank/enc/private overlay on default
tank/enc/private encryption aes-256-gcm -
tank/enc/private keylocation none default
tank/enc/private keyformat passphrase -
tank/enc/private pbkdf2iters 350000 -
tank/enc/private encryptionroot tank/enc -
tank/enc/private keystatus available -
tank/enc/private special_small_blocks 0 default
As for the file itself, uhh, not sure how I didn't see this sooner but I get an unusual result from ls:
# ls -lah MAH01447.MP4
ls: MAH01447.MP4: Bad address
-rw-r--r-- 1 putnam putnam 38M Sep 9 2014 MAH01447.MP4
Bad address?
Then trying to get its extended attributes:
# getfattr -d -m - MAH01447.MP4
getfattr: MAH01447.MP4: Bad address
OK, so clearly the xattr is the issue:
# cp --preserve=all MAH01447.MP4 /tmp/
cp: listing attributes of 'MAH01447.MP4': Bad address
The file itself is not corrupt. It's a video of a porcupine on my fence. It plays fine off a network share. I can also see its contents with xxd.
Stat is OK:
$ stat MAH01447.MP4
File: MAH01447.MP4
Size: 38809950 Blocks: 75941 IO Block: 131072 regular file
Device: 34h/52d Inode: 269839 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ putnam) Gid: ( 1000/ putnam)
Access: 2015-01-25 06:21:31.505818265 -0600
Modify: 2014-09-09 17:49:33.000000000 -0500
Change: 2015-03-08 13:09:01.539228814 -0500
Birth: 2015-01-25 06:21:31.505818265 -0600
As you can see it hasn't been touched in a long time. It is plausible that an extended attribute got set via Samba and Finder though.
There is a second file in the same folder with the same error:
ls: DSC01446.ARW: Bad address
ls: MAH01447.MP4: Bad address
total 616299
# ls -li DSC01446.ARW
ls: DSC01446.ARW: Bad address
457084 -rw-rw---- 1 putnam putnam 37322752 Sep 9 2014 DSC01446.ARW
So object 457084 may also fail during the send, if it had made it that far.
I'm not sure here but could this be related to #2700 somehow? The parent directory does have the group sticky bit set (g+s). I use this all over the place without any issue typically.
Also, note that I was able to zfs send this dataset multiple times -- once from the original unencrypted pool to my offsite, which was encrypted, then again back to the original, also encrypted (but not a raw send).
Thanks for the input! That file has corrupt xattrs, which are stored in spill blocks when the zfs property dnodesize=legacy (the default) is set (I think the manual recommends setting this to auto, but it's not the default right now). The corruption here manifests as dnp->dn_bonuslen and dnp->dn_flags contradicting each other.
I am not totally surprised this was not reported by a non-raw send: looking through the code, I didn't spot any assertions/checks validating the relationship between dnp->dn_bonuslen and dnp->dn_flags (I might have overlooked some parts, though).
Regarding #2700, it was resolved by https://github.com/openzfs/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd. Initially I thought the test case described in https://github.com/openzfs/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd was the cause of this.
@putnam Do you have that pool/files around from before zfs-0.6.4, where https://github.com/openzfs/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd is included? Also, what do zpool scrub or zpool status report?
@behlendorf would you mind having a look at this?
This issue may manifest as silent corruption of xattrs when dnodesize=legacy, because dnp->dn_bonuslen and dnp->dn_flags contradict each other (https://github.com/openzfs/zfs/issues/12720#issuecomment-1019353017). I was thinking about adding an assert at the end of dnode_sync(). The corruption only becomes apparent when doing an encrypted raw send.
The files are at least as old as their birth/mtime, and I've been using ZFS for longer. With 0.6.4 releasing in August 2015 it would make sense these files hit the pool earlier than that. I have generally always used zfs send, not rsync, for migrations. Looking at my eternal bash history I see these folders existing and being manipulated in 2014 on ZFS. So yeah the ZFS "provenance" would go back earlier than 0.6.4.
I kicked off a regular send out of curiosity and it did not hang up on those files, as you suggested just now.
Odd that this check exists only in the raw flow, since the corruption would not affect only raw sends, right? Makes me wonder whether that code should be duplicated the way it is, rather than using a shared check function. I'm not well-versed in the internal architecture, though, so I don't want to make dumb assumptions.
Scrub does not report any issues. I scrub monthly and act on any issues; zpool status is OK as well.
Since the xattr's are really not important on these files, and it sounds like this was a bug fixed way back in 0.6.4, I think simply copying the files to rewrite them may be the solution here.
I used this to try to track down any straggler files and it really is just these two:
$ find . -type f -print0 | xargs -0 ls -l > /dev/null
ls: ./Camera/2014/2014.7-2014.12/Unsorted/DSC01446.ARW: Bad address
ls: ./Camera/2014/2014.7-2014.12/Unsorted/MAH01447.MP4: Bad address
No telling what the event was that may have triggered the sa corruption, but it probably was something Finder or Samba did. Adjacent files had only a DOSATTRIB xattr that would have been written by Samba at some point or another, but it's decently long...
So, I will go ahead and rewrite these files to wipe the xattr corruption away, unless you want me to keep this repro alive for a bit longer. It does sound useful to add those checks to scrub/non-raw send though!
As a side note: what is the penalty for not setting dnodesize=auto? If it is left on legacy, is there some limitation? I don't anticipate needing to import this pool on something that can't handle auto, but just want to better understand the utility. The man page hints that the benefit is performance-related but doesn't go into detail.
Enabling large dnodes essentially obviates the need for spill blocks. That means metadata can be read in fewer requests, as opposed to issuing additional read requests for spill blocks.
I was thinking about adding an assert in dnode_sync() preventing the writing of new dnodes with invalid dn_bonuslen/dn_flags combinations, and also a check in dump_dnode() that would error out if this occurs during non-raw or raw sending.
One issue: wouldn't these objects taint snapshots, too? So I will lose my snapshot history. Is there some way to extricate the corrupt files from snapshots? Or any suggestion you may have that might preserve snapshots while fixing the issue?
Edit: One thought -- replace the files, snapshot it, clone that to a new fs, and begin using that one going forward. Keep the old fs readonly until the snaps are no longer needed. I think this solution will avoid the duplication issue of making a new fs and rsync'ing into it. There is a possible advantage to migrating via rsync in that setting dnodesize=auto will apply to all the new writes and may result in a performance improvement.
@putnam I agree with you. We cannot really fix the snapshots/filesystems affected by this, since we don't know which of the two, dn_bonuslen or dn_flags, is correct at this point.
As a side-note: there may be a bug in the dnode byteswapping code, I filed #13002 for that.
Great work @gamanakis ! Thank you for checking into this so quickly and thoroughly!
System information
Describe the problem you're observing
I am trying to transfer an encrypted dataset to another pool using a raw send. Partially through this send, the receive side fails with an unclear error message.
The receiving system is also on 2.0.6, running Debian Bullseye on kernel 5.10.0-9.
Here are the commands that reproduce the issue reliably:
The dataset is about 1.3TB; at the 360GB mark, the receive reliably fails with the following error:
cannot receive new filesystem stream: invalid backup stream
Oddly, if you provide the -s flag to the recv (enabling resume support), the error changes to:
cannot receive new filesystem stream: kernel modules must be upgraded to receive this stream
I tried sending the stream through zstreamdump, which did not complain. I also maximized debug logging, generated a resume token and tried resuming (right at whatever transaction caused the recv to fail), which generated these messages (gist). It's noisy; see line 1786 for when receive_object failed.
I don't know the exact steps to reproduce this problem with another dataset. Originally this dataset was on another pool in an unencrypted state. It was sent into the encrypted pool just fine, then I wanted to raw-send the newly-encrypted dataset offsite to another pool. It is at this point that I ran into this show-stopper.
See also the gist below, which is the output of zstreamdump -vvv on the resumed send: https://gist.github.com/putnam/3a467fdab6fe96ca048264c1bcd8a4b0
UPDATE: In the discussion below, I was able to shrink the affected dataset down to a ~70MB repro. See this comment for an attachment.