openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Sometimes raw send on encrypted datasets does not work when copying snapshots back #12594

Closed digitalsignalperson closed 2 years ago

digitalsignalperson commented 3 years ago

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version rolling
Kernel Version 5.14.8-arch1-1
Architecture x86_64
OpenZFS Version zfs-2.1.1-1

Describe the problem you're observing

I am able to send raw encrypted snapshots (incremental and replication streams) back and forth between file systems only a limited number of times before getting cannot mount 'rpool/mydataset': Input/output error and errors in zpool status.

I have tried many sequences of sends/receives with raw encrypted snapshots; sometimes I can pass a snapshot back and forth only once, other times more. Below I share two repeatable examples.

This seems like a manifestation of the issue in "Raw send on encrypted datasets does not work when copying snapshots back #10523", which was previously resolved.

Describe how to reproduce the problem

Example 1 - fails on first send back

zfs create rpool/test_000 -o encryption=on -o keyformat=passphrase

# create some data and snapshots
touch /mnt/test_000/1.txt
zfs snapshot rpool/test_000@1
touch /mnt/test_000/2.txt
zfs umount rpool/test_000
zfs snapshot rpool/test_000@2

# send to a new encryption root
zfs send -Rw rpool/test_000@2 | zfs receive -u rpool/test_001

# modify data, snapshot, and send back
zfs mount -l rpool/test_001
touch /mnt/test_001/3.txt
zfs umount rpool/test_001
zfs snapshot rpool/test_001@3
zfs send -i @2 -w rpool/test_001@3 | zfs receive -u rpool/test_000

# try to mount
zfs mount rpool/test_000
# cannot mount 'rpool/test_000': Input/output error

Example 2 - more convoluted, but fails after a few passes back and forth

zfs create rpool/test_002 -o encryption=on -o keyformat=passphrase

# create some data and snapshots
touch /mnt/test_002/1.txt
zfs snapshot rpool/test_002@1
touch /mnt/test_002/2.txt
zfs snapshot rpool/test_002@2
touch /mnt/test_002/3.txt
zfs umount rpool/test_002
zfs snapshot rpool/test_002@3

# send to new encryption root (same steps as Example 1 so far)
zfs send -Rw rpool/test_002@3 | zfs recv -u rpool/test_003

# send to another new encryption root
zfs load-key rpool/test_003
zfs send -Rw rpool/test_003@3 | zfs receive -u rpool/test_004

# modify data, snapshot, and send back
zfs load-key rpool/test_004
zfs mount rpool/test_004
touch /mnt/test_004/4.txt
zfs snapshot rpool/test_004@4
zfs send -w -i @3 rpool/test_004@4 | zfs receive -u rpool/test_003

# try to mount - succeeds where Example 1 failed; the only difference is the extra send in between
zfs mount rpool/test_003
ls /mnt/test_003

# modify data again and send back
touch /mnt/test_003/5.txt
umount rpool/test_003
zfs snapshot rpool/test_003@5
zfs send -w -i @4 rpool/test_003@5 | zfs receive -u rpool/test_004
ls /mnt/test_004/

# modify data and send back
touch /mnt/test_004/6.txt
zfs snapshot rpool/test_004@6
zfs send -w -i @5 rpool/test_004@6 | zfs receive -u rpool/test_003
zfs mount rpool/test_003
# cannot mount 'rpool/test_003': Input/output error

At this point the output of zpool status -v includes

errors: Permanent errors have been detected in the following files:

        rpool/test_000:<0x0>
        rpool/test_003:<0x0>

If I roll back the last snapshots in question, then scrub once

zfs rollback -r rpool/test_000@2
zfs rollback -r rpool/test_003@5
zpool scrub rpool
zpool status -v

status still shows

errors: Permanent errors have been detected in the following files:

        rpool/test_000:<0x0>
        rpool/test_003:<0x0>

but if I scrub a second time

zpool scrub rpool
zpool status -v

I end up with

errors: No known data errors

and if I repeat the last operation in question, I get the same I/O error again.

The steps are repeatable for me. I don't know if every step matters (e.g. the extraneous load-key when I don't mount). I also have some other examples that fail at different points, but I figured these were simple enough to share.

marker5a commented 2 years ago

> Did some further testing with this VM as configured in my last post.
>
>   1. with unpatched zfs-dkms, use the repro script to get to failure to mount dataset and with error in zpool status
>   2. export testpool, sudo pacman -R zfs-dkms, rmmod zfs and rebuild the package with both the patches
>   3. modprobe zfs, import testpool, mount testpool/test-source and testpool/test-dest successfully
>   4. still see zpool status error, but goes away after two scrubs
>
> so using both patches (untested with just the first) in this scenario it seems to recover the data and lets you mount the previously unmountable dataset. Too bad it didn't work for @marker5a's recovery... maybe it's worth trying again based on it working here
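A condensed sketch of that quoted module-swap procedure (assuming an Arch Linux zfs-dkms setup; the package rebuild step itself is elided):

zpool export testpool
sudo pacman -R zfs-dkms
sudo rmmod zfs
# rebuild and install the patched zfs-dkms package here
sudo modprobe zfs
zpool import testpool
zfs mount testpool/test-source
zfs mount testpool/test-dest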

First off, happy new year! I was away and unable to work on this too much, but did make some progress. Before going into that, though: I'm using vagrant, but I've sort of cobbled a mess together on the VM side, so I need to clean it up before I can provide it as my test setup.

With that said, I will definitely give what you posted a run. From my testing, it seems that I suffer from a slightly different problem. I checked out the master zfs branch, ran the earlier scripts you and @putnam provided, and was able to apply a patch that effectively ignored the HMAC check (or at least I'm pretty sure it did), which let me successfully mount the src and dest in the testpool that was created.

With excitement, I switched over to trying the patched version on a copy of my corrupted system, and I'm still getting the I/O error. I did start adding more dbgmsg output in my branch of the code, and the failure does seem to stem from the crypto routines returning CRYPTO_INVALID_MAC. I'm now trying to pinpoint a way to bypass that in other spots of the code... again, the first goal is to recover the data; the second goal will be to try and "fix" the issue.

In any case, I'll try your vagrant config and see if that resolves the I/O issue for short-term recovery.

Thanks!

psy0rz commented 2 years ago

Are there not unit tests? Why wouldn't there be a test for this case?

People trust ZFS and expect features in official releases to work. I am now subbed to dozens of showstopper bugs where the consensus is something like "well if a company cares enough maybe it'll be fixed oh well" -- why did these features get merged and released without adequate testing?

I'm also pretty stunned, tbh. I expected more from the ZFS team. There ARE regression tests after all, and I think OpenZFS design should be test-driven by default, since it's so complicated.

This changes my feeling from "OpenZFS is rock-stable" to "don't use some features if you don't want to lose your data".

Edit: now that I think about it, there probably are tests, but this is an issue that is only triggered under certain circumstances (which couldn't be predicted when writing the tests). Many of these issues seem to be edge cases or weird interactions.

digitalsignalperson commented 2 years ago

> Why wouldn't there be a test for this case?

Maybe we can take some ideas from the reproducing scripts in this thread to include as a new unit test in the next PR.

@marker5a You could cherry-pick the commit here: gamanakis@c379a3c on top of zfs-2.1.0, zfs-2.1.1, or zfs-2.1.2.

That should resolve your problem. That commit just introduces a flag that marks the useraccounting metadata as invalid when it is received, which forces its recalculation upon first mount of the received dataset and avoids the error otherwise encountered.

@gamanakis any thoughts on submitting your patch? In my testing it resolved this issue
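For anyone who wants to try it, a rough sketch of that cherry-pick on top of a release tag (the branch to fetch and the build/install steps are assumptions and depend on your setup):

git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.1.2                        # or zfs-2.1.0 / zfs-2.1.1
git remote add gamanakis https://github.com/gamanakis/zfs.git
git fetch gamanakis                           # fetch the branch containing c379a3c
git cherry-pick c379a3c                       # the flag/recalculation commit mentioned above
sh autogen.sh && ./configure && make -j"$(nproc)"
sudo make install                             # or build your distro's zfs package instead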

gamanakis commented 2 years ago

@digitalsignalperson I tried to resolve/merge this in #11300. The preferred solution would be to actually reset the responsible dnodes and not rely on a flag, but I failed to make it work. If time permits I will take another look.

marker5a commented 2 years ago

> Why wouldn't there be a test for this case?
>
> maybe we can take some ideas from the reproducing scripts in this thread to include as a new unit test in the next PR
>
> @marker5a You could cherry-pick the commit here: gamanakis@c379a3c on top of zfs-2.1.0, zfs-2.1.1, or zfs-2.1.2. That should resolve your problem. That commit just introduces a flag that marks the useraccounting metadata as invalid when being received, and so it forces their recalculation upon first mounting of the received dataset and avoids the error encountered otherwise.
>
> @gamanakis any thoughts on submitting your patch? In my testing it resolved this issue

@digitalsignalperson I ran your test sequence on my pool just as one last sanity check and it's still giving I/O errors. That's the bad news. The good news (at least for this issue) is that my failure may be related to a tangential issue/PR (#12770) that I could easily have triggered on an earlier version of the code.

This would explain why the tests w/ the patched code yield different results than my messed-up dataset. It would also possibly explain why I'm not seeing this on another machine w/ a similar backup strategy, which exhibits zero issues. The troubled datasets on my end all stemmed from migrating to a backup pool in order to reconstruct the original pool (reconfigure disks, vdevs, etc.)... and in that process I could very well have been using an ashift of 9 on the original pool and switched to 12. Unfortunately, I don't know what ashift was set on the original pool since it's gone now, but that is a theory.

I created a test script w/ the ashift changes to see if the debug info follows a similar code path to what is shown on my failing dataset... if that's the case, my test case would no longer be relevant here... and from the looks of it, if I did suffer from #12770, it seems like my data is as good as trash :(

psy0rz commented 2 years ago

Because we can't always trust ZFS 100%, I'm trying to create an intelligent zfs-compare tool that will compare the latest common snapshots in two pools. It will shasum the actual zvols and files instead of relying on ZFS. It will also transfer a remote dataset that's encrypted and only has a local key, so that the encryption key isn't needed remotely. Does anyone else want this tool?
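The core of the idea, as a minimal sketch for plain files (dataset names and mountpoints are hypothetical; zvols would instead be checksummed via their /dev/zvol nodes):

# newest common snapshot name on both datasets (lexicographically last here)
SNAP=$(comm -12 <(zfs list -H -d 1 -o name -t snapshot pool1/data | sed 's/.*@//' | sort) \
                <(zfs list -H -d 1 -o name -t snapshot pool2/data | sed 's/.*@//' | sort) | tail -n 1)
# checksum every file in that snapshot on both sides and diff the results
diff <(cd /pool1/data/.zfs/snapshot/"$SNAP" && find . -type f -print0 | sort -z | xargs -0 sha256sum) \
     <(cd /pool2/data/.zfs/snapshot/"$SNAP" && find . -type f -print0 | sort -z | xargs -0 sha256sum)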

digitalsignalperson commented 2 years ago

@psy0rz I'd be curious to see how it works.

Blackclaws commented 2 years ago

Seems to be the same issue as https://github.com/openzfs/zfs/issues/10523. Might it make sense to somehow merge all those issues into one and get rid of some of the duplicates? There appear to be a number of issues open on this subject, and still no fix after more than a year?

gamanakis commented 2 years ago

> What's the reason not to try and get that approach merged in general?

@rincebrain the person who wrote the encryption (@tcaputi) suggested in PR #11300 that, instead of introducing the new flag, we should zero out the dnodes holding the user accounting on the receiving side.

However, in the absence of a loaded key I fail to see how this is possible: if those dnodes are freed on the receiving side, it starts searching for the key. I have pinged Tom again in this regard.

Blackclaws commented 2 years ago

@gamanakis Looking at the pull request, there are a couple of things that apparently break when you do this. Wouldn't one solution be to just flag the raw-received dataset so that this is done on first mount, when the key has already been loaded? That way you don't run into missing-key issues and avoid a failing mount at the same time.

It seems from a cursory glance at the issue and pull request that the data is all there and it's just some metadata that is corrupted. I guess a perfect solution would be to also send the metadata correctly; if that isn't possible, flagging it for cleanup on next mount seems to be the next most sensible thing to do.

gamanakis commented 2 years ago

> to just flag the raw received dataset so that this is done on first mount when the key has already been loaded?

Right, that was my initial approach (draft gamanakis@c379a3c).

> a perfect solution would be to also correctly send the metadata

This was the latter approach, which I have had trouble implementing.

Blackclaws commented 2 years ago

I think this sort of flag is needed in any case, and it could be cleared by checking whether the metadata is valid once future releases can send/receive valid metadata.

The problem is that, looking at forwards/backwards compatibility, there are already versions with raw send out there that will send wrong metadata. This will have to be handled anyway, unless you want a major version break for it. So preemptively flagging the received datasets and then deciding on first mount whether the received data was good or needs to be fixed is, I think, the best way forward.

gamanakis commented 2 years ago

There is not a problem with older versions. The flags are stored in a uint64_t, and the placeholder for the new flag defaults to 0, i.e. it is inactive by default.

Blackclaws commented 2 years ago

I think I wasn't clear in what I meant. I meant that while sending correct metadata is certainly the better long-term solution, it's insufficient for solving the issue when sending from an old ZFS version to a fixed one. Maybe I also didn't understand whether sending or receiving was actually the culprit here.

In the case where the sending side has to be upgraded to fix the metadata issue, we need to add the dirty flag anyway so that we can check a received dataset, because we can't reliably know whether it's from a new enough version, right? I'm unfortunately not very knowledgeable about ZFS internals.

gamanakis commented 2 years ago

I misunderstood what you said; you are right, I think. I will do some cleanup and open a new PR introducing the flag.

rincebrain commented 2 years ago

I wonder how pathological it could be to add a case for "if I fail to decrypt specifically the userobj_accounting metadata, just throw it out", possibly guarded by a tunable. (That, or a zhack command to go forcibly set the "clear on next open" flag.)

It'd be a shame for people to have to throw out pre-patch recvs.

gamanakis commented 2 years ago

I suspect PR #12981 (update of #11300) resolves this. Anyone interested, feel free to try it out.
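One way to test it, as a sketch (only the PR number comes from this thread; GitHub's pull/<N>/head refs and the usual autotools build are assumptions):

git clone https://github.com/openzfs/zfs.git && cd zfs
git fetch origin pull/12981/head:pr-12981
git checkout pr-12981
# then build and install as usual for your setup, e.g.:
sh autogen.sh && ./configure && make -j"$(nproc)"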

gamanakis commented 2 years ago

Both examples (1 and 2) in the OP from @digitalsignalperson and the reproducer from @putnam complete without any errors with #12981 applied.

digitalsignalperson commented 2 years ago

Awesome, thanks @gamanakis! I tested it here in my vagrant box and it seems to work. Looking forward to using raw sends.

putnam commented 2 years ago

Thanks @gamanakis !!

I hope after this, #12720 can get some attention. I am still not able to use raw sends as the generated send stream contains odd/damaged objects that break the receive.

ipaqmaster commented 2 years ago

I experienced this last night while reformatting a desktop, with the expectation that I could zfs send -w my root, home, and other encrypted datasets back to the machine from my NAS. Only the ones using encryption (aes-256-gcm) gave "Input/Output error" responses when trying to mount them on the local machine after receiving. Unfortunately I didn't have the opportunity to try #12981 and just rsync'd everything to new encrypted datasets made locally overnight.

It's fun: the NAS that these encrypted datasets were sent to could mount them just fine, given the nature of the bug, and so could a laptop that I zfs send -w'd them to from that NAS. The laptop was running OpenZFS v2.1.1 on Arch Linux with kernel 5.15.5. It is only the desktop, which received the encrypted datasets onto a newly created zpool, that could not mount them and hit the issue described here; it was running the same kernel version and was downgraded to OpenZFS 2.1.1 to match the laptop that mounted them successfully.

At this point my first guess is that the version of ZFS your zpool was created on plays a part. But I'll have some more fun playing with it in a VM today now that I'm back online.

At least with the NAS mounting the data, I was able to rsync it to the desktop as a one-off. I had a thread on reddit/zfs here, but for now I've worked around the issue for myself.
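As a sketch, the workaround amounted to something like the following (dataset names and the NAS path are hypothetical):

# recreate the dataset locally with its own encryption root, then copy the data back
zfs create -o encryption=on -o keyformat=passphrase rpool/home_new
rsync -aHAX --info=progress2 nas:/tank/backups/home/ /rpool/home_new/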

Thank you for the PR @gamanakis.

ipaqmaster commented 2 years ago

Just adding my own tests, which are consistent with the above comments about ashift=9 vs. ashift=12: using 12 causes the problem.

The same Arch Linux USB stick was able to zfs recv the "broken encrypted dataset" and mount it perfectly fine using default zpool settings. It was only when I used ashift=12 at zpool creation that the "Input/Output error" issue became apparent.

Testing conditions:

Linux 5.15.5 and zfs 2.1.1, then zfs 2.1.2

  1. I booted the same Arch Linux USB stick I used to rebuild my desktop, but in a QEMU VM.
  2. I created a 100GB qcow2 image for it to use and attached it using VirtIO, which presented it to the VM as a 512b-sectored "disk".
  3. I created a zpool 'zfstest' with no arguments (a new zpool with default settings) on this virtual drive in the VM. Without specifying an ashift, zfs picked ashift=9 by default, which matches what the encrypted dataset was stored on originally.
  4. I sent the same failing dataset from my NAS to this VM's qcow2 zpool.
  5. It loaded the key and mounted just fine.

Then I ran those tests again, but for step 3 I included -o ashift=12, and got the I/O error when trying to mount the received encrypted dataset.

Setting ashift=12 on zfs 2.1.1 and 2.1.2 causes the issue for me in a VM where /dev/vda was presented to the VM as a device using 512b sectors. Seems consistent enough. My laptop was able to read my desktop's encrypted root dataset because its zpool still used ashift=9, like the pool the encrypted root dataset was originally stored on.
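Roughly, steps 3-5 plus the failing variant look like this (a sketch; the pool/dataset names, the source snapshot, and the file-backed vdev standing in for the VM's qcow2 /dev/vda are assumptions):

truncate -s 100G /root/vdisk.img
zpool create zfstest /root/vdisk.img                  # defaults: picks ashift=9 on a 512b device
# zpool create -o ashift=12 zfstest /root/vdisk.img   # variant that reproduced the I/O error
zdb -C zfstest | grep ashift                          # confirm which ashift the pool actually got
ssh nas zfs send -w tank/enc@snap | zfs receive -u zfstest/enc
zfs load-key zfstest/enc
zfs mount zfstest/enc                                 # fails with Input/output error on the ashift=12 pool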

gamanakis commented 2 years ago

In my tests it also happens with ashift=9 (default, checked with zdb) when raw sending back to the originating pool.

ipaqmaster commented 2 years ago

> In my tests it also happens with ashift=9 (default, checked with zdb) when raw sending back to the originating pool.

I am curious: was your pool originally ashift=12 when it first sent the snapshot away, and is it receiving it back as ashift=9 when you experience the issue?

gamanakis commented 2 years ago

Without PR #12981 I cannot raw receive into the originating pool; regardless of the ashift, it throws an Input/output error when mounting.

Let me try your case with #12981 applied.

gamanakis commented 2 years ago

Ok, this seems to be a different issue. Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with ashift=12 results in failure when mounting pool2/encrypted (Input/output error).

I think you should open a new issue. I am not sure raw sending between pools with different ashift is possible; I will take a look.

gamanakis commented 2 years ago

I opened #13067 for this matter and did some debugging there too.