openzfs / openzfs-docs

OpenZFS Documentation
https://openzfs.github.io/openzfs-docs/

Proposal: Consider adding warnings against using zfs native encryption along with send/recv in production #494

Open AndrewJDR opened 8 months ago

AndrewJDR commented 8 months ago

Among experienced zfs users and developers, it seems to be conventional wisdom that zfs native encryption is not suitable for production use, particularly when combined with snapshotting and zfs send/recv. There are long-standing data corruption issues with many firsthand user reports: https://github.com/openzfs/zfs/issues/12014 https://github.com/openzfs/zfs/issues/11688 (also see the issues linked from those)

Additionally, if you join #zfs or #zfsonlinux on libera.chat and mention that you're having an issue with zfs native encryption, you'll be met with advice from developers that zfs native encryption is simply not reliable.

Should warnings be added to the sections of the documentation and/or the zfs command itself that mention native encryption that this combination of features (native encryption + send/recv) is known to be unsuitable for production usage? As-is, there don't appear to be any warnings, and it just seems inappropriate to guide new zfs users down a path toward potential data corruption, or even -- at best -- unscheduled reboots and scrubs. I have attempted writing a warning message below. This can of course be adjusted and is just here to get the ball rolling:

Begin message

ZFS has a known issue where using "zfs send" from an encrypted dataset may result in checksum errors being reported on snapshots within that dataset.

Please note:

  • In many configurations and workloads, this problem does not occur at all, even when "zfs send" is used from an encrypted dataset. Many are using zfs encryption along with zfs send without issue.
  • It is not understood precisely which hardware configuration, software configuration, and workloads cause this issue to manifest.
  • In some cases, when the issue occurs, the checksum errors can be eliminated by rebooting and scrubbing the affected pool twice, but it is not known with absolute certainty if this is always successful in eliminating the checksum errors.
  • In some cases, the issue can be avoided entirely by using a "raw" zfs send instead of an unencrypted zfs send, but it is not known with absolute certainty if this is always successful in avoiding the issue.
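To illustrate the distinction the draft draws, here is a minimal sketch of an unencrypted versus a raw send; the pool, dataset, and snapshot names (`tank/secure`, `backup/secure`, `@snap1`) are hypothetical placeholders:

```shell
# Hypothetical pool/dataset/snapshot names, for illustration only.

# A normal send of an encrypted dataset decrypts the blocks before
# streaming them, so the receive side gets plaintext records:
zfs send tank/secure@snap1 | zfs receive backup/secure

# A raw send (-w/--raw) streams the blocks exactly as stored on disk,
# still encrypted; this is the mode that reportedly avoids the issue
# in some cases:
zfs send -w tank/secure@snap1 | zfs receive backup/secure
```

A raw receive keeps the data encrypted under the source dataset's keys, so the backup can be stored without the key ever being loaded on the receiving machine.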

If you are considering using zfs encryption along with snapshot send/receive in use cases where unscheduled reboots and/or unscheduled scrubs are not acceptable, you may wish to thoroughly test your software and hardware configuration with your workload before putting it into production. If this is not practical, it may be best to explore other options for data encryption until this known issue is rectified. For more information, see the following github link: https://github.com/openzfs/openzfs-docs/issues/494

End Message

Update: I received some feedback that this was not well-substantiated enough. So for some additional context, here is a reddit comment from a zfs developer / contributor:

I have a strange little testbed next to me that reproduces one of the issues over 50% of the time you test it. Depending on which problem, sometimes this is "just" a kernel panic, sometimes it mangles your key settings so you need something custom and magic to let you reach in and fix it, sometimes it writes records that should not have been allowed in an encrypted dataset and then errors out trying to read them again. (To pick three examples.) (The illumos folks reported permanent data loss from what looks like a similar bug to one on OpenZFS, but that's not exactly the same code, so YMMV how worried that makes you.)

In addition, there is the constant stream of user reports in the issues referenced above.

I think there's already an understanding that this issue may be very difficult to fix, but in the meantime I'm just suggesting that it would be good if layman users such as myself had some documentation and zfs command level warning against using these features in production until this is resolved.

bill-mcgonigle commented 8 months ago

It says here:

For the time being I suggest you make sure you don't create or delete snapshots while an unencrypted send is running. If you only do raw encrypted zfs sends, the problem does not occur.

It seems better to have Known Issues than general guidance to not use encryption, unless there are totally unknown causes to verified and unsolved problems. But, yes, docs for any feature should advise people to consult Known Issues when they exist.

owlshrimp commented 8 months ago

If there is functionality around encryption that is known to cause corruption, there really ought to be an unavoidable warning in the software to act as a catch. Not everyone reads all relevant documentation before executing each command, and it seems like issues with encryption have been around long enough for software warnings to be implemented (this isn't a freshly-discovered bug).

I personally know plenty of people who presumably didn't see any warnings when turning on encryption, just commands that went through cleanly, and who assumed that if it's shipping in ZFS it must be safe.

rbrewer123 commented 8 months ago

@Matthew-Bradley that's a great idea. Sounds to me like both the documentation and the software should warn against enabling ZFS-native encryption. This is how the ZFS community can show respect and courtesy to users, especially considering that ZFS-native encryption has been causing real headaches and time-loss for real people for years now.

h-2 commented 8 months ago

This is news to me. I assumed that zfs native encryption is not as fast as non-native encryption, but that it is stable.

Could someone more knowledgeable please clarify whether encryption is considered unsafe in general, or whether it is unsafe in combination with other features or usage patterns, and if so with which?

wdoekes commented 8 months ago

(Disclaimer: I'm not knowledgeable about zfs internals. But I am an experienced user.)

We've been running encryption on Ubuntu systems since it was available in the distro. We have never had any corrupt data.

Until recently, the only problem we had was with Ubuntu/Jammy and a missing patch, which caused snapshots to be unmountable. The patch (or a send/recv loop) fixed that: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1987190

But recently we did observe the send/recv issues that people are talking about. This made recently created snapshots unavailable/useless, although the data does appear to still be there (according to zdb). This is tracked in #15474, but it very much looks like #12014.

https://github.com/openzfs/zfs/issues/15474#issuecomment-1827832617 (wdoekes, Nov 2023)

Reading with zdb -r does seem to work: [...]

https://github.com/openzfs/zfs/issues/12014#issuecomment-860125077 (aerusso, May 2021)

In my case, https://github.com/openzfs/zfs/issues/11688 (which you already reference), I've discovered that rebooting "heals" the snapshot

https://github.com/openzfs/zfs/issues/12014#issuecomment-865040511 (jgoerzen, May 2021)

After a reboot but before a scrub, the zfs send you gave executes fine.

https://github.com/openzfs/zfs/issues/12014#issuecomment-1822764049 (J0riz, Nov 2023)

Rebooting the server and running two scrubs afterwards resolves the error.

Usage pattern that appears to trigger this issue: [...]

So, yes, I would prefer that the bug gets fixed. But if you're willing to put up with occasionally maintaining a failed snapshot - which might never happen - then I think you should be fine.
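The reboot-plus-double-scrub workaround quoted above can be sketched as follows; `tank` is a hypothetical pool name, and `zpool wait` requires a reasonably recent OpenZFS release:

```shell
# Hypothetical pool name; run after rebooting the affected machine.
zpool scrub tank
zpool wait -t scrub tank   # block until the first scrub finishes
zpool scrub tank           # second scrub, per the reports above
zpool wait -t scrub tank
zpool status -v tank       # check whether the checksum errors are gone
```

Note that this is a reported mitigation, not a guaranteed fix; several comments in the linked issues stress that it does not always clear the errors.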

AndrewJDR commented 8 months ago

I'm going to try to steer things back on topic to the idea of adding some warnings about this feature.

  1. Even if these issues can always be worked around by rebooting and scrubbing twice (and you can find reports where this is not the case), the requirement of unexpected reboots and multiple scrubs rules out many production use cases, so some sort of warning in the documentation and/or tools would still be justified. You can see @wohali in the latter half of https://github.com/openzfs/zfs/issues/11688#issuecomment-1916910483 articulating this clearly. Forced reboots and scrubs are something we can probably all tolerate on home/test lab servers, but not on production systems that have hundreds or thousands of people depending on them 24/7 and where scrubs can take hours or days.

  2. Unfortunately, we also know that other sorts of issues exist with native encryption, because a zfs contributor has testbed reproduction cases that trigger: a) kernel panics b) corruption of encryption key data that requires special recovery that a non-zfs developer will probably not know how to perform. See my OP for more information on this.

AndrewJDR commented 8 months ago

I've added a draft of a warning message to the OP. It can of course be adjusted, but it is just there to get the ball rolling.

mabod commented 8 months ago

@rincebrain: In your reddit post you say that you are able to reproduce "one" encryption issue 50% of the time on your test system. Which issue is that? And why can't it be further debugged if it is reproducible?

rincebrain commented 8 months ago

I've personally given up trying to fix native encryption issues after the project's continued refusal to acknowledge they fucked up by merging this. I do not have the energy to argue any more that introducing a 1% chance of lighting your shit on fire in a project that's supposed to be about "reliability" where one did not exist before is a catastrophic failure, or that "it's not technically data loss because someone could write tooling to recover it" doesn't really matter if you don't have the tooling in hand, that's still data loss from everyone else's perspective, or "snapshots can cause you to get IO errors if you're doing send/recv at the same time" is a sign of how badly this is broken.

Of course, as always, leadership will probably be along shortly to claim there's no issue, and that it would be bad PR to admit there's an issue, and that's why they won't warn people, like the last 2 or 3 times I've asked them to do this.

The reason the reproducer system I have is difficult to debug is that it's a little sparc box, and the race in question in openzfs/zfs#11679 is very finicky, so A) it being a sparc box means most of Linux's kernel debugging tools just laugh at you and don't run, and B) if you add too many debug prints, the timing gets less reliable, so you can't just get all the information you want out reliably.

h-2 commented 8 months ago

I'm going to try to steer things back on topic

I do think it is on-topic to try to document as best as possible when these issues occur. It would strengthen the case for putting up a warning and help readers of such a warning to make an informed decision.

owlshrimp commented 8 months ago

I do think it is on-topic to try to document as best as possible when these issues occur.

Agreed. There should be some clarity about exactly *what* is being warned against.

Whether a specific warning/lockout is warranted when using certain features in combination with native encryption depends on whether the problems can *truly* be narrowed to specific combinations. Right now it looks like the answer is no, with kernel panics and random data corruption in the mix. Even if that is disregarded, corruption with native encryption seems, from this thread, to impact multiple features across snapshots, send/receive, and scrubbing. In these latter two cases (panics and widespread impact), top-level warnings against enabling native encryption in both the documentation and tools must be part of the solution. Some tooling friction against enabling it may also be warranted.

A corresponding issue should probably be set up in https://github.com/openzfs/zfs/issues or similar to track changes to the tooling.

The internet at large has already picked up on this (I first became aware of it from a phoronix article). The best thing the project can do is put strong safeguards in place to stop the flow of people being bitten. People should be absolutely certain that they can't do something dangerous without at least running into a warning or error.

AndrewJDR commented 8 months ago

Absolutely agreed that having as much clarity as we can achieve is a good thing. If anyone has suggestions on tweaks for the warning message based on what you've learned, please chime in -- the draft is in the OP. I've tried to clarify it as much as I can, based on what I've been able to learn from the publicly available information. If someone wants to work in the testbed results from @rincebrain, that seems fine as well. I thought about doing it, but struggled with how to phrase it.

Personally, I think it's important not to make the message "too scary", because that can provide folks an opening to muddy the waters with comments like "Well, I've been using it fine for years!", which while not untrue, doesn't really help the many people that try it and run into the issues. This is why the current draft of the warning mentions that many have been able to use it without issue.

robszy commented 3 months ago

I think the only way forward is to fix the issues: encryption has been merged, so the leadership agreed to support it, just as there is support when a bug turns up in code that doesn't involve encryption.

Otherwise there should be a warning that encryption is EXPERIMENTAL and unsupported: use it only if you are prepared to lose your data.