openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Tool for Emergency Master-Key Recovery #15952

Open f1d094 opened 6 months ago

f1d094 commented 6 months ago

Describe the feature you would like to see added to OpenZFS

A mechanism for backup and restore of the master key of an encrypted dataset

How will this feature improve OpenZFS?

When the encryption key is changed for a given dataset, that key is also immediately changed for all snapshots. This is then propagated to any dataset which receives a snapshot as part of a backup process. Without a backup key, a malicious actor with sufficient access can potentially lock the active filesystem and all backups. This could also happen innocently, where an administrator manages to fat-finger a password twice, or has caps-lock on, or pastes incorrectly, etc. In any case the result is the same: loss of the affected dataset and all backups.

A mechanism to recover the master key is essential to the safe use of ZFS native encryption.

Additional context

  1. Tom Caputi implemented much of the OpenZFS encryption codebase. In #6824, @tcaputi said

the data is encrypted with the master key and the master key is encrypted with the user key which is in turn derived from the user's password / key material along with some parameters stored on disk. All of these are not saved in a snapshot, and instead live in the top-level meta object store (MOS). When the user does a raw zfs send, the send file includes some extra instructions that REPLACE these on-disk values on the target system. So after a raw zfs recv has occurred, the old values are no longer accessible. Because of this, there isn't currently a good way to backup these values in a way where they will be permanently safe.

If we can save these parameters in a separate file which can be stored securely, these parameters should always be usable to decrypt the master key as long as the user remembers the corresponding password. This could be used in break-glass kind of situations, but it also comes with a (manageable) security risk that anyone who can access this file AND the password has a permanent way to access the dataset.

Theoretically, though, the implementation shouldn't be that hard. We just need to save the parameters needed to derive the user key to a file and then allow the proper hooks into the kernel to allow those parameters to be used instead of the ones that are saved in zfs. Then that would need to be wired out into the command line interface.

  2. In November 2023 I submitted a CVE for this vulnerability (https://github.com/openzfs/zfs/security/advisories/GHSA-5wqj-fcr9-j434); however, it has not yet been reviewed. The text is not visible to the public until approved by the OpenZFS owners, so I will post a comment with the full text below.

  3. This concept is already covered in #12649, but that request is much larger. Emergency key recovery could and should be a small, simple tool made available ASAP, rather than waiting on a significant overhaul of the encryption tools.

f1d094 commented 6 months ago

Below is a copy of my related CVE submission (https://github.com/openzfs/zfs/security/advisories/GHSA-5wqj-fcr9-j434)

Encrypted ZFS Volumes Vulnerable to Encryption Re-Key Attack

Summary

The lack of an offline backup mechanism for decrypting the immutable master key results in the potential for encrypted ZFS volumes to be weaponized. A malicious actor who gains privileged access can re-key available encrypted volumes, their children and snapshots. By extension, systems which then pull snapshots for offline backup will also have the associated volumes re-keyed. This can result in the complete loss of data on the related volumes or enable a perfect scenario for ransom.

Details

As a feature, any unlocked, encrypted ZFS volume, its children, and all dependent snapshots can be re-keyed via "zfs change-key -i". When snapshots are transmitted to offline backup systems, those volumes and systems also inherit the new key. The only requirements for re-keying are that the existing key is already loaded, that the user has permission to perform encryption operations (root account, or granted via zfs allow), and that the new key is available. Any system that is up and running and relies on data within an encrypted ZFS volume will presumably have keys loaded.

Attack and escalation vectors that may result in the compromise of a system with elevated permissions are not a daily occurrence, but they are common enough that even a system kept up to date with security patches will have windows of opportunity for this attack chain. The impact of a successful attack is the complete loss of not only the active host but also the snapshots/volumes pulled from the affected host. For many deployments, these are the only backups kept for member systems.

@tcaputi, a non-active but previously significant contributor to openzfs/zfs, indicates that this attack vector could be mitigated in #6824:

"...there is an answer here that wouldn't take too much work to implement. The current filesystem decryption code works using the user-supplied key material and the parameters saved on disk to decrypt the immutable master key. In the event of the attack described here, the stored parameters would no longer be available, but theoretically we could allow the user to save these parameters to a file that they could save securely on their own. Then they could decrypt the master key even if the on-disk encryption keys were altered via a new ioctl...

...it would probably be useful as a break-glass kind of solution."

PoC

Step 1: Host is accessed by a malicious insider, or compromised with privileged permissions via an unrelated attack vector.
Step 2: Malicious actor uses zfs change-key -i to re-encrypt zfs volumes.
Step 3: Offline snapshots are pulled by the backup system; those volumes are now locked using the malicious actor's key.
Step 4: Malicious actor unloads the active zfs key, reboots the host, or otherwise triggers the affected volumes to lock.
Step 5: Ransom email is sent to the affected parties and/or the data is lost forever.

Impact

All users utilizing encrypted zfs

norpol commented 6 months ago

I'm an outsider to this project, I've seen your original comments back then in the other threads.

@f1d094 in #6824:

[...] I am looking for someone to acknowledge that there is a vulnerability and approve the CVE. [...]

Personally, I don't think this reaches the threshold for a CVE; frankly, this is really a feature request. I would believe this could only be a CVE if ZFS advertised protection against your described scenario, and I have not seen that claimed anywhere. Furthermore, "Encryption Re-Key Attack" is not an actual term used for describing CVEs; if you google "Encryption Re-Key Attack", you are the only result altogether, and "rekey" is usually seen as a security feature. In ZFS's case you could actually consider it a security feature as well, since if your passphrase is compromised it is trivial to rotate the passphrase of local and remote backups/snapshots. Perhaps it should be documented better, though.

Also you are assuming that backups are being created with the -w flag, without -w your described issue does not arise.

Emergency key recovery could and should be a small simple tool and be made available

I agree that would be a handy feature indeed.

A workaround for disaster recovery might be to use the zstreamdump command to extract the master key; this might help you with disaster recovery, with the support of the community or a professional: https://github.com/openzfs/zfs/issues/12649#issuecomment-1941318308 Whether this would work in an actual incident is unclear, though, and should be tested beforehand.
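The extraction described above can be sketched as a small shell helper. This is a hedged sketch, not an official tool: it assumes the `DSL_CRYPTO_*` record names shown in the #12649 comment, and the dataset name and output path are hypothetical.

```shell
#!/bin/sh
# Sketch: archive the DSL_CRYPTO_* records (master key, HMAC key, IV, MAC,
# salt) from a raw send stream so they can be stored offline for
# break-glass recovery. Record names follow the zstream dump output
# discussed in #12649; dataset and paths below are hypothetical.
set -eu

extract_crypto_params() {
  # Keep only the DSL_CRYPTO_* lines from a `zstream dump -d` listing
  # read on stdin.
  grep -o 'DSL_CRYPTO_[A-Z0-9_]*.*'
}

# Usage (root, key loaded; -w keeps the stream raw so the records appear):
#   zfs send -w tank/secure@backup | zstream dump -d \
#     | extract_crypto_params > /mnt/offline/tank_secure.crypto-params
```

Note this only archives the wrapping parameters; actually using them for recovery still depends on the tooling discussed in this issue.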

a malicious actor with sufficient access can potentially lock the active filesystem and all backups

This can always happen. The only proper protection against ransomware attacks is having actual physical offline backups; everything that is online can be affected by an infiltration. The definition of offline is that the storage is physically disconnected/airgapped: not reachable by other computers and accessible only physically. Please follow the USA CISA Ransomware Guide. SOC 2 and common DPIAs will likely require you to have such disaster-recovery backups in place.

uses zfs change-key -i to re-encrypt zfs volume

I believe change-key -i will not actually re-encrypt your zfs objects, as also explained here at the "final note". Practically, of course, it comes close, given the current lack of tooling (though disaster recovery is sufficient).

Offline snapshots are pulled by backup system; those volumes are now locked using the malicious actor's key

This wouldn't be an offline backup any longer if it can "pull" new changes. It also only applies if you are using zfs send -w; if you are doing backups with plain zfs send this will not happen, but that requires the receiving side to re-encrypt, as documented for zfs recv.

Overall:

f1d094 commented 6 months ago

@norpol: I'm not getting the impression you have significant experience with network security or complex multi-user enterprise environments, certainly not with zfs. Or, for that matter, with the general point of this issue.

"Encryption Re-Key Attack" is not an actual term used for describing CVEs

The name for any type of attack doesn't exist...until one day it does. Your inability to google a similar attack vector has no bearing. The name is arbitrary. The vector is very real. "ReKey" as a "security feature" is only a feature if you are the one to do the re-keying. Every tool is a "feature" until it is in the wrong hands. The problem here is that it has extreme knock-on effects. If I "rekey" your system without your knowledge, you are SOL. I hope your critical data is mighty static and your "offline backups" were recent. This vulnerability is acknowledged by @tcaputi, who is certainly an authority. If it should not be a CVE then how about we let the ZFS owners state as much?

Also you are assuming that backups are being created with the -w flag, without -w your described issue does not arise.

use other tools, such as restic, to protect against your scenario

So, your point here is that no one should use native encryption, the -w flag exists for no valid reason, and that old-style copy-pasta backups are a good solution?

Please explain the following:

  1. How do you back up your clients' data that are in encrypted zfs datasets to which you do not have the key?
  2. Please describe any scalable deployment scheme for physical systems running on encrypted operating-system disks, with secure remote key-loading and binary-key-encrypted data pools, that allows you to back them up such that you can fully restore the entire system to its exact operating state within a maximum 15-minute window of its last operating state. Or, for that matter, any 15-minute window in the past month?

Your narrow concept of enterprise environments does not cover these use cases. I could go on, but the point is made.

if you need protection for your described scenario, you are responsible for ensuring this protection yourself

This can always happen. The only proper protection for ransomware attacks is having actual physical offline backups...

What exactly is a "physical offline backup"? Do the packets to this magical datastore teleport to this location while the power is off with no networks connected? Your comments remind me of the age-old chestnuts: "If they're on your internal network you have bigger problems anyway" and "If the attacker gains root doesn't matter (so what's the point)".

A configuration using a secure, centralized backup system that has one-way inbound access to the various systems to be backed up, pulling snapshots which are then further exported to an offsite location certainly meets the metric for very strong disaster recovery...except for the vulnerability described herein. Remove that vulnerability and this setup is about 1000x faster, deeper, and more flexible than anything you can come up with "offline backup". Restic cannot compete with instantaneous block-level snapshots.

Being able to recover the master security key should have been baked in at inception and it is an oversight that it hasn't been made available so far.

everything that is online can always be affected on infiltration.

No. Not everything. Not always. Please try and break into our centralized backup system that has no ports open from anywhere but does outbound connections via ssh only. If you can write an exploit for openssh on a target system that results in a shell on an inbound connecting system, please call the NSA today, they need your skills immediately. Also: good luck.

A workaround for disaster recovery might be to use the zstreamdump command to extract the master key; this might help you with disaster recovery with the support of the community or a professional.

So, for several hundred datasets, we should set up a mechanism to send them, zstreamdump each, and then somehow, hopefully, with the help of the community or "a professional" (what do you think we are?), maybe extract the key for each, which will save our bacon in the future? This is not much of a disaster recovery plan.

If you need this feature in a professional setting, try using gitpay.me or upwork to find someone to work on this on your behalf

This is the only thing you've said of merit. If I had the bandwidth I'd put together a PR myself. It is important enough that I will take some time and see what I can pull together in any case, but I suspect, based on prior comments by Tom, that someone with actual experience with the codebase could put together a small tool with minimal effort.

Hence the separate, single-issue issue. A small tool to fix this significant vulnerability can likely be produced quickly. With the ability to extract/save/use the master key for a change-key operation, the entire attack chain goes away. The problem in #12649 is muddled and its resolution is certainly a heavier lift, therefore a potentially long wait.

norpol commented 6 months ago

I apologize that my reply appears to have made you feel uncomfortable, but I'd also appreciate if you'd stay decent though and abstain from becoming too personal. Suggesting similarities to "age-old chestnuts", telling me that I have an "inability to google" or sarcastic/harsh side notes such as "Do the packets to this magical datastore teleport" or "please call the NSA today, they need your skills immediately. Also: good luck." should not be necessary to communicate your underlying needs.

Since there was not a lot of context about the environment/threat model in which you are utilizing zfs encryption, it is of course impossible to give tailored advice in a simple reply; I apologize for that attempt. I just thought it would help others decide when to utilize what. Perhaps that context can help others better understand why this is important to you, and prioritize it a bit more.

So, your point here is that no one should use native encryption

No. I just wanted to clarify the situation in which the issue arises, since this wasn't covered by the CVE report; though I've now noticed that the part you quoted in your issue contains the phrasing "user does a raw zfs send", which actually covers this.

"So, for several hundred datasets, we should setup a mechanism to send them, zstreamdump each"

So I assume that is the situation you're dealing with, sounds cool.

what do you think we are?

Perhaps that is something you would like to elaborate on; I would not have expected from your previous statements that you are a senior C developer with familiarity in cryptography or the relevant zfs codebase. I apologize if that came across as rude; it was certainly not my intention.

Anyway, I hope this issue can be dealt with soon. Unfortunately it appears to me that at the moment there is a lack of resources for the ZFS encryption parts.

I guess one possible workaround for the time being is to add some additional safeguards on the receiving end, such as using zstream dump to verify that the keys haven't changed before continuing to process the original zfs send stream.
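The receiving-end safeguard could be sketched roughly as follows: record a fingerprint of the stream's crypto records on the first backup, then refuse any later stream whose fingerprint differs. This is a hedged sketch under the assumption that `zstream dump -d` lists `DSL_CRYPTO_*` records as in #12649; all paths are hypothetical.

```shell
#!/bin/sh
# Sketch: detect a surreptitious re-key on the backup host by comparing the
# incoming stream's DSL_CRYPTO_* records against a stored baseline.
# Record names per the zstream dump output in #12649; paths hypothetical.
set -eu

crypto_fingerprint() {
  # Reduce a `zstream dump -d` listing on stdin to a stable checksum of
  # its DSL_CRYPTO_* lines (key, IV, MAC, salt), suitable for comparison.
  grep -o 'DSL_CRYPTO_[A-Z0-9_]*.*' | sort | cksum
}

verify_against_baseline() {
  # $1: file holding the baseline fingerprint. Reads the dump listing on
  # stdin; fails (non-zero exit) if the fingerprint has changed.
  baseline_file=$1
  current=$(crypto_fingerprint)
  [ "$current" = "$(cat "$baseline_file")" ]
}

# Usage on the backup host (inspect the stream before it touches the pool):
#   zfs send -w pool/ds@snap > /tmp/stream
#   zstream dump -d < /tmp/stream | verify_against_baseline /backup/ds.fp \
#     && zfs receive backup/ds < /tmp/stream
```

The design choice here is to fail closed: a changed fingerprint blocks the receive and leaves the existing backups untouched for an administrator to investigate.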

It is important enough that I will take some time and see what I can pull together in any case

Great, I'm sure everyone would appreciate that.

f1d094 commented 6 months ago

I apologize that my reply appears to have made you feel uncomfortable

@norpol: Nothing you have said makes me uncomfortable. Nothing to apologize for. I also was not trying to be personal. By "inability to google" I meant that whether or not a single prior example exists in the whole of computing history is irrelevant; it was not a comment on your ability to find one. And yes, as someone who runs a penetration-testing/red-team consultancy, I get a lot of pushback from well-intentioned but half-informed admins with a very similar perspective, commenting on externalities that have no bearing on the specific topic at hand. I am generally brusque with distractions.

Since there was not a lot of context for in which environment/threat model you are utilizing zfs encryption, of course, it is impossible to give tailored advice

The context is simple and covered 100% in the detail of the CVE submission. I don't think there is any advice to give, unless you personally know how to extract-and-backup the master-key, and then use it to recover a dataset whose master-key is encrypted with an unknown key.

This isn't a forum for how to improve your enterprise deployment or pass random-government-regulation. Other backup schemes or why you might use them are not helpful and do nothing but muddy the topic.

Please review this comment from #6824 by Tom Caputi, which is very succinct and anyone interested in this topic should absorb:

Just to be sure we're all on the same page with the scope / current implementation (since there has been a lot of discussion). You are correct that the data is encrypted with the master key and the master key is encrypted with the user key which is in turn derived from the user's password / key material along with some parameters stored on disk. All of these are not saved in a snapshot, and instead live in the top-level meta object store (MOS). When the user does a raw zfs send, the send file includes some extra instructions that REPLACE these on-disk values on the target system. So after a raw zfs recv has occurred, the old values are no longer accessible. Because of this, there isn't currently a good way to backup these values in a way where they will be permanently safe.

If we can save these parameters in a separate file which can be stored securely, these parameters should always be usable to decrypt the master key as long as the user remembers the corresponding password. This could be used in break-glass kind of situations, but it also comes with a (manageable) security risk that anyone who can access this file AND the password has a permanent way to access the dataset.

Theoretically, though, the implementation shouldn't be that hard. We just need to save the parameters needed to derive the user key to a file and then allow the proper hooks into the kernel to allow those parameters to be used instead of the ones that are saved in zfs. Then that would need to be wired out into the command line interface.

The third paragraph should pique the interest of anyone whom this affects, and it is the driver behind my creating this issue.

As far as my background:

I would have not expected that you are a senior C developer with familiarity in cryptography or the relevant zfs codebase from your previous statements

I started my career as a C developer, and while I am not Bruce Schneier, I read and understand his work and can wade through whatever codebase is needed. If my life depended on it, I could do the work. As it stands, much like everyone, I have other priorities and will have to leave it to more nimble hands who have time to dedicate.

Until a tool can be created, any workaround which enables similar functionality would be welcome, regardless of complexity...just as long as it is 100% going to work. As far as I remember, no details regarding the encryption keys are visible via zstream, but I'll double-check ASAP. It would be tremendous if they were; a great stopgap measure.

norpol commented 6 months ago

As far as I remember, there are no details are visible via zstream regarding the encryption keys

What do you mean by that exactly? I followed this https://github.com/openzfs/zfs/issues/12649#issuecomment-1941318308 and it is also available on my local zfs pool. Writing something that first validates that the key, IV, MAC, and salt look as expected and then continues with zfs receive should be relatively trivial, no? Of course it depends on how/what you are retrieving.

zfs send -w | zstream dump -d | grep -m1 -Po 'DSL_CRYPTO_MASTER_KEY_1.*'

f1d094 commented 6 months ago

What do you mean by that exactly?

I don't think I've ever reviewed encryption information using zstream. Clearly a failing on my part. Upon review of the #12649 comments you mentioned, I now see DSL_CRYPTO_HMAC_KEY_1 and DSL_CRYPTO_MASTER_KEY_1 etc. as part of the dump. This looks very promising and should be an effective and trivial workaround, if not a particularly efficient one. Thank you!

This does not eliminate the need for a tool, but an available workaround certainly lowers the urgency.

clhedrick commented 6 months ago

I would think a simpler solution would be for zfs receive to have an option -x encryptkey that prevented it from honoring requests to change the key. That sounds really easy to do, and sounds like it's well worth it.

f1d094 commented 6 months ago

I would think a simpler solution would be for zfs receive to have an option -x encryptkey that prevented it from honoring requests to change the key. That sounds really easy to do, and sounds like it's well worth it.

Absolutely. Given that it was trivial for me to implement a check externally based on @norpol's observation, it should be equally trivial to update zfs receive to have this option. I'd say it should be something other than -x, just because the relevant encryption-key data aren't properties; but otherwise, yes. Maybe "zfs receive -K"?

When I have a few mins I will create a separate issue for this...or, if you have the time @clhedrick, please do. Both of these features should be implemented but a "zfs receive -K" (or whatever) PR could be done just as trivially with the right hands I imagine.
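The external check mentioned above might look something like this: compare the incoming stream's crypto records against those the destination dataset currently has, and only receive on a match. This is a hedged sketch of a wrapper approximating the proposed behavior; neither a `-K` flag nor this wrapper exists in OpenZFS, and all names are illustrative.

```shell
#!/bin/sh
# Sketch: approximate a hypothetical `zfs receive -K` externally by
# refusing streams whose DSL_CRYPTO_* records differ from the destination
# dataset's current records. Record names per #12649; names hypothetical.
set -eu

crypto_records() {
  # Normalize a `zstream dump -d` listing on stdin to just its sorted
  # DSL_CRYPTO_* lines.
  grep -o 'DSL_CRYPTO_[A-Z0-9_]*.*' | sort
}

same_crypto_records() {
  # $1, $2: files holding dump listings; succeed only if their crypto
  # records match exactly.
  [ "$(crypto_records < "$1")" = "$(crypto_records < "$2")" ]
}

# Usage sketch on the backup host:
#   zstream dump -d < incoming.stream            > /tmp/incoming.txt
#   zfs send -w backup/ds@last | zstream dump -d > /tmp/current.txt
#   same_crypto_records /tmp/incoming.txt /tmp/current.txt \
#     && zfs receive backup/ds < incoming.stream
```

A native flag would of course be preferable, since the wrapper must buffer or dump the stream before deciding to receive it.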

f1d094 commented 6 months ago

I would still like to see a Disaster Recovery mechanism for the master key, so I will keep this issue open. There are many use cases where this would be of value above and beyond 'evilhacker'. For instance, I would like to be able to delegate key authority over several of our encrypted datasets to various personnel. Unfortunately, we cannot risk any mistakes (because of said lack of a DRP method) and therefore our keys must stay accessible/deployable only through our elaborate-but-safe system™, which is not friendly to multi-user access.

GregorKopka commented 5 months ago

Regarding the safety of backups using send|recv:

See https://github.com/openzfs/zfs/issues/5341#issuecomment-259642448 for possible ways to mess with backup servers.

There was also a discussion a good while back (sorry, I was not able to find it) about the possibility of completely destroying backup servers by feeding a manipulated datastream to the zfs recv running on them... by intercepting the zfs send (issued by the ssh from the backup system calling into the compromised machine to back up) and replacing the generated output with a doctored stream that contains carefully crafted garbage to nuke the target pool (or at least the received filesystems).

f1d094 commented 5 months ago

@GregorKopka I did not have time to read the entire thread you referenced (on phone at airport). Does your hypothetical scenario work where the snapshots are only generated by the receiving system?

To clarify: In our setup the source systems do zero snapshotting themselves. It is all done by the backup system which connects, makes the snapshots, normalizes them (gets rid of incompatible snapshots or panics and messages admin if something unexpected happens) and then pulls them, and never with zfs receive -F. (and now also doing a zfs send | zstream dump on both sending and receiving datasets to compare keys before proceeding)

I don't currently see a scenario in this configuration where property manipulation or other skulduggery on the sending system would result in actual data-loss in the datasets stored on the backup system.

Thoughts?

GregorKopka commented 5 months ago

Does your hypothetical scenario work where the snapshots are only generated by the receiving system?

TL;DR: yes. A compromised host could intercept the calls to the zfs binary originating from the backup system and deliver problematic send datastreams, like a replication stream containing just an empty filesystem and the information to destroy all other snapshots on the target (which can trigger if the recv comes with -F), or one whose properties will mount the supplied filesystem onto /root/.ssh (or other shenanigans, enabled by the path traversal bug in the logic determining the absolute mountpoint: https://github.com/openzfs/zfs/issues/13896).

Plus the (so far) theoretical option of delivering a stream that comes through the consistency checks on recv but contains enough garbage to mess up the target pool for good... maybe even a zfs send version of 42.zip (sending a compressed stream that instructs the target system to store huge amounts of data uncompressed) could be possible...? 🤷‍♂️

f1d094 commented 5 months ago

Interesting.

I don't see where the first couple would apply to our situation...we do not use receive -F anywhere and nothing is ever mounted on the backup systems, but the theoretical attack warrants review. A few questions:

  1. If I wanted to test this, do you have any POCs you can share/point-to that allow for passing the consistency checks and still sending damaging garbage?
  2. How would an unmounted dataset damage the hosting pool? (Or did you mean to say dataset?)
  3. For each backed up dataset, if they live under a parent dataset that has a quota set, would your "data volume attack" circumvent that?

The only damaging attack I've been able to actually execute is changing the key...

GregorKopka commented 4 months ago

we do not use receive -F anywhere

Then it should be reasonably safe from the snapshots on the target side being wiped. Nevertheless, should your backup routine have snapshot expiry in place on the target side... that code could maybe be tricked, through delivery of a high number of new snapshots (all containing a now-empty dataset), into doing that particular job for the attacker: destroying all the old snapshots (that still hold the valid data) so only the empty ones are left. With some trickery this could be performed without actually destroying data on the source, which could then happen after the backup had been corrupted long enough.

f1d094 commented 4 months ago

@GregorKopka: This is a novel approach I had not considered. We do not pull empty snapshots but the general proof-of-concept still bears significant consideration. All of our logic resides solely on the pulling/backup hosts and this type of zeroing-out or zeroing-then-minorly-changing 1000x arbitrary snapshots should be trivial to plan for and prevent...but it needs to be mitigated ahead of time. Post-mortem would just be tears in the rain.

As it stands I believe our setup only pulls snapshots created by the backup system, and any others on the target are destroyed. In theory someone should notice that 'critical data' has been changed long before our own rules run their course to the point of being unrecoverable; otherwise the data probably wasn't that critical, n'est-ce pas?