hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Allow Vault to auto-unseal restarted instances when in High Availability mode. #16419

Open ksa-real opened 1 year ago

ksa-real commented 1 year ago

Is your feature request related to a problem? Please describe.

Typically, maintenance of critical nodes is done one node at a time. A node comes back sealed after maintenance and then requires one of the following: a manual unseal (inconvenient to do every time); reliance on an external provider, with no current way to fall back to a manual unseal; or another Vault cluster, which has the same problem and itself needs unsealing during node maintenance.

Describe the solution you'd like

Assuming most of the nodes are still unsealed (say 2 of 3, or 3 of 5), it would be nice if the live unsealed Vault nodes could unseal the sealed one(s). So, initially, nodes are unsealed with Shamir key shares, and afterwards the cluster auto-unseals returning nodes as long as enough nodes remain unsealed.
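The "enough nodes are still unsealed" condition could be expressed roughly as below. This is only a sketch of the rule being proposed, not existing Vault behaviour; the function name and the strict-majority threshold are assumptions chosen to match the 2-of-3 and 3-of-5 examples.

```python
def cluster_may_auto_unseal(total_nodes: int, unsealed_nodes: int) -> bool:
    """Return True if a strict majority of the cluster is still unsealed.

    Hypothetical rule for this proposal; Vault does not implement it.
    """
    if total_nodes <= 0 or not 0 <= unsealed_nodes <= total_nodes:
        raise ValueError("invalid node counts")
    return unsealed_nodes * 2 > total_nodes


print(cluster_may_auto_unseal(3, 2))  # 2 of 3 nodes unsealed
print(cluster_may_auto_unseal(5, 2))  # only 2 of 5 nodes unsealed
```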

Describe alternatives you've considered

I am not sure whether this has already been considered and whether any roadblocks were found.

#6046 seems to be related. People there asked for multiple recovery keys, possibly of different types (Shamir, KMS). Here we would likely need the same: Shamir plus something transit-like.

maxb commented 1 year ago

Whilst I can see how this would be useful, it seems like it would weaken the security model.

The problem is: how can the remaining unsealed nodes successfully authenticate that the restarted node is legitimate?

Today, a Shamir seal protects against a single admin slipping a modified Vault binary that allows them extra access onto a node, and restarting Vault to use it. (Whereas, auto-unseal does not, since the node can just auto-unseal.)

This protection would be lost if a returning node could reacquire the key material from other cluster nodes.

There are other attacks of concern too - e.g. shutting down an existing Vault node and starting up a new compromised host at the same IP address.

I can theorise that perhaps a Vault cluster could - optionally - maintain a short lived encryption key and a short lived authentication secret on disk on each node, and unsealed nodes could be willing to disclose the root key, encrypted with the short lived key, over the network, if the authentication secret was presented. I think this would roughly equate to a kind of time-limited cluster-internal transit unseal.
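To make the conjecture above concrete, here is a minimal sketch of such a time-limited exchange. Everything in it is hypothetical: `NodeTicket`, `issue_ticket`, `disclose_root_key` and `unwrap_root_key` are illustrative names, not Vault code, and the XOR one-time pad merely stands in for a real authenticated cipher so that the example runs with the standard library only.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeTicket:
    """Short-lived material kept on a node's disk (hypothetical)."""
    auth_secret: bytes  # presented by the restarted node to its peers
    wrap_key: bytes     # one-time pad that wraps the root key in transit
    expires_at: float   # Unix time after which peers refuse the ticket


def issue_ticket(ttl_seconds: float, now: Optional[float] = None) -> NodeTicket:
    """Create a fresh ticket that is valid for ttl_seconds."""
    now = time.time() if now is None else now
    return NodeTicket(secrets.token_bytes(32), secrets.token_bytes(32), now + ttl_seconds)


def disclose_root_key(root_key: bytes, ticket: NodeTicket,
                      presented_secret: bytes,
                      now: Optional[float] = None) -> Optional[bytes]:
    """Run by an unsealed peer: return the wrapped root key, or None on refusal."""
    now = time.time() if now is None else now
    if len(root_key) != len(ticket.wrap_key):
        raise ValueError("root key and wrap key must have the same length")
    if now >= ticket.expires_at:
        return None  # ticket too old: fall back to a manual unseal
    if not hmac.compare_digest(presented_secret, ticket.auth_secret):
        return None  # the requester could not prove it held the ticket
    return bytes(a ^ b for a, b in zip(root_key, ticket.wrap_key))


def unwrap_root_key(wrapped: bytes, ticket: NodeTicket) -> bytes:
    """Run by the restarted node: recover the root key from the wrapped form."""
    return bytes(a ^ b for a, b in zip(wrapped, ticket.wrap_key))
```

A ticket would have to be replaced after every use and on a short schedule, since a one-time pad must never be reused. The sketch also does nothing about the attacks mentioned earlier: anyone who can read the node's disk within the validity window can impersonate the node.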

The question, though, is whether HashiCorp would want to entertain that level of additional complexity in critical security code, to support a lesser security level for customers who wanted a different security/convenience tradeoff.

Or, perhaps there is a better way than what I have conjectured above.

Meanwhile, there are some possible compromises you can make today to create different unseal behaviours using the currently available building blocks:

1) You can just write the unseal key to disk unencrypted on the Vault nodes, and have a local wrapper script to inject it at startup. Very insecure, of course, but since we're discussing security for convenience, it seems worth mentioning.

2) You can create a super-secure service of your own, and give it enough Vault key shares to unseal Vault, which it transmits to the Vault API on request, subject to the business logic of your choice. It's no small undertaking, obviously, but it does open up the option for you to implement whatever rules you want, around disclosing the unseal key.
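Both options end with something submitting a key share to Vault's `sys/unseal` HTTP endpoint. A minimal sketch of that call follows, using only the Python standard library. The function names are illustrative, the Vault address is an assumption, and how the share is stored or released is left out because that is exactly the part each option handles differently.

```python
import json
import urllib.request


def build_unseal_request(vault_addr: str, key_share: str) -> urllib.request.Request:
    """Build the HTTP request that submits one key share to /v1/sys/unseal."""
    body = json.dumps({"key": key_share}).encode("utf-8")
    return urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/sys/unseal",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def submit_unseal_share(vault_addr: str, key_share: str, timeout: float = 10.0) -> dict:
    """Send one key share and return Vault's JSON seal status."""
    request = build_unseal_request(vault_addr, key_share)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The returned status includes fields such as `sealed` and `progress`, so with a threshold above one the caller would submit one share per call until `sealed` becomes false.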

ksa-real commented 1 year ago

Sorry if my questions are naive; I don't know enough about Vault, so I'm just making some guesses. The documentation focuses mostly on operational procedures and much less on what happens underneath: which crypto suites are used and which attack vectors the solution mitigates. For example, I don't understand why a chain of keys (root, encryption) is used, as they are all stored with the data. My guess is that the root key is decrypted only briefly in order to decrypt the encryption key, and the encryption key is then held in memory unencrypted. Or maybe the encryption key is the same on all instances while the root keys are different, but I don't see why that would be needed. Or maybe there was a desire to rotate root keys without re-encrypting the data. I'm not sure what the benefit would be, though, as both keys likely have the same security strength.

So, the aim is to make usage more convenient without making Vault less secure.

> Today, a Shamir seal protects against a single admin slipping a modified Vault binary that allows them extra access onto a node, and restarting Vault to use it. (Whereas, auto-unseal does not, since the node can just auto-unseal.)

Currently, the issue with auto-unseal is that if something goes wrong with the unseal authority, there is no fallback to something controlled by the owners (e.g. the Shamir key holders). Obviously, there is currently a trade-off between using Shamir and auto-unseal. But if both methods were available at the same time, IMO security would not decrease. So we should compare the proposed approach with the current auto-unseal.

The issue with transit-engine auto-unseal is that two Vaults cannot auto-unseal each other: one must already be up and running to unseal the other, so the first could not itself have been unsealed by the second. The only solution to this chicken-and-egg problem that I see is having multiple unseal options available at the same time.
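The cold-start deadlock can be illustrated with a tiny simulation. This is a toy model under stated assumptions, not Vault behaviour: `cold_start` and the cluster names are made up, and the `manually_unsealed` argument stands for the hypothetical Shamir fallback being asked for here.

```python
from typing import Dict, Set


def cold_start(transit_seal: Dict[str, str], manually_unsealed: Set[str]) -> Set[str]:
    """Return the clusters that end up unsealed after a full outage.

    transit_seal maps each cluster to the cluster whose transit engine unseals it.
    manually_unsealed lists clusters that operators unseal by hand (hypothetical fallback).
    """
    unsealed = set(manually_unsealed)
    changed = True
    while changed:
        changed = False
        for cluster, provider in transit_seal.items():
            # A cluster can auto-unseal only once its provider is already unsealed.
            if cluster not in unsealed and provider in unsealed:
                unsealed.add(cluster)
                changed = True
    return unsealed


mutual = {"vault-a": "vault-b", "vault-b": "vault-a"}
print(cold_start(mutual, set()))        # neither cluster can start
print(cold_start(mutual, {"vault-a"}))  # one manual unseal breaks the cycle
```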

If an attacker has root access to the node, they can read RAM and retrieve the encryption key. So we assume that admins either do not have this ability or are not considered a threat. As for tampering with the Vault binary, what prevents copying the data and the auto-unseal auth configs/secrets and using them with a new Vault binary?

Auto-unseal is currently assumed to be an acceptable option. Enough unencrypted data is stored alongside Vault to authenticate with a remote HSM. So it would be no worse if a Vault instance stored unencrypted data to authenticate with another running Vault instance. If it is desirable for an instance not to hold its own unseal key in memory, we could have different unseal keys for different nodes. The idea is rough, but probably doable unless we find a reason why not.

> The question, though, is whether HashiCorp would want to entertain that level of additional complexity in critical security code, to support a lesser security level for customers who wanted a different security/convenience tradeoff.

I think it first needs to be shown that the security level really is lower.

> Meanwhile, there are some possible compromises you can make today to create different unseal behaviours using the currently available building blocks:
>
>   1. You can just write the unseal key to disk unencrypted on the Vault nodes, and have a local wrapper script to inject it at startup. Very insecure, of course, but since we're discussing security for convenience, it seems worth mentioning.
>
>   2. You can create a super-secure service of your own, and give it enough Vault key shares to unseal Vault, which it transmits to the Vault API on request, subject to the business logic of your choice. It's no small undertaking, obviously, but it does open up the option for you to implement whatever rules you want, around disclosing the unseal key.

The first one throws away encryption at rest, so it is definitely not an option. The second is probably not an option in the form you described: it is a single point of failure and potentially exposes a controlling set of Shamir keys to a single operator, or else it is a vault in its own right.

One option I considered is a small, very simple auto-unseal service that holds a single Shamir key in memory. This would allow each Shamir key holder to delegate their key to a trusted environment controlled only by them. They start an instance, e.g. at home or at a different hosting provider, i.e. in an environment over which they have a high level of control and where they trust that multiple locations won't be compromised at the same time. The service can poll, or the Vault cluster can send it some sort of unseal request. A key holder can stop the delegation simply by shutting the service down. It is also OK for a single instance to be temporarily down if enough other instances are up. But again, the preferable way would be auto-unseal with a cloud provider plus some sort of fallback.
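A rough sketch of the polling variant of such a service is below. `ShareDelegate` is a hypothetical name, and the two callables are assumptions that keep the network out of the example: `fetch_status` would return the JSON from Vault's `sys/seal-status` endpoint (which reports `initialized` and `sealed`), and `submit_share` would send the share to `sys/unseal`.

```python
from typing import Callable, Dict


class ShareDelegate:
    """Holds one Shamir key share in memory and offers it to a sealed node (hypothetical)."""

    def __init__(self, key_share: str) -> None:
        self._key_share = key_share
        self._submitted = False  # True once the share was sent for the current sealed episode

    def poll_once(self,
                  fetch_status: Callable[[], Dict],
                  submit_share: Callable[[str], None]) -> bool:
        """Check the node once; return True if the share was submitted."""
        status = fetch_status()
        if not status.get("sealed", False):
            self._submitted = False  # node is unsealed: re-arm for the next restart
            return False
        if not status.get("initialized", False) or self._submitted:
            return False  # nothing to unseal, or this episode was already served
        submit_share(self._key_share)
        self._submitted = True
        return True
```

Stopping the process discards the share, which matches the idea that a key holder ends the delegation by shutting the service down. The sketch submits once per sealed episode, so it would not notice an unseal attempt that was reset while the node stayed sealed.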