Proposal: State Encryption

StephanHCB commented 1 year ago

Suggest an existing issue in legacy Terraform to fix in OpenTF.

I already made a PR for full client side state encryption for all backends except the extended backends 2 years ago.

I would be more than happy to either submit it as a stale PR, or just make a PR. Whichever works better for you. I actually have a local patch based on 1.5.5 which I'd just need to commit.

Technical description / Proposal - see comments below - to be added here when at least some consensus is achieved and people have had a chance to comment.

Draft PR see in the comments below, but note that the implementation will likely differ from this PR. I have tried it extensively with state stored on Azure storage accounts, and it is known to work in this scenario. The PR includes documentation as an experimental feature and extensive unit tests.

I would love not to have to maintain an internal fork for this any more :)

cube2222 commented 1 year ago

@StephanHCB Hey!

We'd appreciate it a lot if you could resubmit this as an RFC on the OpenTF project, once the repo is open, as it seems you've thought this through a lot.

I did a rudimentary working PoC of this during the last weekend as well with pbkdf2 and aead, to better understand the problem domain.

Anyway, one idea we've had and discussed, was possibly using SOPS under the hood, as it already solves the use case of encrypting json/yaml files really well (and what is a state file if not that). The advantages would be

encrypting only the sensitive values has a lower chance of breaking state backends that actually try to parse your state (and I suppose there might be some that do), as opposed to sending an encrypted blob of the whole json
out of the box supporting all encryption backends that sops integrates with.

We could encrypt plan files, too, for safer storage if you're only applying at a later time.

I'd love to hear your thoughts on that.

StephanHCB commented 1 year ago

We'd appreciate it a lot if you could resubmit this as an RFC on the OpenTF project, once the repo is open, as it seems you've thought this through a lot.

Will do. I am happy this issue with terraform's security will finally get addressed.

I did a rudimentary working PoC of this during the last weekend as well with pbkdf2 and aead, to better understand the problem domain.

There are indeed a number of lifecycle events to consider:

initial encryption
key rotation
moving away from state encryption
switching state encryption providers

Then there are the manual operations that need access to the state, where tf state pull gets new, rare, edge cases:

state doctoring in the presence of a truncated state blob

Anyway, one idea we've had and discussed, was possibly using SOPS under the hood, as it already solves the use case of encrypting json/yaml files really well (and what is a state file if not that). The advantages would be
1. encrypting only the sensitive values has a lower chance of breaking state backends that actually try to parse your state (and I suppose there might be some that do), as opposed to sending an encrypted blob of the whole json

2. out of the box supporting all encryption backends that sops integrates with.

Generally, my solution supports pluggable encryption backends via a simple interface they have to implement, one implementation could very easily be one using SOPS. You select which encryption backend you want via a pair of environment variables, one for the current encryption configuration and one fallback used only for decryption (for key rotation, switching encryption backends, or altogether moving away from state encryption).

SOPS was brought up on the original issue. You've summarized the upsides quite well, the biggest downside was in my reply back then (https://github.com/hashicorp/terraform/issues/9556#issuecomment-1013958947):

"It's certainly a lot more complex than what this PR does, at least if you want to do it in a secure fashion. TF providers keep adding fields in the state, and if you miss one that's sensitive, you are again leaking credentials. This would have a high probability of giving a false sense of security IMHO, which is often actually very bad for security."

Where I work, this was one of the reasons we decided to encrypt the entire state.

We could encrypt plan files, too, for safer storage if you're only applying at a later time.

Good point. That is especially relevant for shipping / storing / processing plan files "in the cloud". On premises, there is less point to it, because you have to have access to both the encryption and the decryption keys anyway for every tf run.

0x91 commented 1 year ago

We could encrypt plan files, too, for safer storage if you're only applying at a later time.

This is only somewhat related but with plan files it's also worthwhile to look at solutions that prevent tampering as gaining the ability to modify a plan file is quite problematic as something executing OpenTF will generally have highly privileged access.

I'm not a cryptographer, but using AEAD should satisfy both the requirement to prevent sensitive values and ensuring the integrity of plans before applying them?

cube2222 commented 1 year ago

Alright, now that we have the repo in place I think we can continue the discussion on this.

@StephanHCB would you be able to rewrite the issue in a way that doesn't link to the Terraform repo? We're trying to keep any issues here self-contained and generally detach from the Terraform issue tracker.

Then I think we can discuss this a bit more here and possibly get an RFC going?

roni-frantchi commented 1 year ago

Hey @StephanHCB thanks for submitting!

To help us maintain a clear separation between opentf and hashicorp's offerings, we're asking that people describe issues that are in other repositories rather than linking those directly. I've thus scrubbed out any links to said issues.

We'd love to get an RFC going of course.

It is also worth mentioning that as an author of a PR to legacy your are permitted under legacy Terraform CLA to resubmit your code elsewhere as well.

Thanks!

ekini commented 1 year ago

For additional reference, Pulumi uses https://gocloud.dev/howto/secrets/ under the hood to provide this functionality. That allows to easily cover all major clouds and even keep a CMK in a different cloud to the state file.

StephanHCB commented 1 year ago

Alright, let's get started. Let's talk motivation and requirements.

The Case for State Encryption

OpenTF state contains lots of sensitive information.

The most obvious example are credentials such as primary access keys to storage, but even ignoring any credentials state often includes a full map of your network, including every VM, kubernetes cluster, database, etc. That is a treasure trove for an attacker who wishes to orient themselves in your private network.

Unlike runtime information processed by OpenTF, which only lives in memory and is discarded when the tf run ends, state must by its very nature be persisted. In large installations, tf state is not (just) stored in local files because multiple users need access to it. Remote state backend options include simple storage (such as storage accounts, various databases, ...), meaning these storage options do not "understand" the state, but there are also extended backends, which do wish to gain information from state. The persistent nature and (often) cloud storage of state increases the risk of it falling into the wrong hands.

Large corporations and financial institutions have compliance requirements for storage of sensitive information. One frequent requirement is encryption at rest using a customer managed key. From a security perspective, ideally such a key is not even made available to a cloud provider storing the state, for if the key is available and encryption is transparent, encryption at rest reduces to little more than a proxy for RBAC.

Requirements for State Encryption

Ability to encrypt

local state files
remote state
plan files

using a key supplied to terraform at runtime, either directly or through a key retrieval mechanism.

Off by default: Default to no encryption, thus maintaining full compatibility with existing tf versions.

The following lifecycle events need to be supported

initial encryption
moving away from state encryption
key rotation
switching state encryption methods

The following manual operations need to be supported

state pull with/without decryption
state push
state doctoring in the presence of a truncated state blob

Encryption methods should be pluggable to be able to support

incremental feature implementation (e.g. have a full encryption option out quickly and later add other methods)
full state encryption
partial state encryption
easy localized implementation changes to a single method
key retrieval from typical vault products
forks that just add a key retrieval / encryption method (we will never be able to support everything out there, especially some in-house solutions)

Partial state encryption should retain the structure of the state json and just mask sensitive values, so that the extended backends can still work with the state but not see any credentials, while full state encryption helps prevent information disclosure from state.

Encryption should be transparent

Encrypted state needs to be json, because many of the current backend implementations expect it to be. How much the json resembles the structure of the unencrypted state will depend on the encryption method.

Note: asymmetric encryption algorithms are of little use given the way tf needs to both read and write state in the same run for many operations. You will most likely always need a symmetric session key, but where that key comes from is a different matter. It is entirely feasible to have a state encryption method that obtains its key from some kind of key vault.

Any comments are welcome.

StephanHCB commented 1 year ago

Anyway, one idea we've had and discussed, was possibly using SOPS under the hood, as it already solves the use case of encrypting json/yaml files really well (and what is a state file if not that).

SOPS seems to fulfill a significant portion of the above requirements, so I think we should take a closer look at it.

I haven't worked with SOPS before, but just looking at their github repo I gather:

License is Mozilla 2.0, so that's good
It pulls in a pretty significant dependency tree - is there a process for vetting dependency licenses?
It exports a stable library interface for decryption, but not for encryption as far as I can tell. The idea seems to be that you use their editor to write config files, and use their library for decrypting them. This does not match our use case, do we have a good solution to that?

Do we already have an idea how we're going to decide which field values need encrypting? That seems to me to be the biggest hurdle we'd have to overcome, to not impose on the community a need to maintain a huge list of fields to encrypt.

StephanHCB commented 1 year ago

For additional reference, Pulumi uses https://gocloud.dev/howto/secrets/ under the hood to provide this functionality. That allows to easily cover all major clouds and even keep a CMK in a different cloud to the state file.

This is also a very promising candidate, at least for the full state encryption use case.

License is Apache 2.0
Pulls in a pretty significant dependency tree - same question as for SOPS: how to vet them?
Supports GCP Key Mgmt, AWS Key Mgmt, Azure Key Vault, Hashicorp Vault and local keys
Used by Pulumi for a very similar purpose, so it's real world tested

Personally this will be the first library I'll look at for a prototype implementation.

StephanHCB commented 1 year ago

How should state encryption be configured?

In my previous implementation, I introduced two environment variables:

TF_REMOTE_STATE_ENCRYPTION: configuration used for encryption and decryption
TF_REMOTE_STATE_DECRYPTION_FALLBACK: fallback configuration for decryption, tried if decryption fails with the first choice

You set each of these to a json document with two fields

implementation: select a state encryption method by name
parameters: configure the state encryption method

I used environment variables because those do not show up in process lists, and also can be centrally provided for all users on a machine. Also, works with terraform wrappers such as terragrunt with no code changes, because environment variables are "just present".

Turns out this is also a good choice for CI systems, works with Spacelift for example, you can build it so the key never leaves your local private worker deployment.

I will create a draft PR with my old code for everyone to look at, hopefully I can get around to that on Monday, although with the additional requirements I doubt the end result will actually include much of that code other than the interface for pluggable state encryption methods.

cube2222 commented 1 year ago

It exports a stable library interface for decryption, but not for encryption as far as I can tell. The idea seems to be that you use their editor to write config files, and use their library for decrypting them. This does not match our use case, do we have a good solution to that?

Yeah, I think SOPS would need to be used by shelling out to the CLI, not great.

I do agree that the Go Cloud library looks very promising here, esp. since Pulumi is using them already.

Do we already have an idea how we're going to decide which field values need encrypting? That seems to me to be the biggest hurdle we'd have to overcome, to not impose on the community a need to maintain a huge list of fields to encrypt.

The schema contains info about which fields are sensitive, it's currently also included in the plan file, so in other words the info should already be there.

How should state encryption be configured?

That's a good question! I'm not sure, open to hearing more opinions here. Environment variables are the obvious solution, but we could also think about .terraformrc or even a custom file in the .terraform directory in the pwd (similar to how "current workspace" is stored). Obviously that would most likely not include the secret itself, if there is one (for the most simple implementation that involves a passphrase), just config details like the KMS key id. Secrets I believe shouldn't be stored in any plain-text config files.

I will create a draft PR with my old code for everyone to look at, hopefully I can get around to that on Monday, although with the additional requirements I doubt the end result will actually include much of that code other than the interface for pluggable state encryption methods.

Sounds good, and thanks a lot for this! Btw. no rush, I don't think we'll be able to merge this prior to the 1.6 stable release, which is now our main priority. A PoC would however be great grounds for further discussion.

cube2222 commented 1 year ago

I skimmed the draft and it's looking good so far @StephanHCB, I'll do a more thorough review tomorrow.

One thing I noticed is that the AES encryption I think can be simplified by using a couple of primitives from the Go stdlib (though I'm admittedly not an encryption-expert).

I have done a PoC of this 2 weeks ago (apologies for not posting it earlier, I should've done that when the repo went public) and I think we don't need to do authentication manually + we can use pbkdf2 to make the key arbitrary (rather than requiring users to provide the long random key), here's a very brief diff of my PoC which includes these: https://github.com/opentffoundation/opentf/compare/main...kubas-state-encryption-experiment

It's by no means a strong opinion, just something I noticed when taking a glance, I may be wrong about this.

StephanHCB commented 1 year ago

I have done a PoC of this 2 weeks ago (apologies for not posting it earlier, I should've done that when the repo went public) and I think we don't need to do authentication manually + we can use pbkdf2 to make the key arbitrary (rather than requiring users to provide the long random key), here's a very brief diff of my PoC which includes these: main...kubas-state-encryption-experiment

I like how your PoC keeps the implementation less complex by just extending Client. Will have to play around with that idea to see if this approach can be generalized to include local state files and plan files. That was part of the reason I pulled it out into its own interface, even though I never got around to implementing encryption for local state files.

I also like being able to generate the encryption key from a passphrase. Perhaps we should look into separating the step that obtains the key from the step that actually performs the encryption.

I do think we will need to ensure the encrypted state is encoded into json, because there are some backends that expect it to be. That was the reason I opted for hex-encoding into a very simple json structure. In practice it has also turned out helpful that state is text rather than binary when copying state around.

bgshacklett commented 1 year ago

I'm concerned that this proposal jumps directly to a specific solution (encryption), rather than having further discussion about the root problem, which is storing sensitive values in state, period.

I don't have any knowledge about SOPS; it may be a fantastic tool. I do know, however, that as soon as encryption is brought into the mix, things become significantly more complicated. Key management with a single workstation is problematic enough. Moving that into CI/CD tooling and handling team access makes that all the more painful.

Additionally, if someone has access to run plans against the state file, encrypted or not, they have access to anything stored in it. This means that they potentially have access to secrets which they have absolutely no business having, purely because they're standing up the infrastructure. Depending on what regulations a company falls under (or even partner agreements), this can be a very real problem.

Certainly there are solutions that can be put in place to mitigate that concern, but they all add additional complexity, both with tooling and restrictions to workflows. We're talking about a class of tools which is supposed to enable us to do our jobs, not create additional hindrances. Anything that puts additional restrictions on how the tool can be used should be met with great skepticism.

I still haven't seen a good argument as to why these values need to be stored at all. Certainly I've heard that existing API contracts with providers have this requirement, but nothing that I've seen or read says that this is a fundamental requirement of the problem domain, only that it's a difficult problem to solve. The OpenTF fork presents a unique opportunity to fix this original design mistake. It might be difficult, but I fear that adding an additional band-aid of encryption over the top of a flawed design is likely to solidify it for the foreseeable future. Eventually another tool will be available which doesn't have this problem, and it will be a real differentiator if OpenTF still does.

Ozzard commented 1 year ago

What's the threat model - who are we defending against here? All of my state is valuable to external attackers, for example, as it gives a snapshot of my infrastructure without them having to breach it. Therefore it's all sensitive given that threat model. Full encryption is perfectly reasonable, but I'd suggest that it's not a complete solution because avoiding credential storage would provide further benefits and mitigate different threats in the model.

StephanHCB commented 1 year ago

I'm concerned that this proposal jumps directly to a specific solution (encryption), rather than having further discussion about the root problem, which is storing sensitive values in state, period.

[...]

I still haven't seen a good argument as to why these values need to be stored at all. Certainly I've heard that existing API contracts with providers have this requirement, but nothing that I've seen or read says that this is a fundamental requirement of the problem domain, only that it's a difficult problem to solve. The OpenTF fork presents a unique opportunity to fix this original design mistake. It might be difficult, but I fear that adding an additional band-aid of encryption over the top of a flawed design is likely to solidify it for the foreseeable future. Eventually another tool will be available which doesn't have this problem, and it will be a real differentiator if OpenTF still does.

I certainly agree that state should not contain sensitive values.

I do not agree that sensitive values in state are the only problem state and planfile encryption addresses.

The opentofu manifesto says (https://opentofu.org/manifesto): "[...] Backwards-compatible - so that the existing code can drive value for years to come" Meaning, at least in the short term, there needs to be a very good reason to introduce incompatible changes.

Your proposal is of a much larger scope, and has a much longer time scale for implementation, and frankly I do not see how it can be implemented without breaking the promised compatibility. If you ask me, your proposal should become a separate Rfc, with much larger consequences, resulting in changes to a lot of providers.

Transparent, off-by-default, state encryption provides a short- to mid-term solution that can be implemented in a way such that it does not break compatibility.

I do know, however, that as soon as encryption is brought into the mix, things become significantly more complicated.

That is why it's off-by-default. If you don't like the complexity, you can just not set the environment variables (or whatever configuration option it'll be in the end), and everything will work exactly as before.

Ozzard commented 1 year ago

Your proposal is of a much larger scope, and has a much longer time scale for implementation, and frankly I do not see how it can be implemented without breaking the promised compatibility. If you ask me, your proposal should become a separate Rfc, with much larger consequences, resulting in changes to a lot of providers.

I think there are ways, the simplest being much more use of environment variables in planning and running plus a lookaside file that's OpenTofu-specific about which values for which providers should be substituted in from run's environment and not kept in the state. That's then a general censorship mechanism. But, IMO, it's not this change.

StephanHCB commented 1 year ago

For additional reference, Pulumi uses https://gocloud.dev/howto/secrets/ under the hood to provide this functionality. That allows to easily cover all major clouds and even keep a CMK in a different cloud to the state file.

Personally this will be the first library I'll look at for a prototype implementation.

I am now at the point where I was able to take a closer look at https://gocloud.dev/howto/secrets/.

Reading the documentation, we learn that

The secrets keeper API is designed to work with small messages (i.e. <10 KiB in length.) Cloud key management services are high latency; using them for encrypting or decrypting large amounts of data is prohibitively slow (and in some providers not permitted).

Indeed, Pulumi does not encrypt the actual state using this library, it only uses it to derive a regular AES256 session key, obtained from various Vault products, then the actual encryption step is done with built-in AES-GCM primitives.

My next step will be to add the key derivation functions provided by https://gocloud.dev/howto/secrets/. This will add support for GCP Key Mgmt, AWS Key Mgmt, Azure Key Vault, and Hashicorp Vault.

I've already got pbkdf2 (based on @cube2222's work), aes256 and the stacking mechanism working. The code will need some polishing and cleanup, and test coverage is currently incomplete, but that's for the time after the PoC.

marksisson commented 1 year ago

Wanted to callout a few things regarding SOPS that could be really beneficial to OpenTofu.

SOPS is accepted to CNCF in Linux Foundation at sandbox maturity level
SOPS uses envelope encryption (i.e., like Pulumi as stated in previous comment)
SOPS has easy key rotation
SOPS has a client/server architecture
SOPS has auditing
SOPS supports Shamir's Secret Sharing for information-theoretic security
SOPS can be consumed as Go module to integrate SOPS with OpenTofu (decrypt package has stable API and other packages don't seem to change too often) - instead of shelling out to SOPS binary

StephanHCB commented 1 year ago

Format of encrypted state

(this comment has been moved to subsection "Format of Encrypted State" of the Technical Description section of the RFC draft, see below)

StephanHCB commented 1 year ago

For everyone's reference, here is the WIP of a RFC I am currently working on. I'll keep updating this comment as I find time to write more. I think I am at the point where I can write it all down now.

@cube2222 should I just edit the initial post once I am done or how do we retrofit this issue to become an RFC? I'd like to not lose the excellent discussions.

MOVED

(this comment contained the draft version of the RFC. It was moved to #874 in its entirety and will be maintained over there)

cube2222 commented 1 year ago

This is excellent @StephanHCB! I didn't yet have the chance to review it, but will do so as soon as I have some time.

Ideally, please create a separate RFC that references this issue. The idea is that this issue is tracking the general "state encryption feature" enhancement request, while your RFC will be a concrete implementation proposal. Both issues will stay, so no discussion will be lost.

StephanHCB commented 1 year ago

I have implemented a roundtrip encryption example with SOPS and pushed it on a separate branch.

In summary, you can tell that this is not currently an intended use case for SOPS.

Encrypting just the sensitive values is not really feasible. The field include/exclude selection works by Suffix or Regex only, while we have a list of paths that is located in the document, and the selection logic is in non-exported functions. Perhaps, we could construct a long regular expression that matches all the field names we want to encrypt, but that is just error prone and probably slow. You can encrypt all the leaf values, though.
For encryption, you have to manually construct encryption metadata, including the key retrieval data structures, because unfortunately this part of the code is kept private in the cmd implementation.
SOPS insists on external key management (one of the 4 vaults, openpgp, or age), no exported functions with support for just giving it the key which you have obtained in a seperate step. To make that work, you have to fudge metadata for decryption in a rather ugly hack, again because the functions you'd need are all there, but not exported.
The lack of exported functions for the individual steps involved means automated testing has to be done at a high integration level. Which is not the fault of SOPS, they aren't saying they are a library for encryption and decryption. Nevertheless, the contrast to gocloud-secrets couldn't be more drastic.
The resulting code does not look very maintainable, even if we were to accept that local key management has to be done with openpgp or age, which I personally find unnecessarily complex for our use case. The code that results for encryption in this case is commented out in the commit I pushed. The code looks better, but now you suddenly have to do e.g. openpgp key management outside of OpenTofu, and explain that to users of the state encryption feature.

This situation may improve somewhat in the future. See:

https://github.com/getsops/sops/issues/903 (July 2021)
https://github.com/carlpett/terraform-provider-sops/issues/73
https://github.com/getsops/sops/pull/1174 (especially the recent discussions from Aug 2023 regarding how to set up KeyGroups)

cube2222 commented 12 months ago

@StephanHCB Finally got some time to properly review. Great work overall!

I've read the RFC and everything makes sense to me. It looks like you've covered it very extensively.

I have no strong opinion regarding SOPS and just using gocloud secrets makes sense to me.

Regarding the sensitive_fields array, @Yantrio knows those details very well, so should be able to provide feedback here.

I was thinking whether the encryption configuration should indeed be a json object passed as an env var, rather than maybe an encryption.hcl file in the project root. However, it seems that using an encryption.hcl file would be less transparent and could result in edge cases with tools like terragrunt, which might be running tofu in different directories.

One topic I think is worth discussing in the RFC is the terraform_remote_state datasource, and whether we'd want to support a decryption config there. I believe the answer to be yes. However, since it's a built-in datasource, adapting it should be fairly trivial.

StephanHCB commented 11 months ago

One topic I think is worth discussing in the RFC is the terraform_remote_state datasource, and whether we'd want to support a decryption config there. I believe the answer to be yes. However, since it's a built-in datasource, adapting it should be fairly trivial.

I like this suggestion! Not sure yet about the precise details yet, the terraform_remote_state data source is not where one normally configures remote state storage, I'll have to try out where exactly to best place it, but there are obvious benefits:

It is easy to keep backwards compatible if not used.
It actually also works with terragrunt, because the remote state backend configuration is almost always injected like this, though at the cost of having to make a change to the code in a single central place.
It allows the user to check in their encryption configuration with their code, as long as they are using a key retrieval system, or just live with checking in the key.

I have moved the RFC to its own issue #874 as you suggested.

anjmao commented 6 months ago

Maybe will be useful for someone.

I have terraform state pointing to local file.

terraform {
  backend "local" {
    path = "../../vault/local/terraform.tfstate"
  }
}

I'm using https://github.com/FiloSottile/age to encrypt state file which is stored in source control too.

vault-encrypt:
    tar cvz ./vault/local | age -r id123 > ./vault/data.tar.gz.age

vault-decrypt:
    age --decrypt -i ~/.age/private-key ./vault/data.tar.gz.age | tar zxf - -C .

opentofu / opentofu

Proposal: State Encryption #297