rwblokzijl commented 8 months ago

Proposal for minimizing/removing secrets from the Terraform state file

With the additional benefit of low privilege Terraform plans.

Background:

Terraform is a game changer for infrastructure management, the ability to maintain everything as code is a godsend and our entire organization is using it to manage thousands of resources across 10 teams.

During our 3 years of setting up this infrastructure, we've noticed one issue that has been hard to reconcile: The fact that Terraform is "all powerful" whilst it is the only acceptable way of changing our infra. These two concepts are constantly at odds with each other. But it doesn't have to be this way.

terraform apply is really the dangerous operation. It affects the infra, it can delete crucial resources, it has "write" permissions. terraform plan however, does not do anything like that. It just does a bunch of GET requests, compares to the state, and proposes changes, it has "read" permissions.

This mean terraform plan should be runnable by any engineer at our company. They can make changes to the Terraform code on their git branch, run terraform plan locally, evaluate their changes, iterate, and eventually create a pull request, that runs another terraform plan, to be reviewed by the gatekeepers of our organization. Works great, everyone can develop with velocity, and the security is maintained.

However, its not that simple. There are a few issues that make terraform plan nearly as privileged as terraform apply. Resources contain secrets in their configurations. These secrets need to be read by terraform plan from a datasource API and later during Terraform refresh read again from the resource API for change detection. This means the underlying accounts needs permissions to read secrets from these APIs. Additionally, these secrets are then kept in the Terraform state, which is also readable by terraform plan.

During plan we'd want all operations to be as little privileged as possible, this way all engineers can develop and run `terraform plan`` without needing access to any secret values that could be misused for "write level" operations later.

Proposal

Please consider the following proposal for fixing this:

Read operations defined by the providers could be split between "plan" and "apply" time reads.

On plan time reads: Resources and datasources would have the option to omit the secrets and return metadata instead which uniquely identifies the secret value
On apply time reads: Resources and datasources would return the same data with the secrets included

This would require a change in the contract between Terraform and its providers. And 1 change to the way Terraform calls the providers during "read" operations.

Backwards compatibility should be easy enough, if the provider does not define separate plan and apply time read operations, just the plan time read operation would be called.

The information returned during the apply time read would have to be consistent with the earlier plan time read. However this should solvable possible in most situations. (eg. Metadata or a unique version identifier). Also there could be a check that all the other fields haven't changed in the meantime.

This also means the underlying platforms and APIs should support separate metadata and read permissions. This is already supported by eg Vault and GCP secrets.

An example:

A datasource "vault_kv_secret_v2" is used to configure a resource "google_secret_manager_secret_version".

The 2 read operations will be referred to as "plan_read" and "apply_read"

In this example, both these resources have been updated to use the following endpoints:

vault_kv_secret_v2 plan_read: https://developer.hashicorp.com/vault/api-docs/secret/kv/kv-v2#read-secret-metadata
vault_kv_secret_v2 apply_read: https://developer.hashicorp.com/vault/api-docs/secret/kv/kv-v2#read-secret-version
google_secret_manager_secret_version plan_read: https://cloud.google.com/secret-manager/docs/reference/rest/v1/projects.secrets.versions/list or https://cloud.google.com/secret-manager/docs/reference/rest/v1/projects.secrets/get
google_secret_manager_secret_version apply_read: https://cloud.google.com/secret-manager/docs/reference/rest/v1/projects.secrets.versions/get

Creation Plan (using low privileges):

Terraform calls plan_read of vault_kv_secret_v2, which will return only the metadata.
Terraform plans a create action and includes it in the .tfplan. This will include a reference to vault_kv_secret_v2 and the metadata (no secrets)

Creation Apply (using higher privileges):

Terraform calls apply_read apply_read of vault_kv_secret_v2, which will return the secret data
Terraform creates google_secret_manager_secret_version using the metadata
The Terraform state is updated with the new resource, which instead of the secret value, contains the metadata returned by the endpoint, as well as a reference to the datasource that is the source for the secret.
For the datasource the metadata is also stored.

Update/Read Plan:

Terraform calls plan_read of vault_kv_secret_v2, which will return only the metadata.
Terraform calls plan_read of google_secret_manager_secret_version, which will return only the metadata.
Terraform compares the metadata of both to the known metadata in the state
If the metadata is the same Terraform does nothing with it and continues.
If the metadata is different Terraform will include changes for all dependent resources in the state with a value that won't be known until apply.

Update: Apply

Terraform calls apply_read apply_read of vault_kv_secret_v2, which will return the secret data
Terraform updates google_secret_manager_secret_version using the metadata
The state is updated with the new metadata

Our current workarounds

We let all applications that can, fetch directly from our vault. This meant Terraform does not need read permissions at all.
We set up "proxy" serverless functions, that are privileged to read from vault, and configure the secrets in the destination. terraform apply has permissions to grant them access to the vault secrets, without having access itself. terraform plan only has permissions to view the permissions of the serverless functions, not the secrets themselves.
We don't generate secrets with Terraform, or accept the fact they will be visible in the state. We experimented with serverless functions or null_resources to generate these. But this caused issues in the resource dependencies since the secrets aren't available yet at plan.

Expansions on the idea

Secret generation currently done by the "tls" or "random" providers could be done by secret providers such as HashiCorp Vault.
Resources that don't support returning the metadata, but don't return the secrets could just be "assumed" to not be updated in the background. (depending on provider implementation or lifecycle.ignore_changes)
Resources that don't support returning the metadata, and do return the secrets could be "assumed" to not update at all (including non-secret fields). This would require Terraform to add a "refresh=false" flag to the lifecycle.

Additional benefits of apply time and read time fetches:

Another issue to setting up lower privileged plans is that provider configurations could depend on a datasource for authentication eg google_service_account_jwt for accessing vault in our GCP pipelines. If these values could update between plan and apply, it could be swapped for a higher privilege token on apply. In addition to the token being fresh.

https://github.com/hashicorp/terraform/issues/32100

crw commented 8 months ago

Thanks for this feature request! If you are viewing this issue and would like to indicate your interest, please use the 👍 reaction on the issue description to upvote this issue. We also welcome additional use case descriptions.

I've sent this to our product manager for review as well. Thanks again!

jbardin commented 8 months ago

Thanks for the write-up @rwblokzijl!

This sounds more like a proposal to help remove sensitive information from the plan, which is definitely a valid concern. In the current architecture, delaying the read of secrets until apply would still end up with the same data stored in the state. Just to clarify some things here while other research is in progress, points which can't be addressed by these changes alone are as follows:

There is not yet a language-level definition for "secret", which is something that needs to be codified first. We can't only make use of attributes marked as "sensitive", because providers can't account for the change in protocol when they have already defined many sensitive fields. Many existing sensitive fields need to maintain the current workflow to not break their functionality.
Secrets as commonly conceptualized are often required for managed resources, which must currently store the value in the state regardless of whether it's provided during plan or apply. A new concept for the type and handling of the data would need to be created to prevent this storage (some research has been done in this area already, but it's not easy to fit in the existing language and protocol), or maybe they aren't allowed within managed resources.
Many examples of providers and resources require their secrets in order to successfully create a plan; delaying the reading of secrets until apply will break these use cases.
Only removing the secrets sometimes (i.e. making exceptions when they are needed for planning) means they can still appear in the plan. A solution which does not entirely remove secrets from state is not an improvement in security

The idea that apply is an action which requires elevated privileges is interesting. That might not be a complete solution for the CLI tool on its own, but within a managed workflow it might be part of a larger solution for secrets management.

hashicorp / terraform