hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Soft state locking #30277

Open deberon opened 2 years ago

deberon commented 2 years ago

Current Terraform Version

N/A

Use-cases

When designing deployment pipelines, it would be nice to have the ability to completely lock the state between a plan and apply stage. That way no other modifications to the state can be made while a change is pending.

Attempted Solutions

I looked for ways to manually lock the state and didn't find anything apart from manually editing the state file.

Proposal

Having the ability to initiate a "soft" lock on the state would be very helpful logistically when designing CI/CD pipelines. The existing whole-workspace locking mechanism is sufficient, I am just proposing the ability to manually turn it on and off. Here is an example workflow, imagine a manual intervention step between the plan and apply stages:

plan stage

```shell
terraform plan -out plan.out -soft-lock-id <lock_id>
```

apply stage

```shell
terraform apply plan.out
```

failure catching stage

```shell
terraform state unlock <lock_id>
```

I would also propose adding `terraform state` subcommands for managing these locks, such as the `terraform state unlock <lock_id>` used above.

Based on my understanding, this might work with the following changes.

State file

```diff
 {
   "version": 4,
+  "soft_lock_id": "<lock_id>",
   "terraform_version": "1.1.2",
   "serial": 1,
   "lineage": "676fba34-18c2-25bc-b542-eafc3190dd35",
   "outputs": {
     "hi": {
       "value": "hi",
       "type": "string"
     }
   },
   "resources": []
 }
```

Plan output

The output file generated by `terraform plan -out` could include an additional file called `lock_id`. The contents of this file would be the lock id that is currently locking the state. The apply command could read the lock id either from this file or from a parameter passed at runtime: `terraform apply -soft-lock-id=<lock_id>`
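The apply-time behavior being proposed can be sketched in a few lines. This is a hedged illustration only: `soft_lock_id` and the lock id carried by the plan are the proposal's hypothetical names, and `check_soft_lock` is not a real Terraform function.

```python
# Hypothetical apply-time check under the proposal: if the state carries a
# soft lock, the caller must present the matching id (from the plan's
# lock_id file or the -soft-lock-id flag); otherwise apply proceeds as today.

def check_soft_lock(state, presented_lock_id):
    """Refuse to apply when a soft lock is held and the ids do not match."""
    held = state.get("soft_lock_id")
    if held is None:
        return  # no soft lock: behave exactly as today
    if presented_lock_id != held:
        raise RuntimeError(
            "state is soft-locked by %r; pass the matching lock id" % held
        )

state = {"version": 4, "serial": 1, "soft_lock_id": "pipeline-1234"}
check_soft_lock(state, "pipeline-1234")  # ok: the plan carries the right id
try:
    check_soft_lock(state, None)  # a caller unaware of the soft lock
except RuntimeError as err:
    print(err)
```

A client that never reads `soft_lock_id` simply skips this check, which is what would keep the feature opt-in.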

I believe this would allow clients to ignore soft locks entirely, since additional keys in the state file shouldn't cause any parsing problems (though that is an assumption on my part), and the backend locking mechanism would still be in place for critical state locking. I believe this would make the feature opt-in and backwards compatible.
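That assumption about extra keys can at least be illustrated in miniature. The sketch below uses plain Python `json` handling, not Terraform's actual Go decoder (whose tolerance for unknown keys is exactly the open question), to show that a reader consuming only known keys simply skips `soft_lock_id`:

```python
import json

# A state document with the proposed extra key, as in the example above.
state_json = """
{
  "version": 4,
  "soft_lock_id": "example-lock-id",
  "terraform_version": "1.1.2",
  "serial": 1,
  "lineage": "676fba34-18c2-25bc-b542-eafc3190dd35",
  "outputs": {"hi": {"value": "hi", "type": "string"}},
  "resources": []
}
"""

state = json.loads(state_json)

# A reader that only consumes the keys it knows about is unaffected by
# the additional "soft_lock_id" entry sitting alongside them.
known = {k: state[k] for k in ("version", "serial", "lineage", "outputs", "resources")}
print(known["version"])          # 4
print("soft_lock_id" in state)   # True, but a legacy reader never looks
```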

deberon commented 2 years ago

Per the contribution guide, I intend to implement this change via a PR.

apparentlymart commented 2 years ago

Hi @deberon! Thanks for this proposal.

I've not thought through this all completely yet, so this is just an initial thought and I'd love to hear what others on our team think here too.

An important thing to consider here is that not all backends are able to sustain a lock without a running process actually "holding" the lock. For example, the local backend uses flock (or similar on other platforms) so the lock is released implicitly when the CLI process exits, and I believe the "consul" backend needs to hold open a TCP socket to a Consul server in order to sustain the lock.
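The flock behavior described here can be demonstrated in miniature. This is a small Unix-only Python sketch using the stdlib `fcntl` module (nothing from Terraform itself): a child process takes an exclusive lock and exits without ever unlocking, and the parent then acquires the lock immediately, because the kernel released it when the child's file descriptor closed.

```python
import fcntl
import os
import subprocess
import sys
import tempfile
import textwrap

lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")

# Child: take an exclusive lock, then exit WITHOUT calling LOCK_UN.
child_src = textwrap.dedent(f"""
    import fcntl
    f = open({lock_path!r}, "w")
    fcntl.flock(f, fcntl.LOCK_EX)
    # the process exits here; the kernel drops the lock with the descriptor
""")
subprocess.run([sys.executable, "-c", child_src], check=True)

# Parent: the lock is free again even though nobody explicitly released it.
with open(lock_path, "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # succeeds immediately
    acquired = True
print("lock acquired after child exit:", acquired)
```

This is why a backend built on flock-style locks cannot sustain a lock across separate CLI invocations.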

One way we could address that is to allow backends to each individually opt in to supporting this sort of explicit locking, and thus it can be left unsupported (with an explicit error message) on backends that can't support it.

Another possibility would be to have a command you can run which stays running as the means to hold the lock, and then you release the lock by interrupting that process. That would then work for all backends in principle, but would still require a means like you proposed here for other commands running in the same directory to be able to use the same lock.
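That second idea can be sketched as well (again with Unix `flock` via Python's `fcntl` standing in for a real backend, and every name hypothetical): a helper process acquires the lock and keeps running to hold it, other clients see the lock as taken, and interrupting the helper releases it.

```python
import fcntl
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")

# Hypothetical "hold the lock until interrupted" helper process.
holder_src = textwrap.dedent(f"""
    import fcntl, signal
    f = open({lock_path!r}, "w")
    fcntl.flock(f, fcntl.LOCK_EX)
    print("held", flush=True)
    signal.pause()  # keep running (and keep the lock) until a signal arrives
""")
holder = subprocess.Popen([sys.executable, "-c", holder_src],
                          stdout=subprocess.PIPE, text=True)
assert holder.stdout.readline().strip() == "held"

# While the helper runs, another client cannot take the lock.
f = open(lock_path, "w")
try:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    held_elsewhere = False
except OSError:
    held_elsewhere = True

# "Release the lock by interrupting that process."
holder.send_signal(signal.SIGINT)
holder.wait()
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds
f.close()
print("lock was held by helper:", held_elsewhere)
```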

The backend locking semantics tend to be deceptively complex in spite of the relatively simple API, so I expect there are some other similar subtleties to consider here, but this was the one that came to my mind while initially thinking about this.

sameershah21 commented 2 years ago


If the backend locking mechanisms are too complex to change, one other way that I have been implementing to counter the need for a lock (in AWS) would be using IAM. I have found the following setup to be helpful. Note that these steps are for AWS, but something similar can be accomplished in other clouds (Azure, GCP, etc.). Create:

Now when CI/CD pipelines run, they use the CI/CD principal with the assumed role to lock and access the bucket object. No other principal would be able to write to this S3 bucket object.
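For illustration, a bucket policy along these lines could deny state-object writes to everyone except the CI/CD role. All ARNs and names below are placeholders, and `Deny` with `NotPrincipal` has subtleties around assumed-role sessions that are worth checking in the IAM documentation before relying on it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyStateWritesExceptCICD",
      "Effect": "Deny",
      "NotPrincipal": {"AWS": "arn:aws:iam::123456789012:role/cicd-terraform"},
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::example-tfstate-bucket/env/prod/terraform.tfstate"
    }
  ]
}
```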

I am not sure how feasible this solution is for every use case, but it has worked well for me over the years, so I just wanted to share it.

jbardin commented 2 years ago

For some background, the original design intent of the state locking mechanism was only to guard the state data against concurrent modification (early on, users concurrently working with S3 without some sort of global orchestration mechanism would find themselves with the wrong state at times). The -lock-timeout option was only added after the fact once we could be sure that it didn't impose any undue restrictions on backend maintenance. We purposely did not create a model which allowed persistent locks, not only because not all implementations could maintain such locks, but it was outside the design goals of the Terraform CLI. Locking the state is only part of implementing a complete workflow in Terraform, hence the workflow tooling should also manage the various levels of synchronization.

Once we have a new interface designed for remote state storage, we can document a more precise contract for the locking mechanisms. Hopefully in the process we can simplify things a bit for implementors, though the semantics may change slightly in the process.

deberon commented 2 years ago

@apparentlymart I'm not proposing any changes to the existing locking mechanism. Instead I'm suggesting a process that writes an arbitrary lock id directly into the state (my proposal adds a new top level key to the state object). So the state can be unlocked (from a backend perspective) while still allowing an external process to make a claim to the state. This claim can even be ignored by terraform apply by default, which would maintain existing functionality. My worry is that this might compromise some understanding of when and how changes are written to the state.
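That claim-without-blocking model can be sketched as pure JSON manipulation (all function names here are hypothetical; in practice an external tool would also have to pull and push the state through the configured backend):

```python
# Hypothetical external soft-lock tool under the proposal: it stamps or
# removes the proposed soft_lock_id key in the state document, while the
# backend's own locking remains untouched.

def soft_lock(state, lock_id):
    """Claim the state; refuse if someone else already holds the claim."""
    current = state.get("soft_lock_id")
    if current not in (None, lock_id):
        raise RuntimeError("already soft-locked by %r" % current)
    locked = dict(state)
    locked["soft_lock_id"] = lock_id
    return locked

def soft_unlock(state, lock_id):
    """Release the claim, verifying that the id matches."""
    if state.get("soft_lock_id") != lock_id:
        raise RuntimeError("lock id does not match; refusing to unlock")
    unlocked = dict(state)
    del unlocked["soft_lock_id"]
    return unlocked

state = {"version": 4, "serial": 1, "resources": []}
locked = soft_lock(state, "pipeline-42")
print(locked["soft_lock_id"])                                # pipeline-42
print("soft_lock_id" in soft_unlock(locked, "pipeline-42"))  # False
```

An apply that never inspects `soft_lock_id` behaves exactly as it does today, which is the backwards-compatibility argument above.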

Thanks for taking a look at this!

n2N8Z commented 1 year ago

https://github.com/hashicorp/terraform/issues/17203