hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

terraform lock resource #26422

Open steeling opened 4 years ago

steeling commented 4 years ago

Hi there,

Apologies if this is already possible, but I don't see a command listed here

I'd like to propose adding a

terraform lock -target=resource

command so that a user can lock a resource to prevent both automation and other users from making changes during outages. This is a principle taken from lock-out/tag-out, as used in industrial equipment maintenance, and applied to software maintenance.

steeling commented 4 years ago

I'm not sure if terraform supports locking by resource, or just locks the entire state file. If currently doing the latter, even adding a lock command for the entire file would be helpful.

Additionally, adding a lock command could help if I want to run multiple commands transactionally, e.g. the sequence from https://github.com/hashicorp/terraform/issues/26423:

tf lock
tf check-for-diff -lock=false
tf apply -lock=false
tf force-unlock

apparentlymart commented 4 years ago

Hi @steeling! Thanks for this enhancement request.

Terraform does indeed currently model locking as a whole-workspace idea (usually implemented by locking the object that's storing the state, as you mentioned). Some of the locking implementations are also unable to hold a lock without keeping a terraform process running to hold it, and so that's why Terraform doesn't currently have a command to just create a lock without its lifecycle being connected to some other operation.

A possible compromise here could be a command that takes the lock and then blocks at the terminal until it is interrupted by something like Ctrl+C, so you can therefore hold a Terraform lock even though Terraform isn't currently actually doing anything, but the terraform process still exists to hold it.

I think you could emulate this today by making a throwaway change to your configuration, running terraform apply, and then leaving Terraform waiting for confirmation while you do something else; Terraform holds the lock while it awaits approval for the plan, so you can in principle use it as a weird way to grab a lock and then eventually just say "no" at the confirmation prompt to release the lock without changing anything.

With all of that said, it would of course not help very much with the "running multiple Terraform commands transactionally" idea because in that case you explicitly want the lock to outlive a particular terraform process, and for those other commands to somehow pick up the same lock rather than trying to create a new one (which would otherwise deadlock).

steeling commented 4 years ago

Hey @apparentlymart, thanks for the detailed response! Would you mind explaining how a separate process determines that the lock is currently being held? Is it implementation-specific depending on where the state is stored (i.e. one impl for azure blob store and another for GCS), or something more generic?

Ya, I don't think the running process would meet our needs, unfortunately. Also, instead of passing the lock from one process to another, I think we could model it like code, where I grab the lock and the other terraform actions do things without the lock (or even without knowledge that the lock is held, i.e. they supply the -lock=false flag). Consider the following golang pseudocode:

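var mu sync.Mutex

// Pseudocode: the terraform package, Reconcile, and Apply are illustrative, not a real API.
func reconcileAndApply() error {
  mu.Lock()
  defer mu.Unlock()
  // Reconcile runs with locking disabled, so it doesn't know the lock is held.
  diffs, err := terraform.Reconcile(false /* lock */)
  if err != nil {
    return err
  }
  if !diffs {
    return terraform.Apply(false /* lock */)
  }
  return nil
}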

Here's a thought on how this could be accomplished given the current locking mechanisms:

Every command that currently grabs the lock would do the following. Supplying -lock=false would skip* steps 1 & 2:

  1. Grab the lock via the running process (as is currently done)
  2. Check a new field on the state, lock_status, to determine if it is locked asynchronously
  3. If not, continue with the operation.
  4. Release the lock (same as is currently done)

The lock/unlock command would be a special case:

  1. Grab the lock via the running process (as is currently done)
  2. Check a new field on the state, lock_status, to determine if it is locked asynchronously
  3. Set the lock_status to lock/unlock (return error if dest lock_status == src lock_status)
  4. Release the lock (same as is currently done)

*Note: on skipping steps 1 & 2, it might make more sense to skip just step 2. I find it hard to imagine a scenario where one would want commands to race with each other, although maybe I'm just not thinking hard enough :)

Eventually lock_status could also be moved to each individual terraform resource

Thanks in advance for entertaining this discussion!

steeling commented 4 years ago

Looking at some of the implementations, I can answer my own question above: the locking mechanism is implementation-specific. Following up on that, the above pseudocode is only necessary for those specific implementations, while the rest (the majority?) can just grab the lock and return.

steeling commented 3 years ago

@apparentlymart, looking into this more, it seems like terraform is doing something more complicated than simply grabbing the resource lock, ie: on an azurerm backend, if I grab the blob lease, and do tf plan -lock=false, I get:

Error: Error loading state: failed to lock azure state: 2 errors occurred:

This seems like a pretty basic feature, both to ensure transactionality between multiple requests and to give on-call ops a simple mechanism to prevent automation from rolling forward.

apparentlymart commented 3 years ago

Hi @steeling,

The backends all have pretty different implementations of the locking interfaces with different requirements and tradeoffs, and all of them have been through many iterations to get their behavior right against the quirks of each service, so unfortunately I don't think we can consider any change to the locking model to be a "basic feature". That doesn't mean it isn't a valid feature request, but it does mean it will require a considerable design effort and is something we're unlikely to tackle in the near future due to our focus being elsewhere.

steeling commented 3 years ago

Hi @apparentlymart, thanks for the reply! That's very reasonable :)

Submitted https://github.com/hashicorp/terraform/pull/26572 to see if I can poke around in this space.

Also submitted https://github.com/hashicorp/terraform/pull/26561 to fix azure force-unlock, which doesn't work in non-default workspaces