hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

S3 backend state locking has no timeout and/or too pessimistic #15429

Open elodani opened 7 years ago

elodani commented 7 years ago

Terraform Version 0.9.8

If you have more than a few resources stored in your remote state (I have 2 Elastic Beanstalk and some IAM resources, 12 in total), terraform plan takes long enough to be easily interrupted. If you also have an S3 backend with locks, you can lock yourself out by running terraform plan and having it interrupted by anything.

Then you cannot access the remote statefile anymore, because the lock is not released. I waited 30+ minutes for some kind of timeout to kick in, but it is either longer than that, or the lock is permanent.

I could only get it to work again by issuing terraform destroy -lock=false, then destroying and recreating the S3 bucket and DynamoDB table that my backend uses.

It happened because my session was interrupted by the system, but it is reproducible with Ctrl+C.

More optimistic locking, or a reasonable timeout, would be better. This is especially annoying if you automate terraform calls: you cannot know whether you are just unlucky and someone else always holds the lock, or whether the lock is broken and it is safe to use -lock=false.

Terraform S3 Backend Config

terraform {
  backend "s3" {
    bucket         = "terraremotebucketencrypted"
    key            = "terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "TerraLockerFromTF"
    encrypt        = "true"
    kms_key_id     = "my kms key arn"
  }
}

provider "aws" {
  region = "eu-central-1"
}

Expected Behavior

After some time, the lock is released (especially because it was only a plan operation).

Actual Behavior

When I tried to use the remote state again, I got an error, although nobody else was using the file:

Error locking state: Error acquiring the state lock: ConditionalCheckFailedException: The conditional request failed
    status code: 400, request id: <req ID>
Lock Info:
  ID:        <lock ID>
  Path:      terraremotebucketencrypted/terraform.tfstate
  Operation: OperationTypePlan
  Who:       <MYSELF>
  Version:   0.9.8
  Created:   <datetime>
  Info:      

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.

Steps to Reproduce

  1. Initialize Terraform with an S3 backend that uses locks
  2. Create some resources and apply
  3. Run terraform plan
  4. Interrupt the plan operation while it is using the remote state (Ctrl+C works)
  5. Run plan again to see the error

jbardin commented 7 years ago

Hi @elodani,

Sorry you're having an issue with this. The design of the state lock is meant to leave the lock in place when possible in the case of an abnormal exit. If you hit Ctrl+C only once, the process should have exited normally and cleaned up the lock. If you hit it twice, it forces the process to quit immediately and no cleanup can be done.

If the process didn't exit normally, there is no guarantee that the saved state is correct, and manual intervention may be required. Having the lock present gives a little more safety around accessing a corrupted state.

I'm going to keep this open as a feature request, since the implementation is possible, and not completely out of line with the semantics of other backends like consul.

In the meantime, there is a terraform force-unlock command to handle the situation for most backends, where you pass in the lock ID from the error message to manually remove a lock.
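
For automated runs, the manual step above could be sketched roughly as follows. This is only an illustration: `extract_lock_id` and `force_unlock_command` are hypothetical helper names, the parsing assumes the "Lock Info" layout shown in the error message earlier in this thread, and the sample lock ID is made up.

```python
import re

# Matches the "ID:" line inside the "Lock Info:" block of the error output.
# Assumption: the lock ID contains no whitespace (it looks like a UUID).
_LOCK_ID_RE = re.compile(r"^[ \t]*ID:[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def extract_lock_id(error_text):
    """Return the lock ID found in Terraform's lock error output, or None."""
    match = _LOCK_ID_RE.search(error_text)
    return match.group(1) if match else None


def force_unlock_command(lock_id):
    """Build the argv for `terraform force-unlock <lock ID>` (does not run it)."""
    return ["terraform", "force-unlock", lock_id]


if __name__ == "__main__":
    sample = (
        "Error locking state: Error acquiring the state lock\n"
        "    status code: 400, request id: abc123\n"
        "Lock Info:\n"
        "  ID:        9db590f1-b6fe-c5f2-2678-8804f089deba\n"
        "  Path:      terraremotebucketencrypted/terraform.tfstate\n"
    )
    lock_id = extract_lock_id(sample)
    if lock_id is not None:
        print(" ".join(force_unlock_command(lock_id)))
```

The sketch only builds the command and leaves running it to the operator, since removing a lock is only safe once you are sure no other process still holds it.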

frncmx commented 7 years ago

I think a timeout might be tricky to implement since, as mentioned above, there is no guarantee that the state is consistent.

An easy win, and one I would love very much, is to lock the state file during plan only for the time it takes to acquire a state snapshot. (I think Terraform already uses some cache.)

Anyway, I also agree it's just an improvement. We only had this problem once while playing around with the tool.
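
The snapshot-only locking idea above could look roughly like this. It is a toy model under loud assumptions: `StateStore` is a hypothetical in-memory stand-in for the remote state, a `threading.Lock` stands in for the DynamoDB lock, and the "plan" is just a set difference. It is not how Terraform is implemented.

```python
import copy
import threading


class StateStore:
    """Hypothetical in-memory stand-in for a remote state and its lock."""

    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def snapshot(self):
        # Hold the lock only for the duration of the copy.
        with self._lock:
            return copy.deepcopy(self._state)

    def is_locked(self):
        return self._lock.locked()


def plan(store, desired):
    """Diff desired resource names against a private snapshot of the state."""
    snap = store.snapshot()
    # The lock is already released here, so an interrupted diff
    # cannot leave a lock behind.
    return {
        "create": sorted(set(desired) - set(snap)),
        "destroy": sorted(set(snap) - set(desired)),
        "locked_during_diff": store.is_locked(),
    }


if __name__ == "__main__":
    store = StateStore({"aws_iam_role.a": {}, "aws_iam_role.b": {}})
    print(plan(store, ["aws_iam_role.b", "aws_iam_role.c"]))
```

The trade-off is that the plan may be computed against a snapshot that becomes outdated while the diff runs, which a plan-long lock would prevent.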

imunhatep commented 1 month ago

I'm not sure what the issue is with setting a lock timeout and removing the lock on the next run if the time has expired yet the lock is still there?!
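
A minimal sketch of that proposal, assuming the lock records a creation time (as the "Created" field in the error output suggests) and that a timeout is configured somewhere. `lock_is_stale` and `next_action` are hypothetical names, and nothing in this thread says the S3 backend behaves this way.

```python
from datetime import datetime, timedelta, timezone


def lock_is_stale(created, timeout, now=None):
    """Return True when the lock is older than the allowed timeout."""
    now = now or datetime.now(timezone.utc)
    return now - created > timeout


def next_action(lock_created, timeout, now=None):
    """Decide what a new run would do under the proposed expiry rule.

    lock_created is None when no lock exists.
    """
    if lock_created is None:
        return "acquire"
    if lock_is_stale(lock_created, timeout, now):
        return "remove-stale-lock-then-acquire"
    return "fail-lock-held"


if __name__ == "__main__":
    now = datetime(2017, 6, 28, 12, 0, tzinfo=timezone.utc)
    created = now - timedelta(minutes=45)
    print(next_action(created, timedelta(minutes=30), now))
```

As noted earlier in the thread, an expired lock does not prove that the saved state is consistent, so expiry answers "may I take the lock?" but not "is the state safe to use?".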