hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

0.13.0 upgrade from 0.12.28 now seems to require some state locking parameters I can't find any information on. #25818

Open jwshive opened 4 years ago

jwshive commented 4 years ago

Terraform Version

Terraform v0.13.0

Terraform Configuration Files

I am running everything through Azure DevOps pipelines and using a replace-tokens step, so the `#{...}#` placeholders below are substituted at run time.

terraform {
  backend "#{cloud-provider}#" {
    resource_group_name  = "#{agency}#-#{department}#-#{environment}#-tfrg"
    storage_account_name = "#{agency}##{department}##{application}##{environment}#tfsa"
    container_name       = "#{application}#-terraform"
    key                  = "#{tfstatestoragekey}#"
  }
}

Debug Output

Crash Output

Expected Behavior

terraform plan should have run without issue.

Actual Behavior

2020-08-12T12:21:13.1149001Z [command]/opt/hostedtoolcache/terraform/0.13.0/x64/terraform plan
2020-08-12T12:21:14.5272154Z 
2020-08-12T12:21:14.5276813Z Error: Error locking state: Error acquiring the state lock: 2 errors occurred:
2020-08-12T12:21:14.5277758Z    * state blob is already locked
2020-08-12T12:21:14.5278392Z    * blob metadata "terraformlockid" was empty
2020-08-12T12:21:14.5278780Z 
2020-08-12T12:21:14.5278995Z 
2020-08-12T12:21:14.5279172Z 
2020-08-12T12:21:14.5279579Z Terraform acquires a state lock to protect the state from being written
2020-08-12T12:21:14.5280226Z by multiple users at the same time. Please resolve the issue above and try
2020-08-12T12:21:14.5281317Z again. For most commands, you can disable locking with the "-lock=false"
2020-08-12T12:21:14.5282137Z flag, but this is not recommended.
2020-08-12T12:21:14.5282432Z 
2020-08-12T12:21:14.5282906Z 
2020-08-12T12:21:14.5420354Z ##[error]Error: The process '/opt/hostedtoolcache/terraform/0.13.0/x64/terraform' failed with exit code 1
2020-08-12T12:21:14.5440293Z ##[section]Finishing: Terraform Plan

Steps to Reproduce

  1. terraform init
  2. terraform validate
  3. terraform plan

Additional Context

This runs via an Azure DevOps pipeline. I see many links talking about state locking if your backend supports it, but I don't see any document telling me how to implement some sort of fix for this in my Terraform code or pipeline. Am I to manually break the lease every time I run code? That seems like more work than it should be. This same code ran yesterday on 0.12.28, and it runs again when I change the version back to 0.12.28.

References

brenak commented 4 years ago

I assume you are using Azure Storage Account here. Not that this will help with what caused the lock, but you can force the existing lock to be released with the following command:

az storage blob lease break -b FILE_NAME -c CONTAINER_NAME --account-name STORAGEACCOUNT_NAME --account-key ACCESS_KEY
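Before breaking the lease, it can be worth checking whether one actually exists. A sketch using the standard `az` CLI; the account, key, and blob names are placeholders (the container name is taken from the debug output later in this thread):

```shell
# Inspect the current lease on the state blob (all names are placeholders).
az storage blob show \
  --name terraform.tfstate \
  --container-name impact-terraform \
  --account-name STORAGEACCOUNT_NAME \
  --account-key ACCESS_KEY \
  --query "properties.lease" --output json

# If the lease status reports "locked", break it:
az storage blob lease break \
  --blob-name terraform.tfstate \
  --container-name impact-terraform \
  --account-name STORAGEACCOUNT_NAME \
  --account-key ACCESS_KEY
```

Note that breaking a lease held by a Terraform run that is still in progress defeats the purpose of state locking, so this should only be done for a lease known to be stale.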

jwshive commented 4 years ago

Thanks for the reply, I figured this would be the easiest solution.

I knew you could do it with another command execution, but I guess my bigger question is why is it so different between 0.12 and 0.13 and where in the TF files could I do this vs changing all my pipelines to add an additional step.
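For the pipeline case, one option (purely as a workaround, not a fix) is a pre-step that breaks a lease only when one is present. A sketch; the environment variable names are hypothetical and would come from pipeline variables:

```shell
#!/usr/bin/env sh
# Workaround sketch: clear a stale lease before terraform plan runs.
# STORAGE_ACCOUNT, CONTAINER, STATE_BLOB, and ACCOUNT_KEY are assumed
# to be supplied as pipeline variables.
STATUS=$(az storage blob show \
  --name "$STATE_BLOB" \
  --container-name "$CONTAINER" \
  --account-name "$STORAGE_ACCOUNT" \
  --account-key "$ACCOUNT_KEY" \
  --query "properties.lease.status" --output tsv)

# Only break the lease if the blob actually reports as locked.
if [ "$STATUS" = "locked" ]; then
  az storage blob lease break \
    --blob-name "$STATE_BLOB" \
    --container-name "$CONTAINER" \
    --account-name "$STORAGE_ACCOUNT" \
    --account-key "$ACCOUNT_KEY"
fi
```

The obvious caveat applies: if another pipeline run is legitimately holding the lock, this step would break it.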

brenak commented 4 years ago

I don't have good answers on that. You shouldn't have to do this every time. I've only ever encountered this locking issue when Terraform was in the middle of updating the state and it lost its connection, or my system crashed, leaving the state locked. It's really rare that this happens.

danieldreier commented 4 years ago

You should not need to deal with locking each time. The point of the lock is to prevent two Terraform runs from happening at once against the same state. Are you able to reproduce this outside of the Azure pipeline, on a local workstation?

jwshive commented 4 years ago

Thanks for your question. I did try this outside of azure pipelines and receive the same error. I've created my own quick TF file to test and I can reproduce my results.

This is the debug output I get from 0.13.0

2020/08/12 17:20:24 [DEBUG] Azure Backend Response for https://storageaccountname.blob.core.windows.net/impact-terraform/terraform.tfstate:
HTTP/1.1 200 OK
Content-Length: 43976
Accept-Ranges: bytes
Content-Type: application/json

Date: Wed, 12 Aug 2020 21:20:23 GMT
Etag: "0x8D83EDF202DA166"
Last-Modified: Wed, 12 Aug 2020 16:45:31 GMT
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
X-Ms-Access-Tier: Hot
X-Ms-Access-Tier-Inferred: true
X-Ms-Blob-Type: BlockBlob
X-Ms-Creation-Time: Mon, 10 Aug 2020 17:25:27 GMT
X-Ms-Lease-State: available
X-Ms-Lease-Status: locked
X-Ms-Request-Id: d918551a-a01e-00dc-6fee-70e661000000
X-Ms-Server-Encrypted: true
X-Ms-Version: 2018-11-09

Error: Error locking state: Error acquiring the state lock: 2 errors occurred:
        * state blob is already locked
        * blob metadata "terraformlockid" was empty

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.

But when I run the same code with my terraform 0.12.29 binary, it blows right past all of that and starts the actual plan. I see where the output says

X-Ms-Lease-State: available
X-Ms-Lease-Status: locked

and I figure that must be what terraform is now reading, but this works every time with the previous version.

Interestingly enough, if I do not use my working remote backend in Azure and instead create a brand-new remote backend from scratch, this works without issue. I could have missed it, but I didn't see any instructions on patching remote backends for an upgrade.

When I try to use the command above to break the lease, I get an error: there is currently no lease on the blob.

What I have found now is that when I create a storage account WITHOUT hierarchical namespace, the blob's status once the write is finished is available and unlocked. When I create the storage account WITH hierarchical namespace, the default state seems to be locked and available. The first run against a new state file always works, but all the jobs after that fail. This seems to be an issue with how hierarchical namespaces interact with storage account lease states.
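Given that observation, a workaround is to create the state storage account with hierarchical namespace explicitly disabled. A sketch with placeholder names; `--hns` is the Azure CLI flag controlling hierarchical namespace (Data Lake Storage Gen2) at account creation:

```shell
# Create a storage account for Terraform state with hierarchical
# namespace disabled; account, group, and location are placeholders.
az storage account create \
  --name mytfstateaccount \
  --resource-group my-tf-rg \
  --location eastus \
  --sku Standard_LRS \
  --hns false

# Container to hold the state blob.
az storage container create \
  --name terraform \
  --account-name mytfstateaccount
```

Hierarchical namespace cannot be toggled on an existing account, so this only helps when provisioning a fresh backend.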


danieldreier commented 4 years ago

Do you have a way to check whether this happens exclusively with the azure state backend, or are you also seeing this with any other state backend? The AzureRM provider team maintains the state backend, and so I'm trying to triage which team needs to troubleshoot this. If it's common to all backends, it's a core issue, and if it's specific to that backend I'll send it to the azure team.

jwshive commented 4 years ago

We only use azure here so I don't have anything easy to test with aws. I don't know that AWS has a hierarchical namespace, that's just my unfamiliarity with their service.

Looking at some of my other storage accounts, I see blobs in there that are unlocked and available. It seems to just be something happening with my terraform state file where it's available but remains locked.

ajlancaster commented 4 years ago

I'm really glad I found this page. I've been having the same issue all week. I had thought it was because of the unique way this particular environment was set up, so I copied the code onto my own machine, which was running 0.12.24 at the time, and it worked fine. I then upgraded to 0.13.0, ran a terraform init, which was fine, then ran a plan and got this exact issue. As with other people in this thread, I viewed the lease state in Azure, and it was 'Available'. So I manually leased the blob, then released it; it was then in a 'Broken' state. If I then run plan/apply/destroy, it works without issue. There is definitely some issue between the new TF 0.13.0 binary and the Azure storage account. I'm really hopeful this gets fixed quickly.

jwshive commented 4 years ago

I am just poking around commits for 0.13.0 and ran across this one. I’m not 100% sure what it’s trying to do but it involves local and remote state unlocking.

https://github.com/hashicorp/terraform/commit/86e9ba3d659176cd7ea969434e37cb064f23bb43

n2qz commented 4 years ago

Also getting this with azurerm backend. It's a blocker right now for us to upgrade. Breaking the lease manually seems to help briefly but the issue recurs.

gdubya commented 4 years ago

Same problem here after importing local state to an azure storage account

n2qz commented 4 years ago

> What I have found now is that when I create a storage account WITHOUT Hierarchical namespace, the status of the blob once the write is finished is available and unlocked, when I create the storage account WITH Hierarchical namespace, the default state seems to be locked and available. The first run in a new state file always works, but all the jobs after that fail. Seems to be an issue with how the Hierarchical namespace works with storage accounts and lease states.

Thank you for this tip. I was able to move past this by creating a new storage account with hierarchical namespace disabled and migrating the state files to it before upgrading to 0.13.
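For anyone else taking this route, the migration can be as simple as copying the state blob between accounts and re-initializing. A sketch; all account names, container names, and keys are placeholders:

```shell
# Download the state blob from the old (HNS-enabled) account.
az storage blob download \
  --name terraform.tfstate \
  --container-name terraform \
  --account-name oldtfstateaccount \
  --account-key OLD_ACCESS_KEY \
  --file terraform.tfstate

# Upload it to the new (HNS-disabled) account.
az storage blob upload \
  --name terraform.tfstate \
  --container-name terraform \
  --account-name newtfstateaccount \
  --account-key NEW_ACCESS_KEY \
  --file terraform.tfstate

# Update storage_account_name in the backend block, then re-initialize:
terraform init -reconfigure
```

This should be done while no Terraform runs are in flight, so the copied state is the current one.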

ChuckkNorris commented 2 months ago

Also seeing this issue on Terraform v1.9.5 with azurerm v3.0.2. I manually created a new storage account/container via the Azure CLI, configured the backend locally, and it never releases the lock after an operation completes.

For example: