hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

In concurrent scenarios, lock with gcs backend fails with "Error acquiring the state lock" #30149

Open oprudkyi opened 2 years ago

oprudkyi commented 2 years ago

In CI/CD scenarios, several processes sometimes try to obtain the same lock concurrently, and one of them randomly fails. Restarting the failed process fixes the error.

Terraform Version

Terraform v1.1.0
on linux_amd64
+ provider registry.terraform.io/hashicorp/google v3.90.1
+ provider registry.terraform.io/hashicorp/google-beta v3.90.1

Terraform Configuration Files

terraform {
  backend "gcs" {
    bucket = "some-bucket"
    prefix = "some-prefix"
  }
}

Expected Behavior

Lock obtained

Actual Behavior

Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Error message: 2 errors occurred:
    * writing
"gs://some-bucket/some-prefix/default.tflock"
failed: googleapi: Error 412: At least one of the pre-conditions you
specified did not hold., conditionNotMet
    * storage: object doesn't exist

Steps to Reproduce

terraform apply -auto-approve -lock-timeout=30m -no-color

Additional Context

Up to 20 processes may run apply against the same lock file and GCS backend.

apparentlymart commented 2 years ago

Hi @oprudkyi! Thanks for sharing this bug report.

I just want to confirm that I'm understanding what you were expecting, and what you actually observed.

You used -lock-timeout=30m here, so I assume you were intending for Terraform to keep retrying to obtain the lock for up to 30 minutes if it is already held.

But I think you are saying that sometimes (with no discernible pattern) Terraform just fails immediately with this error, without waiting for the 30 minute timeout.

Is that a correct understanding of what you reported here? Thanks!

oprudkyi commented 2 years ago

Hi @apparentlymart, yes, you are right. Instead of waiting 30 minutes, it crashes in 30 seconds.

jinlinux commented 2 years ago

Hi,

are there any updates on this? I have run into the same issue.

I am guessing it is a race condition due to GCS eventual consistency?
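One possible reading of the two-part error above, offered as an assumption and not a confirmed diagnosis: the lock is taken by creating default.tflock under a create-only precondition, so a rejected write yields Error 412. If the current holder then deletes the lock before the failing process reads it back for the holder's details, that read yields "object doesn't exist", and the combined error may not be treated as a retryable "lock is held" condition. The sketch below imitates that sequence on a local filesystem using bash's noclobber option. try_lock and RELEASE_BETWEEN are hypothetical names, and none of this is Terraform's actual code.

```shell
# Local-filesystem stand-in for the suspected race (bash). NOT Terraform's code.
# try_lock and RELEASE_BETWEEN are hypothetical names used only in this sketch.
try_lock() {
  local lock="$1" who="$2" info
  # noclobber makes '>' fail when the file exists, like a create-only precondition.
  if ( set -o noclobber; echo "$who" > "$lock" ) 2>/dev/null; then
    echo "acquired by $who"
    return 0
  fi
  # Simulate the holder releasing the lock right after our write was rejected.
  if [ "${RELEASE_BETWEEN:-0}" = "1" ]; then rm -f "$lock"; fi
  # Reading the holder's details is what would let a caller wait and retry.
  if info="$(cat "$lock" 2>/dev/null)"; then
    echo "held by $info"
    return 1
  fi
  # Both steps failed: the write was rejected and the lock info is gone.
  echo "write rejected, then lock info not found"
  return 2
}
```

Calling try_lock twice on the same path gives the ordinary "held by" outcome (status 1). Setting RELEASE_BETWEEN=1 on the second call gives status 2, which corresponds to the pair of errors in the report under the assumption stated above.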

vfiset commented 2 years ago

Hi, we're facing this issue a lot lately: Terraform does not respect the lock-timeout and fails instantly with the message in OP's Actual Behavior section.

Has anyone found a workaround or steps that could be implemented to alleviate the issue?

PhillyWebGuy commented 1 year ago

Does anyone have a solution? This is a problem with GCP still.

oprudkyi commented 1 year ago

@PhillyWebGuy I rerun the CI/CD pipeline manually :(
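The manual rerun could in principle be automated with a small wrapper. What follows is a minimal sketch, not an official workaround: retry_on_lock_error, RETRY_MAX, and RETRY_DELAY are hypothetical names invented here, and the only string taken from this thread is the error text "Error acquiring the state lock". It assumes that a run which failed to acquire the lock has not changed anything, so rerunning it is safe.

```shell
# Hypothetical helper (bash): rerun a command only when it failed with the lock error.
# RETRY_MAX and RETRY_DELAY are made-up knobs for this sketch, not Terraform settings.
retry_on_lock_error() {
  local max="${RETRY_MAX:-5}" delay="${RETRY_DELAY:-10}" attempt=1 status log
  log="$(mktemp)"
  while :; do
    # Capture output so it can be inspected, then replay it for the CI log.
    if "$@" >"$log" 2>&1; then status=0; else status=$?; fi
    cat "$log"
    if [ "$status" -eq 0 ]; then break; fi
    # Give up on unrelated failures, or once the attempt budget is spent.
    if ! grep -q "Error acquiring the state lock" "$log"; then break; fi
    if [ "$attempt" -ge "$max" ]; then break; fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  rm -f "$log"
  # Return the exit status of the last attempt.
  return "$status"
}
```

In a pipeline this might be used as `retry_on_lock_error terraform apply -auto-approve -lock-timeout=30m -no-color`. Output appears only after each attempt finishes, and any failure that does not contain the lock error is passed through without a retry.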

PhillyWebGuy commented 1 year ago

If I try to run manually/locally, terraform init gives me this:

Initializing the backend...
╷
│ Error: storage.NewClient() failed: dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

So I change my backend.tf to look like this:

terraform {
  backend "gcs" {
    bucket = "bucket-xxx-tfstate"
    prefix = "terraform/state"
    credentials = "my-credentials-file.json" #<-- Added this
  }
}

Once I add that credentials value, it works. But that doesn't really solve the automated GitHub Actions solution I'd like to employ. There, the terraform init command does not fail, and it actually creates a default.tfstate file but not the default.tflock file.

I don't know if this is the problem other people are having, but to summarize:

My GitHub Actions workflow.yaml file:

name: 'Terraform CI'

on:
  push:
    branches:
    - develop
  pull_request:

jobs:
  terraform:
    name: 'Terraform'
    runs-on: ubuntu-latest

    steps:
    - name: Checkout
      uses: actions/checkout@v2

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v1

    - name: Terraform Init
      run: terraform init
      env:
        GOOGLE_CREDENTIALS: ${{secrets.GOOGLE_CREDENTIALS}}

    - name: Terraform Format
      run: terraform fmt -check

    - name: Terraform Plan
      run: terraform plan
      env:
        GOOGLE_CREDENTIALS: ${{secrets.GOOGLE_CREDENTIALS}}

    - name: Terraform Apply
      run: terraform apply -auto-approve -lock-timeout=5m
      env:
        GOOGLE_CREDENTIALS: ${{secrets.GOOGLE_CREDENTIALS}}

My main.tf:

resource "google_storage_bucket" "xxxx_0_logs" {
  name          = "xxx-0-logs"
  force_destroy = true
  location      = "US"
  storage_class = "STANDARD"
  versioning {
    enabled = true
  }
}

oprudkyi commented 1 year ago

Just a reminder: we experience this error daily, 3-5 times per run of 20-40 concurrent Terraform processes. Our pipeline looks something like this

barzakov commented 1 year ago

I have the same issue. Please note that the bucket was empty before the run.

Error loading state: 2 errors occurred:

* writing "gs://XXXXXXXXXXX/default.tflock" failed: googleapi: Error 412: At least one of the pre-conditions you specified did not hold., conditionNotMet

* storage: object doesn't exist

duxbuse commented 1 year ago

Wouldn't the fix be to store the lock file at the prefix path? That way, multiple state files that all live in the same bucket aren't forced to use the same lock file.

oprudkyi commented 1 year ago

@duxbuse No. It would mean disabling locking altogether, with dire consequences.

barzakov commented 1 year ago

Please note that I run this setup almost every day, and I get this error only sometimes.

Env: terragrunt version 0.35.10, terraform version 1.1.3

Output:

Group 1

Group 2

=============== cut ===============

My current error. It actually dies on the first terraform run.

╷
│ Error: Error acquiring the state lock
│
│ Error message: 2 errors occurred:
│ writing
│ "gs://buket_name/bucket_prefix/dir1/terraform1/default.tflock"
│ failed: googleapi: Error 412: At least one of the pre-conditions you
│ specified did not hold., conditionNotMet
│ storage: object doesn't exist
│
│
│
│ Terraform acquires a state lock to protect the state from being written
│ by multiple users at the same time. Please resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended.

================= cut ========================

Terraform has been successfully initialized!
╷
│ Error: Error acquiring the state lock
│
│ Error message: writing
│ "gs://buket_name/bucket_prefix/dir1/terraform1/default.tflock"
│
│ failed: googleapi: Error 412: At least one of the pre-conditions you
│ specified did not hold., conditionNotMet
│ Lock Info:
│ ID: XXXXXXXXXXXXX
│ Path: gs://buket_name/bucket_prefix/dir1/dir2/default.tflock
│ Operation: OperationTypePlan
│ Who: my_server_name
│ Version: 1.1.3
│ Created: 2023-07-03 20:15:46.496329978 +0000 UTC
│ Info:
│
│
│
│ Terraform acquires a state lock to protect the state from being written
│ by multiple users at the same time. Please resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended.
╵

oprudkyi commented 1 month ago

Fixed in another tool. Closing this now as irrelevant.

florianmutter commented 3 weeks ago

Could you link to where or how this was fixed? We ran into the same issue today.

oprudkyi commented 3 weeks ago

@florianmutter https://github.com/opentofu/opentofu/commit/6ec06c86f54832a616d5312dd7323deda2d6eabc

florianmutter commented 3 weeks ago

With Terraform, this still happens to us. Maybe we can reopen the issue here.

crw commented 3 weeks ago

With apologies to @oprudkyi, I agree it would make sense to leave the issue open here. It is possible to re-report it as a new issue, but there is enough history in this issue to make it more desirable to simply keep this issue open until the GCS team works on it. Thanks!