hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Soft state locking #30277

Open deberon opened 2 years ago

deberon commented 2 years ago

Current Terraform Version

N/A

Use-cases

When designing deployment pipelines, it would be nice to have the ability to completely lock the state between a plan and apply stage. That way no other modifications to the state can be made while a change is pending.

Attempted Solutions

I looked for ways to manually lock the state and didn't find anything apart from manually editing the state file.

Proposal

Having the ability to initiate a "soft" lock on the state would be very helpful logistically when designing CI/CD pipelines. The existing whole-workspace locking mechanism is sufficient, I am just proposing the ability to manually turn it on and off. Here is an example workflow, imagine a manual intervention step between the plan and apply stages:

plan stage

```shell
terraform plan -out plan.out -soft-lock-id <lock_id>
```

apply stage

```shell
terraform apply plan.out
```

failure catching stage

```shell
terraform state unlock <lock_id>
```

I would also propose adding `terraform state` subcommands for managing these locks, such as the `terraform state unlock <lock_id>` used above.

Based on my understanding, this might work with the following changes.

State file

```diff
 {
   "version": 4,
+  "soft_lock_id": "<lock_id>",
   "terraform_version": "1.1.2",
   "serial": 1,
   "lineage": "676fba34-18c2-25bc-b542-eafc3190dd35",
   "outputs": {
     "hi": {
       "value": "hi",
       "type": "string"
     }
   },
   "resources": []
 }
```

Plan output

The output file generated by `terraform plan -out` could include an additional file called `lock_id`. The contents of this file would be the lock id that is currently locking the state. The apply command could read the lock id either from this file or from a parameter passed at runtime: `terraform apply -soft-lock-id=<lock_id>`
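The apply-time behavior being proposed can be sketched in a few lines. This is a hedged illustration only: `soft_lock_id` and the lock id carried by the plan are the proposal's hypothetical names, and `check_soft_lock` is not a real Terraform function.

```python
# Hypothetical apply-time check under the proposal: if the state carries a
# soft lock, the caller must present the matching id (from the plan's
# lock_id file or the -soft-lock-id flag); otherwise apply proceeds as today.

def check_soft_lock(state, presented_lock_id):
    """Refuse to apply when a soft lock is held and the ids do not match."""
    held = state.get("soft_lock_id")
    if held is None:
        return  # no soft lock: behave exactly as today
    if presented_lock_id != held:
        raise RuntimeError(
            "state is soft-locked by %r; pass the matching lock id" % held
        )

state = {"version": 4, "serial": 1, "soft_lock_id": "pipeline-1234"}
check_soft_lock(state, "pipeline-1234")  # ok: the plan carries the right id
try:
    check_soft_lock(state, None)  # a caller unaware of the soft lock
except RuntimeError as err:
    print(err)
```

A client that never reads `soft_lock_id` simply skips this check, which is what would keep the feature opt-in.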

I believe this would allow clients to ignore soft locks entirely, since additional keys in the state file shouldn't cause any parsing problems (though that is an assumption on my part), and the backend locking mechanism would still be in place for critical state locking. I believe this would make the feature opt-in and backwards compatible.
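That assumption about extra keys can at least be illustrated in miniature. The sketch below uses plain Python `json` handling, not Terraform's actual Go decoder (whose tolerance for unknown keys is exactly the open question), to show that a reader consuming only known keys simply skips `soft_lock_id`:

```python
import json

# A state document with the proposed extra key, as in the example above.
state_json = """
{
  "version": 4,
  "soft_lock_id": "example-lock-id",
  "terraform_version": "1.1.2",
  "serial": 1,
  "lineage": "676fba34-18c2-25bc-b542-eafc3190dd35",
  "outputs": {"hi": {"value": "hi", "type": "string"}},
  "resources": []
}
"""

state = json.loads(state_json)

# A reader that only consumes the keys it knows about is unaffected by
# the additional "soft_lock_id" entry sitting alongside them.
known = {k: state[k] for k in ("version", "serial", "lineage", "outputs", "resources")}
print(known["version"])          # 4
print("soft_lock_id" in state)   # True, but a legacy reader never looks
```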

deberon commented 2 years ago

Per the contribution guide, I intend to implement this change via a PR.

apparentlymart commented 2 years ago

Hi @deberon! Thanks for this proposal.

I've not thought through this all completely yet, so this is just an initial thought and I'd love to hear what others on our team think here too.

An important thing to consider here is that not all backends are able to sustain a lock without a running process actually "holding" the lock. For example, the local backend uses flock (or similar on other platforms) so the lock is released implicitly when the CLI process exits, and I believe the "consul" backend needs to hold open a TCP socket to a Consul server in order to sustain the lock.
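The flock behavior described here can be demonstrated in miniature. This is a small Unix-only Python sketch using the stdlib `fcntl` module (nothing from Terraform itself): a child process takes an exclusive lock and exits without ever unlocking, and the parent then acquires the lock immediately, because the kernel released it when the child's file descriptor closed.

```python
import fcntl
import os
import subprocess
import sys
import tempfile
import textwrap

lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")

# Child: take an exclusive lock, then exit WITHOUT calling LOCK_UN.
child_src = textwrap.dedent(f"""
    import fcntl
    f = open({lock_path!r}, "w")
    fcntl.flock(f, fcntl.LOCK_EX)
    # the process exits here; the kernel drops the lock with the descriptor
""")
subprocess.run([sys.executable, "-c", child_src], check=True)

# Parent: the lock is free again even though nobody explicitly released it.
with open(lock_path, "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # succeeds immediately
    acquired = True
print("lock acquired after child exit:", acquired)
```

This is why a backend built on flock-style locks cannot sustain a lock across separate CLI invocations.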

One way we could address that is to allow backends to each individually opt in to supporting this sort of explicit locking, and thus it can be left unsupported (with an explicit error message) on backends that can't support it.

Another possibility would be to have a command you can run which stays running as the means to hold the lock, and then you release the lock by interrupting that process. That would then work for all backends in principle, but would still require a means like you proposed here for other commands running in the same directory to be able to use the same lock.
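That second idea can be sketched as well (again with Unix `flock` via Python's `fcntl` standing in for a real backend, and every name hypothetical): a helper process acquires the lock and keeps running to hold it, other clients see the lock as taken, and interrupting the helper releases it.

```python
import fcntl
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")

# Hypothetical "hold the lock until interrupted" helper process.
holder_src = textwrap.dedent(f"""
    import fcntl, signal
    f = open({lock_path!r}, "w")
    fcntl.flock(f, fcntl.LOCK_EX)
    print("held", flush=True)
    signal.pause()  # keep running (and keep the lock) until a signal arrives
""")
holder = subprocess.Popen([sys.executable, "-c", holder_src],
                          stdout=subprocess.PIPE, text=True)
assert holder.stdout.readline().strip() == "held"

# While the helper runs, another client cannot take the lock.
f = open(lock_path, "w")
try:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    held_elsewhere = False
except OSError:
    held_elsewhere = True

# "Release the lock by interrupting that process."
holder.send_signal(signal.SIGINT)
holder.wait()
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds
f.close()
print("lock was held by helper:", held_elsewhere)
```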

The backend locking semantics tend to be deceptively complex in spite of the relatively simple API, so I expect there are some other similar subtleties to consider here, but this was the one that came to my mind while initially thinking about this.

sameershah21 commented 2 years ago


If the backend locking mechanisms are too complex to change, one other way that I have been implementing to counter the need for a lock (in AWS) would be using IAM. I have found the following setup to be helpful. Note that these steps are for AWS, but something similar can be accomplished in other clouds (Azure, GCP, etc.). Create:

Now when CI/CD pipelines run, they use the CI/CD principal with the assumed role to lock and access the bucket object. No other principal would be able to write to this S3 bucket object.
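For illustration, a bucket policy along these lines could deny state-object writes to everyone except the CI/CD role. All ARNs and names below are placeholders, and `Deny` with `NotPrincipal` has subtleties around assumed-role sessions that are worth checking in the IAM documentation before relying on it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyStateWritesExceptCICD",
      "Effect": "Deny",
      "NotPrincipal": {"AWS": "arn:aws:iam::123456789012:role/cicd-terraform"},
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::example-tfstate-bucket/env/prod/terraform.tfstate"
    }
  ]
}
```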

I am not sure how feasible this solution is for every use case, but it has worked well for me over the years, so I just wanted to share it.

jbardin commented 2 years ago

For some background, the original design intent of the state locking mechanism was only to guard the state data against concurrent modification (early on, users concurrently working with S3 without some sort of global orchestration mechanism would find themselves with the wrong state at times). The -lock-timeout option was only added after the fact once we could be sure that it didn't impose any undue restrictions on backend maintenance. We purposely did not create a model which allowed persistent locks, not only because not all implementations could maintain such locks, but it was outside the design goals of the Terraform CLI. Locking the state is only part of implementing a complete workflow in Terraform, hence the workflow tooling should also manage the various levels of synchronization.

Once we have a new interface designed for remote state storage, we can document a more precise contract for the locking mechanisms. Hopefully in the process we can simplify things a bit for implementors, though the semantics may change slightly in the process.

deberon commented 2 years ago

@apparentlymart I'm not proposing any changes to the existing locking mechanism. Instead I'm suggesting a process that writes an arbitrary lock id directly into the state (my proposal adds a new top level key to the state object). So the state can be unlocked (from a backend perspective) while still allowing an external process to make a claim to the state. This claim can even be ignored by terraform apply by default, which would maintain existing functionality. My worry is that this might compromise some understanding of when and how changes are written to the state.
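That claim-without-blocking model can be sketched as pure JSON manipulation (all function names here are hypothetical; in practice an external tool would also have to pull and push the state through the configured backend):

```python
# Hypothetical external soft-lock tool under the proposal: it stamps or
# removes the proposed soft_lock_id key in the state document, while the
# backend's own locking remains untouched.

def soft_lock(state, lock_id):
    """Claim the state; refuse if someone else already holds the claim."""
    current = state.get("soft_lock_id")
    if current not in (None, lock_id):
        raise RuntimeError("already soft-locked by %r" % current)
    locked = dict(state)
    locked["soft_lock_id"] = lock_id
    return locked

def soft_unlock(state, lock_id):
    """Release the claim, verifying that the id matches."""
    if state.get("soft_lock_id") != lock_id:
        raise RuntimeError("lock id does not match; refusing to unlock")
    unlocked = dict(state)
    del unlocked["soft_lock_id"]
    return unlocked

state = {"version": 4, "serial": 1, "resources": []}
locked = soft_lock(state, "pipeline-42")
print(locked["soft_lock_id"])                                # pipeline-42
print("soft_lock_id" in soft_unlock(locked, "pipeline-42"))  # False
```

An apply that never inspects `soft_lock_id` behaves exactly as it does today, which is the backwards-compatibility argument above.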

Thanks for taking a look at this!

n2N8Z commented 1 year ago

https://github.com/hashicorp/terraform/issues/17203