deberon opened this issue 2 years ago
Per the contribution guide, I am intending to implement this change via PR.
Hi @deberon! Thanks for this proposal.
I've not thought through this all completely yet, so this is just an initial thought and I'd love to hear what others on our team think here too.
An important thing to consider here is that not all backends are able to sustain a lock without a running process actually "holding" the lock. For example, the local backend uses flock (or similar on other platforms) so the lock is released implicitly when the CLI process exits, and I believe the "consul" backend needs to hold open a TCP socket to a Consul server in order to sustain the lock.
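For illustration only (a generic sketch of the flock pattern, not Terraform's actual code), a lock taken this way exists only while the process holding the file descriptor is alive, so it cannot be left held after the CLI exits:

```sh
# Sketch: an flock-style lock is tied to the lifetime of the holding process.
(
  flock -n 9 || { echo "state is locked by another process" >&2; exit 1; }
  echo "lock held while this subshell runs..."
  sleep 5
) 9>/tmp/example.lock   # hypothetical lock file path
```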
One way we could address that is to allow backends to each individually opt in to supporting this sort of explicit locking, and thus it can be left unsupported (with an explicit error message) on backends that can't support it.
Another possibility would be to have a command you can run which stays running as the means to hold the lock, and then you release the lock by interrupting that process. That would then work for all backends in principle, but would still require a means like you proposed here for other commands running in the same directory to be able to use the same lock.
The backend locking semantics tend to be deceptively complex in spite of the relatively simple API, so I expect there are some other similar subtleties to consider here, but this was the one that came to my mind while initially thinking about this.
If the backend locking mechanisms are too complex to achieve this, one other approach I have been using to avoid the need for an explicit lock (in AWS) is IAM. I have found the following setup to be helpful. Note that these steps are for AWS, but something similar can be accomplished in other clouds (Azure/GCP, etc.). Create:
Now when the CI/CD pipeline runs, it uses the CI/CD principal with the assumed role to lock and access the bucket object. No other principal would be able to write to this S3 bucket object.
I am not sure how feasible this solution is for every use case, but it has worked well for me over the years, so I just wanted to share it.
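A rough sketch of what the CI/CD side of this looks like; the role ARN, session name, and jq parsing below are illustrative, not a prescribed setup:

```sh
# Hypothetical example: only the CI/CD principal is allowed to assume this
# role, and only the role's policy grants write access to the state object.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/terraform-state-writer \
  --role-session-name ci-terraform)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')

terraform init
terraform apply -auto-approve
```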
For some background, the original design intent of the state locking mechanism was only to guard the state data against concurrent modification (early on, users concurrently working with S3 without some sort of global orchestration mechanism would find themselves with the wrong state at times). The `-lock-timeout` option was only added after the fact, once we could be sure that it didn't impose any undue restrictions on backend maintenance. We purposely did not create a model which allowed persistent locks, not only because not all implementations could maintain such locks, but because it was outside the design goals of the Terraform CLI. Locking the state is only part of implementing a complete workflow in Terraform, hence the workflow tooling should also manage the various levels of synchronization.
Once we have a new interface designed for remote state storage, we can document a more precise contract for the locking mechanisms. Hopefully in the process we can simplify things a bit for implementors, though the semantics may change slightly in the process.
@apparentlymart I'm not proposing any changes to the existing locking mechanism. Instead I'm suggesting a process that writes an arbitrary lock id directly into the state (my proposal adds a new top-level key to the state object). So the state can be unlocked (from a backend perspective) while still allowing an external process to make a claim to the state. This claim can even be ignored by `terraform apply` by default, which would maintain existing functionality. My worry is that this might compromise some understanding of when and how changes are written to the state.
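To make the idea concrete, here is a hedged sketch of what such a key could look like; the key name `soft_lock` and its shape are purely illustrative and not part of any existing Terraform format:

```sh
# Illustration only: add a hypothetical top-level key to a copy of the state.
# Clients that don't know about the key would simply ignore it.
jq '. + {"soft_lock": {"id": "d2b1a6c0-example", "created_by": "ci-pipeline"}}' \
  terraform.tfstate > terraform.tfstate.with-soft-lock
```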
Thanks for taking a look at this!
If it's of any use to folks stumbling across this, here's a bash/zsh compatible interactive script to more easily unlock state. It could be adapted to run in CI/CD in an automated fashion with some pre-validation; YMMV, and use with due caution.
tf_unlock() {
  local LOCK_ID
  local ERROR_OUTPUT
  local LOCK_DETAILS

  # Capture both stdout and stderr
  ERROR_OUTPUT=$(terraform plan -json 2>&1)

  # Check if there's a lock error
  if [[ $ERROR_OUTPUT == *"Error acquiring the state lock"* ]]; then
    LOCK_DETAILS=$(echo "$ERROR_OUTPUT" | jq -r '.diagnostic.detail // empty' 2>/dev/null)
    if [[ -z "$LOCK_DETAILS" ]]; then
      LOCK_DETAILS=$(echo "$ERROR_OUTPUT" | grep -A10 "Lock Info:")
    fi
    LOCK_ID=$(echo "$LOCK_DETAILS" | awk '/ID:/ {print $2; exit}')

    echo "State is locked. Lock details:"
    echo "$LOCK_DETAILS"
    echo
    echo -n "Do you want to unlock this state? Type 'yes' to confirm: "
    read response
    if [[ "$response" == "yes" ]]; then
      echo "Attempting to unlock..."
      if terraform force-unlock --force "${LOCK_ID}"; then
        echo "Terraform state has been successfully unlocked!"
      else
        echo "Failed to unlock the state. Please check the error message above." >&2
        return 1
      fi
    else
      echo "Unlock cancelled."
      return 0
    fi
  elif [[ $ERROR_OUTPUT == *"Error:"* ]]; then
    echo "Error occurred while checking Terraform state:" >&2
    echo "$ERROR_OUTPUT" >&2
    return 1
  else
    echo "State is not locked. No action needed."
  fi
}
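For example, after saving the function to a file and sourcing it (file name and directory below are just placeholders):

```sh
# Run from the working directory of the locked configuration.
source tf_unlock.sh   # hypothetical file containing the function above
cd envs/production    # hypothetical working directory
tf_unlock
```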
Current Terraform Version
N/A
Use-cases
When designing deployment pipelines, it would be nice to have the ability to completely lock the state between a `plan` and an `apply` stage. That way no other modifications to the state can be made while a change is pending.

Attempted Solutions
I looked for ways to manually lock the state and didn't find anything apart from manually editing the state file.
Proposal
Having the ability to initiate a "soft" lock on the state would be very helpful logistically when designing CI/CD pipelines. The existing whole-workspace locking mechanism is sufficient; I am just proposing the ability to manually turn it on and off. Here is an example workflow; imagine a manual intervention step between the plan and apply stages:
plan stage
apply stage
failure catching stage
I would also propose the addition of the following `terraform state` subcommands:

`terraform state lock <lock_id>`
`terraform state unlock <lock_id>`
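To connect this to the workflow above, here is a hedged sketch of how a pipeline might use the proposed subcommands; they do not exist in Terraform today, and `uuidgen` is just one way to generate an id:

```sh
# Hypothetical CI/CD flow; `terraform state lock`/`unlock` are the proposed
# subcommands, not existing Terraform commands.
LOCK_ID=$(uuidgen)

# plan stage
terraform state lock "$LOCK_ID"
terraform plan -out=tfplan

# ... manual intervention / approval happens here ...

# apply stage
terraform apply tfplan
terraform state unlock "$LOCK_ID"

# failure catching stage: always release the soft lock, even if apply failed
terraform state unlock "$LOCK_ID" || true
```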
Based on my understanding, this might work with the following changes.
State file
Plan output
The output file generated by `terraform plan -out` could include an additional file called `lock_id`. The contents of this file would be the `lock_id` that is currently locking the state. The `apply` command could either process the `lock_id` from this file, or from a parameter passed at runtime: `terraform apply -soft-lock-id=<lock_id>`.
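A hedged sketch of how a pipeline could move the id between stages under this proposal; the `lock_id` archive member and the `-soft-lock-id` flag are proposed behaviour, not existing features, and reading the member with `unzip -p` relies on the plan file being a zip archive:

```sh
# Hypothetical pipeline glue for the proposal above.
terraform plan -out=tfplan              # proposed: archive would also contain a `lock_id` file
LOCK_ID=$(unzip -p tfplan lock_id)      # read the proposed member back out of the plan archive
terraform apply -soft-lock-id="$LOCK_ID" tfplan   # proposed flag
```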
I believe this would allow clients to completely ignore soft locks since having additional keys in the state file shouldn't cause any parsing problems (maybe? this is an assumption on my part) and the backend locking mechanism will still be in place for critical state locking. I believe this would make the feature opt-in and backwards compatible.
References