
tfstate file can be corrupted if operation cancelled #34266

Open MJohnson459 opened 10 months ago

MJohnson459 commented 10 months ago

Terraform Version

Terraform v1.6.4
on linux_amd64

Terraform Configuration Files

N/A

Debug Output

N/A

Expected Behavior

Note: I originally thought this was a cdktf bug, but it seems the issue relates to Terraform directly. The expected behaviour and reproduction steps below still use cdktf, although it shouldn't be necessary for reproducing the problem.


I am running cdktf destroy from a Python script using subprocess.run. The exact command is

import subprocess
subprocess.run(["cdktf", "destroy", "--auto-approve"], check=True)

When this is destroying instances, pressing Ctrl+C should stop the destroy and leave the terraform.<stack>.tfstate file with an accurate record of the remaining state.

Actual Behavior

The terraform.<stack>.tfstate file is empty.

Steps to Reproduce

Note: The instructions below are one way to induce the issue, but there are likely others. Using raw Terraform it should be possible to reproduce if the state is large and the destroy command is exited forcefully enough.


  1. Start an instance with a lot of resources, for example many files.
  2. Run cdktf destroy via python
    import subprocess
    subprocess.run(["cdktf", "destroy", "--auto-approve"], check=True)
  3. While the resources are being destroyed, press Ctrl+C.

It seems the Python subprocess call is required for the state to be emptied permanently; using a raw cdktf destroy, I noticed that the state file is briefly emptied and then recreated. This appears to depend on the size of the state. I can reliably reproduce the problem with a ~5 MB state file.

Additional Context

I believe this issue is related to the TODO comment here https://github.com/hashicorp/terraform/blob/1f9734619f953ecc7252d7b98a6d40d751b4ea1e/internal/states/statemgr/filesystem.go#L124C1-L129C18

With a large state file, an interrupt can happen between clearing the file and the updated state being completely written.

The solution is likely what the comment suggests: write the state to a temporary file and then "swap" it into place, which would ensure the state file is never left in a corrupted state.
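
For illustration, that pattern would look roughly like the Python sketch below (it is not Terraform's actual Go implementation, and it ignores how the rename would interact with the state file lock):

    import json
    import os
    import tempfile

    def write_state_atomically(path, state):
        # Write the new snapshot to a temp file, then atomically swap it in.
        dir_name = os.path.dirname(os.path.abspath(path))
        # Keep the temp file on the same filesystem so the final rename is atomic.
        fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".tfstate-tmp-")
        try:
            with os.fdopen(fd, "w") as tmp:
                json.dump(state, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
            # os.replace is atomic on POSIX and Windows: readers see either the old
            # snapshot or the new one, never an empty or half-written file.
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise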

References

I originally reported this under cdktf: https://github.com/hashicorp/terraform-cdk/issues/3269

apparentlymart commented 10 months ago

Thanks for reporting this, @MJohnson459.

When running outside of CDK for Terraform, Terraform CLI's behavior for SIGINT is essentially a state machine with three states:

  1. Normal operation: no interrupt has been received, and Terraform runs the operation to completion, persisting the state as usual.
  2. Graceful abort: after the first SIGINT, Terraform stops scheduling new work, waits for the operations already in flight to finish, and then persists the final state snapshot before exiting.
  3. Urgent abort: after a second SIGINT, Terraform exits as soon as possible without waiting, which can leave the state snapshot unwritten or only partially written.

With all of that said, I think a first step here would be to try to understand how CDK for Terraform is participating in this process. When you are using CDK for Terraform to wrap Terraform CLI, your Ctrl+C will cause SIGINT to be sent to the CDK for Terraform process. That signal might also be sent to Terraform CLI, if CDK for Terraform is starting Terraform CLI in the same process group.

One possible cause of the problem you observed here would be if CDK for Terraform is starting Terraform CLI in its same process group, therefore causing your Ctrl+C to be sent as SIGINT to both processes, but then CDK for Terraform sends its own separate SIGINT to Terraform CLI to try to command it to shut down, which would then nudge Terraform CLI into the "urgent abort" state.

I'm not familiar enough with the CDK for Terraform codebase to know if I've found a relevant piece of code, but I do see some logic for sending Ctrl+C to Terraform CLI which seems to achieve its goal by running Terraform CLI in a pseudoterminal (pty) and then sending the Ctrl+C character code to that pseudoterminal, which presumably causes something running in that terminal to send a SIGINT to Terraform CLI. Therefore my theory above does seem plausible, but I'd need to ask the CDK for Terraform team to confirm.


If that is indeed the problem, I think the fix for this would need to be in CDK for Terraform itself, making one of the following changes.

  1. Launch Terraform CLI in such a way that the calling shell's job control will translate Ctrl+C into a SIGINT only for the cdktf process and not also for the Terraform CLI process. Then CDK for Terraform can be responsible for sending its own SIGINT to Terraform CLI.
  2. Remove the abort logic that sends Ctrl+C to the pty and instead allow the normal shell job control to send SIGINT to both cdktf and Terraform CLI simultaneously, at which point Terraform CLI will enter its graceful abort state concurrently with cdktf doing any of its own shutdown actions.

The main difference between these two options, I think, is that the second one would preserve Terraform CLI's ability to enter its urgent abort state if the user types Ctrl+C twice -- since Terraform CLI would be reacting directly to the job control signal -- whereas the first one would make it CDK for Terraform's responsibility to decide whether to send a second SIGINT to Terraform CLI when Ctrl+C is pressed twice.

MJohnson459 commented 10 months ago

Thanks @apparentlymart

I tried and failed to do an strace to see what signals were going where but I think I found out what the trigger was in my case.

It seems that by default, Popen in Python 3.7+ will wait at most 0.25s for child processes to clean up after a SIGINT. Using subprocess.run exacerbates this by sending a kill immediately after that delay. I switched to using Popen and waiting manually, to ensure Terraform has time to clean up, and that fixed it for me.
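
For reference, the workaround looks roughly like this (the 300-second timeout is an arbitrary value for illustration):

    import subprocess

    # Launch cdktf with Popen instead of subprocess.run, so we control how long
    # the child is allowed to clean up after an interrupt.
    proc = subprocess.Popen(["cdktf", "destroy", "--auto-approve"])
    try:
        ret = proc.wait()
    except KeyboardInterrupt:
        # Ctrl+C already delivered SIGINT to the foreground process group
        # (including cdktf), so give the child time to finish its graceful
        # shutdown instead of killing it after ~0.25s as subprocess.run would.
        try:
            ret = proc.wait(timeout=300)
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort if it really hangs
            ret = proc.wait()
    if ret != 0:
        raise subprocess.CalledProcessError(ret, proc.args)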

That said, I still believe the corruption is a bug in Terraform: even when Terraform is forcefully shut down, it shouldn't be possible to corrupt the state. As a further example, when I used strace to launch cdktf it also corrupted the state. I suspect that a user following the Ctrl+C-twice path would also end up with a corrupt state file.

All in all, saving to an intermediate file and swapping seems a safer way to manage it and would avoid all of these problems. I'm not familiar with Go, but I am happy to try to implement this if you have some pointers.

apparentlymart commented 10 months ago

Hi @MJohnson459,

If I recall correctly, the filesystem backend is implemented the way it is because of how file locking works on Windows: Terraform needs to hold a lock on the state snapshot throughout, and that blocks any kind of renaming or deleting of the existing object on disk.

The local filesystem storage is intended primarily for trivial experimentation anyway, and we expect anyone using Terraform "for real" to be using one of the remote state backends, which (depending on the features of the underlying storage) bring additional safety such as the preservation of historical state versions.

Even with local state, there should typically be a separate terraform.tfstate.backup file capturing the snapshot of the previous run in case the new snapshot is somehow unacceptable, including this situation of it ending up empty.

If we could find a different design that would work across all platforms Terraform supports -- the primary supported platforms are Windows, macOS, and Linux, and the other best-effort-supported Unixes tend to follow from what we do for Linux and/or macOS -- then that would be great! I believe we did look for other options when originally implementing this and concluded that we could have either locks or atomic updates, and we decided that locks were more important. The locking behavior is now protected by the Terraform v1.x Compatibility Promises, so we cannot regress it -- but if we can find a way to achieve both together then of course that would be ideal.

MJohnson459 commented 10 months ago

Hi @apparentlymart

My use case is generally large but short-term deployments, which the local backend seems ideally suited to. Using cloud storage would add a fair bit of complexity, and I'm not sure we would get much benefit from it. Would you still recommend using cloud backends for jobs that only last a couple of hours with no collaboration?

Regarding the proposed change, it looks like it would be fairly straightforward, but there are definitely bits I don't understand. This comment does imply the intention was to swap the files in the future, but I guess things may have changed since then. https://github.com/hashicorp/terraform/blob/1f9734619f953ecc7252d7b98a6d40d751b4ea1e/internal/states/statemgr/filesystem.go#L124C1-L129C18

If I understand the code correctly, it:

  1. Creates a backup file and saves the current state to the backup file
  2. Cleans the current state file
  3. Checks if the state has changed and increments the serial if it has
  4. Saves the current state + updated serial to the state file

The question is really which bit takes a long time. Is there a reason that step 2 couldn't be delayed, or that step 4 couldn't write to an intermediate file first?

apparentlymart commented 10 months ago

I think the main constraint as I understand it is that the following cannot both be true at once:

  1. The file's identity (in whatever sense the OS tracks it for locking purposes) remains constant throughout the whole write operation.
  2. The observable content of the file is valid throughout the whole write operation.

The typical way to achieve 2 on Unix is to make a new file and then rename it over the old one, but that invalidates 1. As far as I remember, even that wasn't possible on Windows, where holding the lock forces the file's identity to stay constant.
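
To make the identity point concrete, a tiny Python example (on a Unix system) shows that renaming a new file over the old one changes which underlying file the path refers to:

    import os

    with open("state.json", "w") as f:
        f.write("{}")
    before = os.stat("state.json").st_ino  # identity of the original file

    # Write a replacement and rename it over the original (the atomic pattern).
    with open("state.json.tmp", "w") as f:
        f.write('{"serial": 2}')
    os.replace("state.json.tmp", "state.json")

    after = os.stat("state.json").st_ino
    print(before == after)  # False: the path now names a different file, so a
                            # lock held on the original file no longer covers it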

Currently Terraform is prioritizing 1 at the expense of 2, which in the worst case means that Terraform can potentially be interrupted while the content of the file is not valid.

The comment in the file is reflecting that we wanted to find a way to achieve both at once. If we can find a design that does so without breaking any existing guarantees (due to the 1.x Compatibility Promises) then we would like to adopt it, and the question would then just be how to mitigate the risk of making that change.

One interesting constraint in the mix here is that in particularly awkward cases the two systems competing to acquire a lock might not be running the same version of Terraform CLI, and in particular one might be running the current implementation, which expects the lock to be held on the same file containing the state data. This constraint has tended to be the thing that always gets in the way of my first idea in this area, which is to use a separate file (whose content is irrelevant) as the holder of the lock, so that the other files around it can then change identity without affecting the locked file.
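
For concreteness, that separate-lock-file idea would look roughly like the Python/Unix sketch below (using an advisory flock; the ".lock" filename is made up for illustration, and as noted above the real blocker is that existing Terraform versions expect the lock to be on the state file itself):

    import fcntl
    import os
    import tempfile

    def write_with_sidecar_lock(path, data: bytes):
        # Hold the lock on a separate, never-replaced file...
        lock = open(path + ".lock", "w")
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # ...so the data file itself is free to change identity via an
            # atomic rename, keeping its content valid at every instant.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
            lock.close()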

My brain is in a very different place than this right now and so I'm afraid I don't have any more context ready to share off the top of my head than what I've shared already, but if you have any specific design ideas then one of my colleagues on the team can hopefully participate in a more detailed discussion about them!

MJohnson459 commented 10 months ago

Thanks @apparentlymart that makes sense.

I did some quick and dirty profiling, using extra logging after every statement, to see where the time was actually being spent. It looks like there could be a good improvement just from moving the Truncate later.

Splitting the Write call into three parts, the times were roughly

Managing backup: ~7ms (after the first one)
<-- File truncated here -->
Checking state: ~134ms
Writing state: ~54ms
Total: ~195ms

The main thing is that the encoding in statefile.writeStateV4 takes ~50ms on my machine with a large-ish state. After the call to Truncate, this gets called three times (twice during the check, once for the real write), which means ~150ms of the ~188ms that the file is empty is spent just processing the state.

Moving the Truncate call to just before the final Write would reduce the "risky" time from ~188ms to ~54ms, or from 96% to 27% of the total. That seems like a reasonably small change for a big effect. It doesn't solve the problem, but it would make it much rarer. I can put up a PR for this if it seems worth doing.

In the ideal case, if we could move the Truncate to right before the final Write in writeStateV4, we could get that time down to ~3ms (~2%), but I can't see a simple way to do that while using the io.Writer interface. It would require splitting the function into "encode" and "write" steps, which would be a fairly large change.
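
Sketched in Python rather than Go, that "encode then write" split amounts to doing the slow serialization into memory first, so the file is only ever empty for one short write (f here stands for the already-open, already-locked state file):

    import json
    import os

    def write_state_in_place(f, state):
        # Do all of the slow encoding first, while the file still holds the
        # old snapshot.
        data = json.dumps(state).encode("utf-8")

        # The file is only empty between the truncate and the single write below.
        f.seek(0)
        f.truncate(0)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())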

Anyway, I'm going to look into using a remote backend as suggested which should avoid this for us at least.

apparentlymart commented 10 months ago

Thanks for that extra context, @MJohnson459!

I wonder if @jbardin has some further thoughts here, since I think he's the one that's spent most time in this part of the system in the past.

jbardin commented 10 months ago

I agree that moving the truncate call to after the marshal check would be a safe way to lessen the time spent with no state when there is a lot of data, though that's not really a fix for the problem.

One thing we might be able to do is create platform-specific methods for the Filesystem type, so that writeState on POSIX systems can do an atomic write, but we would need to do some research into how the locks interact. Linux locks are only advisory locally, but could be enforced by a remote system. NFSv4 was the notable use case before, but since this was all designed, CIFS mounts have added mandatory locks as of Linux 5.4.
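
As a very rough illustration (in Python rather than Go, and purely hypothetical), the split might look like this, with the POSIX branch doing an atomic replace and the Windows branch keeping the current lock-preserving in-place rewrite:

    import os

    def write_state(path, data: bytes, locked_file):
        if os.name == "posix":
            # Hypothetical POSIX branch: atomic replace, so the content at
            # `path` is valid at every instant.
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        else:
            # Hypothetical Windows branch: keep writing through the already
            # locked handle, preserving the file identity the lock depends on.
            locked_file.seek(0)
            locked_file.truncate(0)
            locked_file.write(data)
            locked_file.flush()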