Open orgads opened 2 years ago
Hi @orgads! Thanks for reporting this.
Unfortunately I think this request is somewhat against the principle of the "forced abort" functionality: the intent there is to drop everything and exit immediately, without blocking on anything, presumably because Terraform is already blocked on something after the first interrupt signal.
For example, if releasing the lock ended up taking more than, say, a few seconds then I expect we'd hear from folks that they want a "really really force exit" signal they can use to abort that request.
Given that, I'd like to learn a little more about why you are sending Terraform two SIGINTs, rather than sending one and waiting for it to exit. Is there something else Terraform is blocking on that you would rather it didn't, and so you are aiming to cancel only the specific long-running thing that's getting in your way?
I see your point.
Like I wrote, at the very least I'd expect it not to acquire the lock if aborted before this stage. The current behavior is that even if I abort at a very early stage, the lock is still acquired.
Our Terraform deployment is quite large, and a full refresh takes about 3 minutes. Sometimes I mean to run with -target or with -refresh=false, but run a full apply by mistake. When I try to abort, it still refreshes everything before aborting, so I force-abort. But then the lock is stuck.
Thanks for the additional context, @orgads.
Based on this additional information, it seems like the root problem in your case is that an interrupt signal apparently didn't abort the refreshing process, and so Terraform still ran the refresh to completion (which takes three minutes in your case) before handling the interrupt.
In modern Terraform the refreshing process is just a part of the broader planning process, and so it's weird that you weren't able to interrupt the planning process using an interrupt signal. If you'd be willing, I'd like to try to figure out why interrupting the planning phase didn't work for you, rather than focusing on the "force abort" case, because it sounds like you would've been happy with the non-forced interrupt behavior if it had exited promptly, and in principle the planning phase should be fine to interrupt early because planning operations are usually free of externally-visible side-effects.
You also mentioned that force-abort doesn't stop Terraform from acquiring a lock even if it wasn't already holding one. I think that's because the interrupt handler belongs to a "running operation" (an internal concept used in Terraform's backend abstraction) and we only start operations while already holding the lock, and so if you manage to send Terraform a SIGINT in the narrow window before it starts an operation it will still acquire the lock and start the operation before checking for cancellation. We could potentially include an earlier check, but I wonder how long (in real time) that window is and therefore how likely it is that a SIGINT would happen to arrive during that window in practice. Perhaps if we can figure out why gracefully canceling the plan phase didn't work for you then the handling of a force-abort before starting the planning operation would be less important?
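To sketch what I mean by an earlier check, here's a rough, hedged Go illustration; `backendLock`, `runOperation`, and `runWithEarlyCancelCheck` are made-up names for this example and are not the real Terraform internals:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// backendLock and runOperation are stand-ins for Terraform's real backend and
// operation plumbing; they are illustrative only, not actual APIs.
func backendLock(ctx context.Context) (unlock func(), err error) {
	fmt.Println("acquiring state lock")
	return func() { fmt.Println("releasing state lock") }, nil
}

func runOperation(ctx context.Context) error {
	fmt.Println("running operation")
	return nil
}

func runWithEarlyCancelCheck(ctx context.Context) error {
	// If the interrupt already arrived, bail out before touching the lock.
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}

	unlock, err := backendLock(ctx)
	if err != nil {
		return err
	}
	defer unlock() // release even if the operation is cancelled part-way

	return runOperation(ctx)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a SIGINT that landed before the operation started
	if err := runWithEarlyCancelCheck(ctx); errors.Is(err, context.Canceled) {
		fmt.Println("exited without acquiring the lock")
	}
}
```

The key point is that the cancellation check happens before the lock is touched, so a SIGINT that arrives in that early window never leaves behind a lock that needs force-unlocking.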
Yes, I agree with both points.
Regarding the timing, at least on Windows I have 3-4 seconds before the lock is acquired.
Any progress with this? Anything I can do to help?
Hi @orgads,
I've just got back from vacation so I'm afraid I'm catching back up on some things I was looking at before I went. Sorry for the delayed response.
Reading back what I said above, I think I was hoping to hear from you on what exactly Terraform was working on when you tried to interrupt it and found that it didn't respond. One way we could look at that is for you to run Terraform the way you normally would and interrupt it at a "realistic" time where you'd want it to be able to end quickly. If you can then share that output -- indicating where in the process you sent the interrupt signal -- I can see what was ongoing at that moment and what (if anything) started executing after the interrupt, and hopefully we can narrow down what's causing the problem.
In this case I think it'd be easiest to start without `TF_LOG=trace`, and see if Terraform's normal output is sufficient to narrow it down, since as you've seen the trace output is very chatty and I suspect that all of those details won't be super important for this first level of investigation.
(If your output is particularly large -- which it sounds like it might be if your configuration is also large -- I'd suggest sharing it via GitHub Gist and then sharing a link here, just as you did for the opening comment but without sending the second interrupt to force it to exit.)
I now see that there's a difference between very early interrupt and interrupt while refreshing.
If I interrupt before refresh is started (either before or after "Acquiring state lock"), it goes all the way until refresh ends, and then interrupts.
But if I interrupt during refresh, it terminates quickly.
Here are sample outputs:
ping
@apparentlymart Do you need anything else?
Hi @orgads. Thanks for sharing this information.
I notice a few different things about what you shared:
On this second point I'm not sure that we can do anything about it in Terraform CLI/Core, but it also doesn't really seem to be a big problem in this case, so perhaps nothing needs to change there. Terraform providers typically exclude themselves from receiving SIGINT directly and instead rely on Terraform CLI to tell them to stop when it receives SIGINT, so that everything shuts down in a controlled order. But since Azure CLI is a third-party program we probably can't control its handling of SIGINT, unless the Azure provider itself does something unusual like arranging for the Azure CLI process to belong to a separate process group.
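For illustration only, this is roughly what the process-group arrangement could look like in Go on Unix; it is not what the azurerm provider actually does, and the `az` invocation is just an example child process:

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	// Run a child process (Azure CLI here, purely as an example) in its own
	// process group so that a terminal Ctrl-C, which is delivered to the
	// foreground process group, does not reach the child directly; the
	// parent then decides when and how to stop it.
	cmd := exec.Command("az", "account", "show")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // Unix-only field
	_ = cmd.Run()
}
```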
The first issue does seem like a bug that we could fix in this codebase, though. I'm not sure yet exactly where the bug is and therefore not sure exactly how to solve it, but I suspect that the root cause is in the way the responsibility for stopping is split between Terraform CLI and Terraform Core, and so we might need to redesign that slightly so that there's some way for Terraform Core to notice that a cancellation arrived before it even started planning.
I have some random code bookmarks that a future person looking at this issue might find useful...
The Terraform Core `Stop` method behaves as a no-op if there isn't already an operation running, because it just does nothing if there's no "current context" saved.
This is problematic because it makes Terraform Core's handling of `Stop` timing-sensitive: there might not currently be anything to cancel, but an operation could start later which adds a fresh context that isn't cancelled. If we want to keep using `Stop` then we should perhaps change it so that the cancellation context gets created at instantiation, inside `terraform.NewContext`, and is shared across all operations, so that once `Stop` has been called all future operations can notice it and promptly terminate.
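Here's a hedged sketch of that "create the cancellation signal at instantiation" idea. The type and method names mirror the discussion above, but this is not the real Terraform Core implementation, just an illustration of the shape:

```go
package main

import (
	"context"
	"fmt"
)

// Context here is a toy stand-in for Terraform Core's context type.
type Context struct {
	runCtx context.Context    // shared by every operation started from this Context
	stop   context.CancelFunc // invoked by Stop
}

func NewContext() *Context {
	runCtx, stop := context.WithCancel(context.Background())
	return &Context{runCtx: runCtx, stop: stop}
}

// Stop is no longer a no-op when nothing is running: it cancels the shared
// context, so any operation started later observes the cancellation too.
func (c *Context) Stop() {
	c.stop()
}

func (c *Context) Apply() error {
	// Every operation checks the shared context, so a Stop that happened
	// before Apply was even called still terminates it promptly.
	select {
	case <-c.runCtx.Done():
		return c.runCtx.Err()
	default:
	}
	// ... the actual apply work would go here ...
	return nil
}

func main() {
	tf := NewContext()
	tf.Stop()               // the interrupt arrived before any operation started
	fmt.Println(tf.Apply()) // prints "context canceled" instead of doing work
}
```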
Alternatively, we could change the API for the Terraform Core operations like `Context.Apply` to take a context directly as their first argument, and then have Terraform CLI pass in the same context it's using itself to handle cancellation. This uses contexts in a way closer to how they are intended to be used; I think Terraform is designed the way it currently is because Terraform predates Go's `context.Context` concept, and so we retrofitted cancellation in a way that worked with how Terraform Core had originally been designed.
If we take this approach then Terraform CLI could potentially (if the timing works out such that the SIGINT arrives just as Terraform CLI is calling Terraform Core) pass an already-cancelled context into Terraform Core, which would then avoid the incorrect sequencing we seem to currently have in that case.
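As a rough illustration of that wiring (assuming a hypothetical Core entry point that takes a context directly; the real `Context.Apply` signature is different today):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
)

// apply is a stand-in for a Core entry point that accepts a context directly.
func apply(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		// If SIGINT landed before the CLI even called into Core, the
		// operation sees an already-cancelled context and returns at once.
		return err
	}
	// ... planning/applying work, checking ctx.Done() as it goes ...
	return nil
}

func main() {
	// The CLI derives one context from its own signal handling...
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	// ...and passes that same context into Core, so there is a single source
	// of truth for cancellation instead of a separate Stop() call.
	if err := apply(ctx); errors.Is(err, context.Canceled) {
		fmt.Println("apply cancelled before it started")
	}
}
```

With this shape the CLI's signal handling and Core's cancellation share a single `context.Context`, so the incorrect sequencing described above can't arise.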
I want to be up front with you that because this behavior only affects an edge case, and because it "fails safe" by running for longer than necessary rather than failing with data loss, this particular bug will probably not be prioritized as highly as some others, and so it may be some time before anyone on the Terraform team at HashiCorp can work on it further.
I think the next step here, once someone is free to look at this, would be to understand where exactly the improper sequencing problem is and try to write a test that demonstrates it, and then we'll have something to test against as we work on a revised implementation that doesn't have this problem.
Thanks again for reporting this!
Thanks a lot for the detailed explanation!
Terraform Version
Terraform Configuration Files
Debug Output
https://gist.github.com/orgads/4a9e79e173ab4deeed93b34248cf36ee
Expected Behavior
When Terraform is forcefully aborted (Ctrl-C twice), the lock should not be acquired if it hasn't been acquired yet, and should be released if it already has been.
I'm not sure if this is a core or azurerm provider issue; it feels like core to me.
Actual Behavior
The lock is acquired and not released, so force-unlock must be executed afterwards.
Steps to Reproduce
terraform init -backend-config=backend.tfvars (you need variables for the Azure backend)
terraform apply
References
#16652 (unsure if it's the same)