Open orgads opened 2 years ago
Hi @orgads! Thanks for reporting this.
Unfortunately I think this request is somewhat against the principle of the "forced abort" functionality: the intent there is to drop everything and exit immediately, without blocking on anything, presumably because Terraform is already blocked on something after the first interrupt signal.
For example, if releasing the lock ended up taking more than, say, a few seconds then I expect we'd hear from folks that they want a "really really force exit" signal they can use to abort that request.
Given that, I'd like to learn a little more about why you are sending Terraform two SIGINTs, rather than sending one and waiting for it to exit. Is there something else Terraform is blocking on that you would rather it didn't, and so you are aiming to cancel only the specific long-running thing that's getting in your way?
I see your point.
Like I wrote, at the very least I'd expect it not to acquire the lock if aborted before this stage. The current behavior is that even if I abort at a very early stage, the lock is still acquired.
Our Terraform deployment is quite large, and a full refresh takes about 3 minutes. Sometimes I mean to run with -target or with -refresh=false, but run a full apply by mistake. When I try to abort, it still refreshes everything before aborting, so I force-abort. But then the lock is stuck.
Thanks for the additional context, @orgads.
Based on this additional information, it seems like the root problem in your case is that an interrupt signal apparently didn't abort the refreshing process, and so Terraform still ran the refresh to completion (which takes three minutes in your case) before handling the interrupt.
In modern Terraform the refreshing process is just a part of the broader planning process, and so it's weird that you weren't able to interrupt the planning process using an interrupt signal. If you'd be willing, I'd like to try to figure out why interrupting the planning phase didn't work for you, rather than focusing on the "force abort" case, because it sounds like you would've been happy with the non-forced interrupt behavior if it had exited promptly, and in principle the planning phase should be fine to interrupt early because planning operations are usually free of externally-visible side-effects.
You also mentioned that force-abort doesn't stop Terraform from acquiring a lock even if it wasn't already holding one. I think that's because the interrupt handler belongs to a "running operation" (an internal concept used in Terraform's backend abstraction) and we only start operations while already holding the lock, and so if you manage to send Terraform a SIGINT in the narrow window before it starts an operation it will still acquire the lock and start the operation before checking for cancellation. We could potentially include an earlier check, but I wonder how long (in real time) that window is and therefore how likely it is that a SIGINT would happen to arrive during that window in practice. Perhaps if we can figure out why gracefully canceling the plan phase didn't work for you then the handling of a force-abort before starting the planning operation would be less important?
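To sketch what I mean by an earlier check, here's a rough, hedged Go illustration; `backendLock`, `runOperation`, and `runWithEarlyCancelCheck` are made-up names for this example and are not the real Terraform internals:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// backendLock and runOperation are stand-ins for Terraform's real backend and
// operation plumbing; they are illustrative only, not actual APIs.
func backendLock(ctx context.Context) (unlock func(), err error) {
	fmt.Println("acquiring state lock")
	return func() { fmt.Println("releasing state lock") }, nil
}

func runOperation(ctx context.Context) error {
	fmt.Println("running operation")
	return nil
}

func runWithEarlyCancelCheck(ctx context.Context) error {
	// If the interrupt already arrived, bail out before touching the lock.
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}

	unlock, err := backendLock(ctx)
	if err != nil {
		return err
	}
	defer unlock() // release even if the operation is cancelled part-way

	return runOperation(ctx)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a SIGINT that landed before the operation started
	if err := runWithEarlyCancelCheck(ctx); errors.Is(err, context.Canceled) {
		fmt.Println("exited without acquiring the lock")
	}
}
```

The key point is that the cancellation check happens before the lock is touched, so a SIGINT that arrives in that early window never leaves behind a lock that needs force-unlocking.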
Yes, I agree with both points.
Regarding the timing, at least on Windows I have 3-4 seconds before the lock is acquired.
Any progress with this? Anything I can do to help?
Hi @orgads,
I've just got back from vacation so I'm afraid I'm catching back up on some things I was looking at before I went. Sorry for the delayed response.
Reading back what I said above, I think I was hoping to hear from you on what exactly Terraform was working on when you tried to interrupt it and found that it didn't respond. One way we could look at that is for you to run Terraform the way you normally would and interrupt it at a "realistic" time where you'd want it to be able to end quickly. If you can then share that output -- indicating where in the process you sent the interrupt signal -- I can see what was ongoing at that moment and what (if anything) started executing after the interrupt, and hopefully we can narrow down what's causing the problem.
In this case I think it'd be easiest to start without `TF_LOG=trace`, and see if Terraform's normal output is sufficient to narrow it down, since as you've seen the trace output is very chatty and I suspect that all of those details won't be super important for this first level of investigation.
(If your output is particularly large -- which it sounds like it might be if your configuration is also large -- I'd suggest sharing it via GitHub Gist and then sharing a link here, just as you did for the opening comment but without sending the second interrupt to force it to exit.)
I now see that there's a difference between very early interrupt and interrupt while refreshing.
If I interrupt before refresh is started (either before or after "Acquiring state lock"), it goes all the way until refresh ends, and then interrupts.
But if I interrupt during refresh, it terminates quickly.
Here are sample outputs:
ping
@apparentlymart Do you need anything else?
Hi @orgads. Thanks for sharing this information.
I notice a few different things about what you shared:
On this second point I'm not sure that we can do anything about it in Terraform CLI/Core, but it also doesn't really seem to be a big problem in this case, so perhaps nothing needs to change there. Terraform providers typically exclude themselves from receiving SIGINT directly and instead rely on Terraform CLI to tell them to stop when it receives SIGINT, so that everything shuts down in a controlled order. But since Azure CLI is a third-party program we probably can't control its handling of SIGINT, unless the Azure provider itself does something unusual like arranging for the Azure CLI process to belong to a separate process group.
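For illustration only, this is roughly what the process-group arrangement could look like in Go on Unix; it is not what the azurerm provider actually does, and the `az` invocation is just an example child process:

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	// Run a child process (Azure CLI here, purely as an example) in its own
	// process group so that a terminal Ctrl-C, which is delivered to the
	// foreground process group, does not reach the child directly; the
	// parent then decides when and how to stop it.
	cmd := exec.Command("az", "account", "show")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // Unix-only field
	_ = cmd.Run()
}
```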
The first issue does seem like a bug that we could fix in this codebase, though. I'm not sure yet exactly where the bug is and therefore not sure exactly how to solve it, but I suspect that the root cause is in the way the responsibility for stopping is split between Terraform CLI and Terraform Core, and so we might need to redesign that slightly so that there's some way for Terraform Core to notice that a cancellation arrived before it even started planning.
I have some random code bookmarks that a future person looking at this issue might find useful...
The Terraform Core `Stop` method behaves as a no-op if there isn't already an operation running, because it just does nothing if there's no "current context" saved.
This is problematic because it makes Terraform Core's handling of `Stop` timing-sensitive: there might not currently be anything to cancel, but an operation could start later which adds a fresh context that isn't cancelled. If we want to keep using `Stop` then we should perhaps change it so that the cancellation context gets created at instantiation, inside `terraform.NewContext`, and is shared across all operations, so that once `Stop` has been called all future operations can notice it and promptly terminate.
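Here's a hedged sketch of that "create the cancellation signal at instantiation" idea. The type and method names mirror the discussion above, but this is not the real Terraform Core implementation, just an illustration of the shape:

```go
package main

import (
	"context"
	"fmt"
)

// Context here is a toy stand-in for Terraform Core's context type.
type Context struct {
	runCtx context.Context    // shared by every operation started from this Context
	stop   context.CancelFunc // invoked by Stop
}

func NewContext() *Context {
	runCtx, stop := context.WithCancel(context.Background())
	return &Context{runCtx: runCtx, stop: stop}
}

// Stop is no longer a no-op when nothing is running: it cancels the shared
// context, so any operation started later observes the cancellation too.
func (c *Context) Stop() {
	c.stop()
}

func (c *Context) Apply() error {
	// Every operation checks the shared context, so a Stop that happened
	// before Apply was even called still terminates it promptly.
	select {
	case <-c.runCtx.Done():
		return c.runCtx.Err()
	default:
	}
	// ... the actual apply work would go here ...
	return nil
}

func main() {
	tf := NewContext()
	tf.Stop()               // the interrupt arrived before any operation started
	fmt.Println(tf.Apply()) // prints "context canceled" instead of doing work
}
```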
Alternatively, we could change the API for the Terraform Core operations like `Context.Apply` to take a context directly as their first argument, and then have Terraform CLI pass in the same context it's using itself to handle cancellation. This uses contexts in a way closer to how they are intended to be used; I think Terraform is designed the way it currently is because Terraform predates Go's `context.Context` concept, and so we retrofitted cancellation in a way that worked with how Terraform Core had originally been designed.
If we take this approach then Terraform CLI could potentially (if the timing works out such that the SIGINT arrives just as Terraform CLI is calling Terraform Core) pass an already-cancelled context into Terraform Core, which would then avoid the incorrect sequencing we seem to currently have in that case.
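As a rough illustration of that wiring (assuming a hypothetical Core entry point that takes a context directly; the real `Context.Apply` signature is different today):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
)

// apply is a stand-in for a Core entry point that accepts a context directly.
func apply(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		// If SIGINT landed before the CLI even called into Core, the
		// operation sees an already-cancelled context and returns at once.
		return err
	}
	// ... planning/applying work, checking ctx.Done() as it goes ...
	return nil
}

func main() {
	// The CLI derives one context from its own signal handling...
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	// ...and passes that same context into Core, so there is a single source
	// of truth for cancellation instead of a separate Stop() call.
	if err := apply(ctx); errors.Is(err, context.Canceled) {
		fmt.Println("apply cancelled before it started")
	}
}
```

With this shape the CLI's signal handling and Core's cancellation share a single `context.Context`, so the incorrect sequencing described above can't arise.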
I want to be up front with you that because this behavior only affects an edge case, and because it "fails safe" by running for longer than necessary rather than failing with data loss, this particular bug will probably not be prioritized as highly as some others, and so it may be some time before anyone on the Terraform team at HashiCorp can work on it further.
I think the next step here, once someone is free to look at this, would be to understand where exactly the improper sequencing problem is and try to write a test that demonstrates it, and then we'll have something to test against as we work on a revised implementation that doesn't have this problem.
Thanks again for reporting this!
Thanks a lot for the detailed explanation!
Terraform Version
Terraform Configuration Files
Debug Output
https://gist.github.com/orgads/4a9e79e173ab4deeed93b34248cf36ee
Expected Behavior
When Terraform is forcefully aborted (Ctrl-C twice), the lock should not be acquired if it hasn't been acquired yet, and should be released if it already has been.
I'm not sure if this is a core or azurerm provider issue; it feels like core to me.
Actual Behavior
The lock is acquired and not released, so force-unlock must be executed afterwards.
Steps to Reproduce
terraform init -backend-config=backend.tfvars (you need variables for the Azure backend)
terraform apply
References
#16652 (unsure if it's the same)