mitchellh / terraform-provider-multispace

Terraform Provider for cascading runs across multiple workspaces.
https://registry.terraform.io/providers/mitchellh/multispace/latest/docs
Mozilla Public License 2.0

`context deadline exceeded` while triggered run is still queued #12

Open pedroslopez opened 2 years ago

pedroslopez commented 2 years ago

When a run is enqueued for a long time because the available workers are tied up, the multispace run errors with `context deadline exceeded`. I've noticed this specifically in destroy runs. A custom timeout has been set, but it doesn't seem to have any effect on destroy runs (the same issue happens on create, but there it fails after the configured timeout, as expected).

Terraform Version

Terraform 1.0.8, 1.0.9; multispace 0.1.0

Affected Resource(s)

multispace_run

Terraform Configuration Files

resource "tfe_workspace" "app" {
  for_each          = local.apps
  name              = "app-${each.key}-${var.aws_region}-${var.environment}"
  description       = "Terraform configuration for app-${each.key}"
  organization      = var.tfe_organization_name
  auto_apply        = true
  queue_all_runs    = false
  terraform_version = var.terraform_version
  working_directory = "environments/${var.aws_region}/${var.environment}/apps/${each.key}"
  trigger_prefixes  = ["modules", "shared/app"]
  tag_names         = ["app", var.environment]
}

resource "tfe_variable" "environment" {
  for_each     = tfe_workspace.app
  key          = "environment"
  category     = "terraform"
  value        = var.environment
  workspace_id = each.value.id
}

resource "multispace_run" "run" {
  for_each     = tfe_workspace.app
  organization = var.tfe_organization_name
  workspace    = each.value.name

  timeouts {
    create = "1h"
    delete = "1h"
  }

  depends_on = [
    # wait for all vars to be set before triggering run
    tfe_variable.environment,
  ]
}

Debug Output

A GitHub Gist containing the complete debug output: https://gist.github.com/pedroslopez/fffcbb4f1786246ddea8d84dacfebac5

The gist is from a different workspace where I was able to reproduce the issue.

Expected Behavior

What should have happened?

On destroy, the multispace_run should have waited up to the configured destroy timeout while the related run was still queued, or ideally it should keep waiting as long as the run is still queued.

Actual Behavior

What actually happened?

After 15 minutes, the run failed with `context deadline exceeded`. The run triggered by multispace_run eventually ran once workers became available, but by then the deadline error had already occurred.

Steps to Reproduce

This can easily be reproduced in a free Terraform Cloud organization where there are not enough workers to process the triggered run. Just have the multispace_run trigger a destroy run and see that it only waits up to 15 minutes before failing with `context deadline exceeded`.
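
For illustration, here is a rough sketch of the kind of wait loop a provider like this typically runs after queueing a Terraform Cloud run; the names (waitForRun, runID) are hypothetical and this is not the provider's actual code. If the run sits in the queue past the context's deadline, the poll exits with exactly the `context deadline exceeded` error described above.

package example

import (
	"context"
	"fmt"
	"time"

	tfe "github.com/hashicorp/go-tfe"
)

// waitForRun polls a Terraform Cloud run until it finishes or ctx expires.
// If the run is still queued when ctx's deadline passes, ctx.Err() is
// "context deadline exceeded".
func waitForRun(ctx context.Context, client *tfe.Client, runID string) error {
	for {
		select {
		case <-ctx.Done():
			// Deadline hit while the run may still be pending or queued.
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}

		run, err := client.Runs.Read(ctx, runID)
		if err != nil {
			return err
		}

		switch run.Status {
		case tfe.RunApplied, tfe.RunPlannedAndFinished:
			return nil
		case tfe.RunErrored, tfe.RunCanceled, tfe.RunDiscarded:
			return fmt.Errorf("run %s ended with status %s", runID, run.Status)
		default:
			// Still pending, queued, planning, or applying: keep waiting.
		}
	}
}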

Important Factoids

Is there anything atypical about your account that we should know? For example: running in EC2-Classic? A custom version of OpenStack? Tight ACLs?

Pretty standard Terraform Cloud for Business organization, but we only have 3 workers, so when multiple workspaces whose resources take a long time to clean up are destroyed at once, we run into this issue.

mitchellh commented 2 years ago

Hmmm, I would've thought that the Terraform SDK handles that timeout block for me, since we're just using the context provided directly by the SDK. I'll have to do some digging!
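
For context, a minimal sketch of how terraform-plugin-sdk v2 usually wires the timeouts block; this is illustrative (resourceExample and its functions are made up, not the provider's schema). The resource declares defaults via schema.ResourceTimeout, and the SDK sets the deadline on the ctx passed to each CRUD function from the effective (configured or default) value.

package example

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceExample() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceExampleCreate,
		DeleteContext: resourceExampleDelete,

		// The SDK combines these defaults with any user-supplied timeouts {}
		// block and applies the result as the deadline on ctx below.
		Timeouts: &schema.ResourceTimeout{
			Create: schema.DefaultTimeout(30 * time.Minute),
			Delete: schema.DefaultTimeout(30 * time.Minute),
		},

		Schema: map[string]*schema.Schema{},
	}
}

func resourceExampleCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// d.Timeout returns the effective value; ctx already carries the matching
	// deadline, so long-running waits should select on ctx.Done().
	_ = d.Timeout(schema.TimeoutCreate)
	d.SetId("example")
	return nil
}

func resourceExampleDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	_ = d.Timeout(schema.TimeoutDelete)
	return nil
}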

pedroslopez commented 2 years ago

Oh! So this is interesting. After seeing these errors pop up, we added the timeout block as shown in the sample configuration above, but applying those changes didn't actually record the timeouts in the state file. New runs created after we set up the timeout do have them in the state file. I'm assuming the timeouts need to be in the state file so that the right value is used when the resource is destroyed.

The same can be reproduced with a simple usage of the multispace_run resource: setting a timeout after creation, or updating the timeout value, has no effect (and terraform plan shows no changes). I'm not sure if this is specific to this provider or something at a deeper level, though.

I do see that on resource update here https://github.com/mitchellh/terraform-provider-multispace/blob/main/internal/provider/resource_run.go#L106-L109 nil is simply returned and nothing else is done. Maybe something needs to be done there to update the timeouts properly?
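
For discussion, a rough paraphrase (not a verbatim copy of resource_run.go) of a no-op update like the one linked above:

package example

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch of a no-op update, as the linked lines appear to do.
func resourceRunUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Nothing is read from d or written back here, which is the hypothesis in
	// this thread for why an apply that only changes the timeouts block has no
	// visible effect, leaving the values recorded at create time in state.
	return nil
}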

mitchellh commented 2 years ago

I don't know either. I think, at least partially, this might be worth asking about in the Terraform core GitHub as well. I'll do some research here too, but it might be useful to have two threads going in case there is a core (or core SDK) issue.