hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0
4.6k stars 4.64k forks source link

Terraform destroy hangs for 5 minutes exactly when refreshing state #15792

Open timwebster9 opened 2 years ago

timwebster9 commented 2 years ago

Community Note

Terraform (and AzureRM Provider) Version

Terraform v1.0.11
azurerm v2.97.0

Affected Resource(s)

Terraform Configuration Files

I'm not certain this matters - it seems to usually be the private endpoint it hangs on, but in the log snippets below you'll see hangs on different types of resources.

resource "azurerm_private_endpoint" "str" {
  name                = "${var.global_settings.resource-prefix}-str-pe"
  location            = azurerm_resource_group.policy_test.location
  resource_group_name = azurerm_resource_group.policy_test.name
  subnet_id           = azurerm_subnet.endpoint-str.id

  private_service_connection {
    name                           = "${var.global_settings.resource-prefix}-str-psc"
    private_connection_resource_id = azurerm_storage_account.policy_test.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  tags = merge(var.default_tags,{})
}

Debug Output

I have a complete trace file, but cannot include here for privacy reasons. Below are snippets of what it's stuck on.

NOTE: these are from two different builds. In the first one the delay seems to be on a databricks workspace, and the second one seems to be on a private endpoint.

without trace (note the 5 minute delay between log messages):

2022-03-10T14:18:43.1176949Z azurerm_databricks_workspace.databricks: Refreshing state... [id=/subscriptions/***/resourceGroups/***/providers/Microsoft.Databricks/workspaces/example-workspace]

2022-03-10T14:23:42.7266280Z azurerm_private_endpoint.str: Refreshing state... [id=/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/privateEndpoints/dev-test-uks-str-pe]

with trace (this continues for exactly 5 minutes):

2022-03-10T14:36:21.2962242Z 2022-03-10T14:36:21.295Z [TRACE] dag/walk: vertex "azurerm_private_endpoint.str (expand)" is waiting for "azurerm_storage_account.policy_test (expand)"
2022-03-10T14:36:21.9907808Z 2022-03-10T14:36:21.990Z [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:22.0029122Z 2022-03-10T14:36:22.002Z [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)"
2022-03-10T14:36:22.0293905Z 2022-03-10T14:36:22.028Z [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:26.2972287Z 2022-03-10T14:36:26.296Z [TRACE] dag/walk: vertex "azurerm_private_endpoint.str (expand)" is waiting for "azurerm_storage_account.policy_test (expand)"
2022-03-10T14:36:26.9912249Z 2022-03-10T14:36:26.990Z [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:27.0032895Z 2022-03-10T14:36:27.002Z [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)"
2022-03-10T14:36:27.0294758Z 2022-03-10T14:36:27.028Z [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:31.2986301Z 2022-03-10T14:36:31.298Z [TRACE] dag/walk: vertex "azurerm_private_endpoint.str (expand)" is waiting for "azurerm_storage_account.policy_test (expand)"
2022-03-10T14:36:31.9914146Z 2022-03-10T14:36:31.990Z [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:32.0033661Z 2022-03-10T14:36:32.002Z [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)"
2022-03-10T14:36:32.0296305Z 2022-03-10T14:36:32.029Z [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:36.2998304Z 2022-03-10T14:36:36.299Z [TRACE] dag/walk: vertex "azurerm_private_endpoint.str (expand)" is waiting for "azurerm_storage_account.policy_test (expand)"
2022-03-10T14:36:36.9917215Z 2022-03-10T14:36:36.991Z [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "azurerm_private_endpoint.str (expand)"
2022-03-10T14:36:37.0038936Z 2022-03-10T14:36:37.003Z [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/azurerm\"] (close)"

Panic Output

n/a

Expected Behaviour

It shouldn't hang

Actual Behaviour

It hangs at this point for pretty much exactly 5 minutes. After the 5 minute delay, it continues on successfully.

Steps to Reproduce

run 'terraform destroy` on the configuration

Important Factoids

n/a

References

n/a

ms-zhenhua commented 2 years ago

Hi @timwebster9, I tried to reproduce this issue with the following configuration for dozens of times, but the issue never happened. How often did you meet this issue ? Could you provide an executable configuration which can steadily reproduce such issue ? Then I can do more investigation on it. Thanks.

resource "azurerm_resource_group" "test" {
  name     = "demo-rg"
  location = "westus2"
}

resource "azurerm_virtual_network" "test" {
  name                = "acctest-vnet-demo"
  location            = azurerm_resource_group.test.location
  resource_group_name = azurerm_resource_group.test.name
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "public" {
  name                 = "acctest-sn-public-demo"
  resource_group_name  = azurerm_resource_group.test.name
  virtual_network_name = azurerm_virtual_network.test.name
  address_prefixes     = ["10.0.1.0/24"]

  delegation {
    name = "acctest"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"

      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}

resource "azurerm_subnet" "private" {
  name                 = "acctest-sn-private-demo"
  resource_group_name  = azurerm_resource_group.test.name
  virtual_network_name = azurerm_virtual_network.test.name
  address_prefixes     = ["10.0.2.0/24"]

  delegation {
    name = "acctest"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"

      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }
}

resource "azurerm_subnet" "privatelink" {
  name                 = "acctest-snpl-demo"
  resource_group_name  = azurerm_resource_group.test.name
  virtual_network_name = azurerm_virtual_network.test.name
  address_prefixes     = ["10.0.3.0/24"]

  enforce_private_link_endpoint_network_policies = true
}

resource "azurerm_network_security_group" "nsg" {
  name                = "acctest-nsg-demo"
  location            = azurerm_resource_group.test.location
  resource_group_name = azurerm_resource_group.test.name
}

resource "azurerm_subnet_network_security_group_association" "public" {
  subnet_id                 = azurerm_subnet.public.id
  network_security_group_id = azurerm_network_security_group.nsg.id
}

resource "azurerm_subnet_network_security_group_association" "private" {
  subnet_id                 = azurerm_subnet.private.id
  network_security_group_id = azurerm_network_security_group.nsg.id
}

resource "azurerm_databricks_workspace" "test" {
  name                        = "acctestDBW-demo"
  resource_group_name         = azurerm_resource_group.test.name
  location                    = azurerm_resource_group.test.location
  sku                         = "premium"
  managed_resource_group_name = "acctestRG-DBW-demo-managed"

  public_network_access_enabled         = false
  network_security_group_rules_required = "NoAzureDatabricksRules"

  custom_parameters {
    no_public_ip        = true
    public_subnet_name  = azurerm_subnet.public.name
    private_subnet_name = azurerm_subnet.private.name
    virtual_network_id  = azurerm_virtual_network.test.id

    public_subnet_network_security_group_association_id  = azurerm_subnet_network_security_group_association.public.id
    private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.private.id
  }

  tags = {
    Environment = "Production"
    Pricing     = "Premium"
  }
}

resource "azurerm_private_endpoint" "databricks" {
  name                = "acctest-endpoint-demo"
  location            = azurerm_resource_group.test.location
  resource_group_name = azurerm_resource_group.test.name
  subnet_id           = azurerm_subnet.privatelink.id

  private_service_connection {
    name                           = "acctest-psc-demo"
    is_manual_connection           = false
    private_connection_resource_id = azurerm_databricks_workspace.test.id
    subresource_names              = ["databricks_ui_api"]
  }
}
timwebster9 commented 2 years ago

hi @ms-zhenhua,

I've included the exact configuration in the attached zip file. The only changes I've made to it are the tag values, and removed the tenant ID from the Key Vault ARM template (which you can just sub with your own).

I haven't tried to reproduce this in a different environment, so hopefully you can reproduce it. I was hoping the 5 minute thing would trigger someone's memory, as it seems like a deliberate timeout/wait setting somewhere.

This was normally run from a private Azure DevOps build agent using a private-linked storage account for Terraform state. It is the only configuration to have the issue. It was just a testing pipeline so I've disabled it for now (it also had other general flakiness issues).

test.zip

joshua-hancox commented 1 year ago

I also have the same issue with a private endpoint. For me though, this is any action after the endpoint is created, including terraform plan.

I see a similar issue here: https://github.com/hashicorp/terraform-provider-azurerm/issues/16754

terraform version: v1.3.5 azurerm version: v3.39.1

When checking the logs to see the time delay, I see a message that I do not see for any other resource, and relates to timeouts:

.timeouts: planned for existence but config wants absence

2023-03-01T17:44:00.168Z [WARN] Provider "registry.terraform.io/hashicorp/azurerm" produced an invalid plan for module.uks_recovery_services_vaults_backup["1"].azurerm_private_endpoint.secondary_recovery_services_vault_private_endpoints["AzureBackup"], but we are tolerating it because it is using the legacy plugin SDK. The following problems may be the cause of any confusing errors from downstream operations:

  • .custom_network_interface_name: planned value cty.StringVal("") for a non-computed attribute
  • .timeouts: planned for existence but config wants absence
  • .private_service_connection[0].request_message: planned value cty.StringVal("") for a non-computed attribute
  • .private_service_connection[0].private_connection_resource_alias: planned value cty.StringVal("") for a non-computed attribute 2023-03-01T17:48:56.950Z [WARN]..... next log

If I look in the state file, timeouts for every other resource looks like this:

"timeouts":  null

but timeouts for the private endpoint looks like this:

"timeouts": {
  "create": null,
  "delete": null,
  "read": null,
  "update": null
}

I can't see any difference in implementation in the provider though.

Not sure if any of this helps.

theumannsuncor commented 1 year ago

In terms of 5 minute delays, I've seen this when you hit your API limits on Azure. So, depending on what you're doing, if you pass the allotted amount of API calls in 5 minutes, you will get locked, possibly for 5 minutes. Just a guess. It's the only place I've seen a '5 minute' delay before. I get it with Azure Storage Explorer or if I'm doing too many API calls creating/destroying storage accounts/blobs. TF hangs until the API opens up again and lets more calls through.

https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling

duncanaf commented 1 year ago

This is related to #20257. If you cannot connect to the storage account over the data plane (e.g. because it's only available from a private endpoint and/or DNS resolution fails), then terraform hangs for 5 minutes, and then seems to carry on without error.