hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Removal of Backend Address Pool Association Fails #20057

Open LukasNajman opened 1 year ago

LukasNajman commented 1 year ago

Terraform Version

1.3.7

AzureRM Provider Version

3.39.1

Affected Resource(s)/Data Source(s)

azurerm_linux_virtual_machine, azurerm_lb_backend_address_pool, azurerm_network_interface_backend_address_pool_association

Terraform Configuration Files

variable "vm_count" {
  type    = number
  default = 10
}

variable "rg_name" {
  type    = string
  default = "azurerm-bug-repro"
}

variable "region" {
  type    = string
  default = "westeurope"
}

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.39.1"
    }
  }

  required_version = ">= 1.3.7"
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "test-rg" {
  name     = var.rg_name
  location = var.region
}

# NETWORK
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-test"
  location            = var.region
  resource_group_name = azurerm_resource_group.test-rg.name

  address_space = ["10.10.0.0/16"]
}

resource "azurerm_subnet" "subnet" {
  name                = "snet-test"
  resource_group_name = azurerm_resource_group.test-rg.name

  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
}

resource "azurerm_network_interface" "vm-nics" {
  count = var.vm_count

  name                = "nic-test-${count.index}"
  location            = var.region
  resource_group_name = azurerm_resource_group.test-rg.name

  ip_configuration {
    name                          = "ipconf-test-${count.index}"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
  }
}

# VMs
resource "azurerm_linux_virtual_machine" "vms" {
  count = var.vm_count

  location            = var.region
  resource_group_name = azurerm_resource_group.test-rg.name
  name                = "vm-test-${count.index}"

  size = "Standard_B1ls"
  zone = 1

  network_interface_ids = [
    azurerm_network_interface.vm-nics[count.index].id
  ]

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-minimal-jammy"
    sku       = "minimal-22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching = "ReadOnly"
    # Ephemeral OS disk is supported for VMs using Standard LRS storage account type only.
    storage_account_type = "Standard_LRS"
    disk_size_gb         = 40
  }

  admin_username                  = "ubuntu"
  admin_password                  = "Pass123*"
  disable_password_authentication = false
}

# LB
resource "azurerm_public_ip" "lb" {
  name                = "pip-test"
  location            = var.region
  resource_group_name = azurerm_resource_group.test-rg.name

  sku               = "Standard"
  allocation_method = "Static"
}

resource "azurerm_lb" "lb" {
  name                = "lbe-test"
  location            = var.region
  resource_group_name = azurerm_resource_group.test-rg.name

  sku = "Standard"
  frontend_ip_configuration {
    name                 = "ip-conf-lb-test"
    public_ip_address_id = azurerm_public_ip.lb.id
  }
}

resource "azurerm_lb_backend_address_pool" "lb" {
  name            = "lbe-pool-test"
  loadbalancer_id = azurerm_lb.lb.id
}

resource "azurerm_network_interface_backend_address_pool_association" "lb" {
  count = length(azurerm_network_interface.vm-nics)

  ip_configuration_name   = azurerm_network_interface.vm-nics[count.index].ip_configuration[0].name
  network_interface_id    = azurerm_network_interface.vm-nics[count.index].id
  backend_address_pool_id = azurerm_lb_backend_address_pool.lb.id
}

Debug Output/Panic Output

https://gist.github.com/LukasNajman/ee9efc523767ab2bb0a715ffce2d6262

Expected Behaviour

Resources created using terraform apply should be destroyable with terraform destroy.

Actual Behaviour

The resource deletion fails with an error:

╷
│ Error: waiting for removal of Backend Address Pool Association for NIC "nic-test-6" (Resource Group "azurerm-bug-repro"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM 'vm-test-6' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[]
│ 
│ 
╵
╷
│ Error: waiting for removal of Backend Address Pool Association for NIC "nic-test-3" (Resource Group "azurerm-bug-repro"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM 'vm-test-3' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[]
│ 
│ 
╵

With 10 VMs, the destroy failed 3 times out of 3. The problem does not happen with 3 VMs.

Adding an explicit dependency from azurerm_network_interface_backend_address_pool_association to azurerm_linux_virtual_machine helps. However, I consider this a workaround, not a fix, as it is not usable when the VMs and the load balancer are created in different modules:

resource "azurerm_network_interface_backend_address_pool_association" "lb" { 
  depends_on = [azurerm_linux_virtual_machine.vms]
   ...
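
For reference, the complete association resource from the configuration above, with the explicit dependency added, would look like this:

resource "azurerm_network_interface_backend_address_pool_association" "lb" {
  count = length(azurerm_network_interface.vm-nics)

  # Forces the associations to be created after, and destroyed before, the VMs.
  depends_on = [azurerm_linux_virtual_machine.vms]

  ip_configuration_name   = azurerm_network_interface.vm-nics[count.index].ip_configuration[0].name
  network_interface_id    = azurerm_network_interface.vm-nics[count.index].id
  backend_address_pool_id = azurerm_lb_backend_address_pool.lb.id
}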

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

Running in westeurope

References

https://github.com/hashicorp/terraform-provider-azurerm/issues/4330

wuxu92 commented 1 year ago

Hi @LukasNajman, does this comment help with this issue? https://github.com/hashicorp/terraform-provider-azurerm/issues/4330#issuecomment-546018260

Similar to the azurerm_lb_backend_address_pool resource, Azure allows adding a VM to an LB's Backend Address Pool asynchronously during creation, but during deletion the ordering unfortunately matters.

LukasNajman commented 1 year ago

Hi @wuxu92, thanks for the comment. I am aware of it and can confirm that adding an explicit dependency from azurerm_network_interface_backend_address_pool_association to azurerm_linux_virtual_machine solves the problem. A dependency in the inverse order also works.

But I still see that as a workaround with limitations. For example, in my case I am creating the load balancer (and thus the azurerm_network_interface_backend_address_pool_association resource) in a different module than the virtual machines. To create the dependency, I would need to pass the virtual machines as an input variable to the load balancer module. Unfortunately, that will not work, as Terraform requires dependencies to be declared statically, not through variables.

I can declare a dependency of the whole load balancer module on the VM module, and that solves the problem too (see the sketch below). But it does not seem right to me, mainly because it is not intuitive and there is no way to enforce it.
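
A minimal sketch of that module-level dependency, with hypothetical module names and source paths:

# Module names and source paths below are illustrative only.
module "vms" {
  source = "./modules/vms"
}

module "load_balancer" {
  source = "./modules/load_balancer"

  # Every resource in this module, including the backend address pool
  # association, is created after and destroyed before anything in module.vms.
  depends_on = [module.vms]
}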

From what I understand, the problem is that the deletion of the VMs and of the BE pool associations runs concurrently, and there are two possible outcomes:

  1. The VM is deleted before the BE pool association. Then I can see the following error message in the logs: {"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.Compute/virtualMachines/vm-test-8' under resource group 'azurerm-bug-repro' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"}}: timestamp=2023-01-17T12:53:29.835+0100 I suppose that is because azurerm is trying to modify the VM resource after the delete operation of the BE pool association has completed. But it seems this error is ignored and does not cause the whole destroy operation to fail.

  2. The BE pool association is deleted before the VM. In that case, we get the error from the bug report: waiting for removal of Backend Address Pool Association for NIC "nic-test-6" (Resource Group "azurerm-bug-repro"): Code="OperationNotAllowed" Message="Operation 'startTenantUpdate' is not allowed on VM 'vm-test-6' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)." Details=[] I believe this is azurerm trying to modify the VM resource, but failing because the resource is already marked for deletion. However, this error is not ignored and causes the whole operation to fail.

Apart from the described workaround, I see two solutions.

  1. If possible, detect and auto-create the dependency between the VMs and the BE pool association.
  2. Ignore the OperationNotAllowed error, as is already done with ResourceNotFound.

Dependency graph without explicit dependency (attached image: without_explicit_dependency)

Dependency graph with explicit dependency (attached image: with_explicit_dependency)

manicminer commented 1 year ago

Hi @LukasNajman, thanks for raising this and for the additional suggestions. Usually we would be unable to fix this from the provider, as the correct order of operations can only be effected by the dependency graph - so where no implicit dependency is inferred, you must explicitly create one.

However, in this case it might be possible to parse the error and infer that the association is being deleted because the VM is undergoing deletion. Achieving this, though, will likely require the use of our upcoming transport layer, which in turn will probably require the entire network package to be migrated, and that will take some time.