hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

VM Resources from For_each on destroy due to updates fails with NIC in use error. #18888

Open jeevanions opened 2 years ago

jeevanions commented 2 years ago

Terraform Version

1.2.9

AzureRM Provider Version

latest

Affected Resource(s)/Data Source(s)

azurerm_network_interface

Terraform Configuration Files

## Assume that the code below is inside a local submodule.

variable "gitlab_runners" {
  description = "Describes how many VMs should be provisioned for gitlab runners installation. Base64 encoded"
  type        = string
  # Base64 for '{"0":{"disk_size_gb":20,"extra_tags":[],"size":"Standard_B1ms"}}'. Because optional is
  # an experimental feature, an empty string causes an error, and an empty base64-encoded object would
  # overwrite the default value of the module variable anyway.
  default = "eyIwIjp7ImRpc2tfc2l6ZV9nYiI6MjAsImV4dHJhX3RhZ3MiOltdLCJzaXplIjoiU3RhbmRhcmRfQjFtcyJ9fQ=="
}

variable "location" {
  default = "eastus"
}

variable "gitlab_runner_name" {
  default = "gitlab-runner"
}

data "template_file" "init" {
  template = file("${path.module}/cloud-init.sh")
  for_each = var.gitlab_runners
  vars = {
    tags = join(",", concat(["azure"], each.value.extra_tags))
  }
}

resource "azurerm_resource_group" "resource_group" {
  name     = "gitlab-eastus-rg"
  location = var.location
}

resource "azurerm_virtual_network" "virtual_network" {
  name                = "gitlab-eastus-vnet"
  location            = var.location
  resource_group_name = azurerm_resource_group.resource_group.name
  address_space       = tolist([module.cidr.cidr_block])
}

resource "azurerm_subnet" "subnet" {
  name                 = "gitlab-eastus-subnet"
  resource_group_name  = azurerm_resource_group.resource_group.name
  virtual_network_name = azurerm_virtual_network.virtual_network.name
  address_prefixes     = ["any valid cidr"]
  service_endpoints    = ["Microsoft.Storage", "Microsoft.KeyVault"]
}

resource "azurerm_network_interface" "gitlab_runner_nic" {
  name                = "${var.gitlab_runner_name}-${each.key}-nic"
  resource_group_name = module.resource_group.name
  location            = var.location
  tags                = module.label.tags
  for_each = var.gitlab_runners

  ip_configuration {
    name                          = "Configuration"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "gitlab_runner_vm" {
  for_each              = var.gitlab_runners
  name                  = "${var.gitlab_runner_name}-${each.key}"
  resource_group_name   = azurerm_resource_group.resource_group.name
  location              = var.location 
  network_interface_ids = [azurerm_network_interface.gitlab_runner_nic[each.key].id]
  size                  = each.value.size
  computer_name         = "${var.gitlab_runner_name}-${each.key}"
  admin_username        = "azadmin"
  custom_data           = base64encode(data.template_file.init[each.key].rendered)

  os_disk {
    name                   = "${var.gitlab_runner_name}-${each.key}-disk"
    caching                = "ReadWrite"
    storage_account_type   = "Standard_LRS"
    disk_size_gb           = each.value.disk_size_gb
    disk_encryption_set_id = azurerm_disk_encryption_set.default.id
  }

  admin_ssh_key {
    username   = "azadmin"
    public_key = var.ssh_public_key
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts"
    version   = "20.04.202201310"
  }

  identity {
    type         = "SystemAssigned"
  }

  depends_on = [
    azurerm_network_interface.gitlab_runner_nic
  ]
}
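
Note on the configuration above: `gitlab_runners` is declared as a base64-encoded string, while `for_each` needs a map or a set of strings, so the real module presumably decodes the value first. A minimal sketch of that step, assuming a local named `gitlab_runners` (hypothetical, not part of the configuration as posted):

```hcl
# Hypothetical helper, not in the configuration as posted: decode the
# base64-encoded JSON variable into a value that for_each can iterate over.
locals {
  gitlab_runners = jsondecode(base64decode(var.gitlab_runners))
}

# With the default value this yields one entry keyed "0" with
# disk_size_gb = 20, extra_tags = [] and size = "Standard_B1ms".
# The data source and resources would then use:
#   for_each = local.gitlab_runners
```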

## Updates to the terraform code
We updated the Terraform configuration to add more VMs in the root module and remove the VMs created by the submodule. `terraform plan` did not complain, but `terraform apply` failed to delete the NIC, reporting that it was in use. Checking the Azure resources showed that the VM had actually been deleted, yet the NIC was still reported as in use.

We had to rerun `terraform apply` a couple more times before it resolved itself. Azure might still be serving a cached version of the state in which the NIC is in use. From the output below you can see that each run eventually gets the right state of the resources from Azure. This behaviour makes it hard to use Terraform with Azure. My fellow DevOps engineers feel that Azure is often slow and that we need to find workarounds for these kinds of gotchas. There is never a straightforward way of achieving a requirement.

Debug Output/Panic Output

Run 1: 

 Error: 
 deleting Network Interface: (Name "gitlab-runner1-0-nic" / Resource Group "gitlab-eastus-rg"): 
 network.InterfacesClient#Delete: Failure sending request: StatusCode=400 
 -- Original Error: Code="NicInUse" Message="Network Interface /subscriptions/<subid>/resourceGroups/gitlab-eastus-rg/providers/Microsoft.Network/networkInterfaces/gitlab-runner1-0-nic 
 is used by existing resource 
 /subscriptions/<subid>/resourceGroups/GITLAB-EASTUS-RG/providers/Microsoft.Compute/virtualMachines/gitlab-runner1-0. 
 In order to delete the network interface, it must be dissociated from the resource. To learn more, see aka.ms/deletenic." Details=[]

Run 2: 

Error: deleting Subnet: (Name "gitlab-eastus-subnet" / Virtual Network Name "gitlab-eastus-vnet" / Resource Group "gitlab-eastus-rg"): 
network.SubnetsClient#Delete: Failure sending request: StatusCode=400 
-- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet gitlab-eastus-subnet 
is in use by /subscriptions/<subid>/resourceGroups/gitlab-eastus-rg/providers/Microsoft.Network/networkInterfaces/gitlab-runner1-0-nic/ipConfigurations/Configuration and cannot be deleted. 
In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]

Run 3: Succeeded

Expected Behaviour

The first run should remove the resources without any errors.

Actual Behaviour

See output

Steps to Reproduce

  1. Use the example above to create a Terraform config with a local submodule, then apply the changes.
  2. Modify the config: remove the VM configs from the submodule and add any Azure resources to main.tf (not in the module).
  3. Apply the new config. The plan shows the intended changes, which is fine, but the apply fails with a NIC in use error even though the VM that was using the NIC has been deleted.
  4. Subsequent applies eventually bring the state back in sync.

Important Factoids

No response

References

No response

myc2h6o commented 2 years ago

Hi @jeevanions, thanks for opening the issue! Looking through it, I think there are two possible causes:

  1. The resource dependency is somehow broken on or before the second apply. To verify whether this is the case, check whether the very first NIC deletion message appears after the VM's deletion-complete message. The same applies to the subnet.
  2. The Azure API breaks internally when deleting the VM/NIC, causing it to return before the association with other resources is cleaned up, similar to #15728. As this may be environment-specific, I suggest opening a support ticket with Azure to check the services. You can set the TF_LOG environment variable to DEBUG, as described here, to get the Azure API details, which may help with troubleshooting.
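
If the second cause applies, one possible stopgap is to put a delay between the VM deletion and the NIC deletion. The following is a minimal sketch, not a confirmed fix: it assumes the `hashicorp/time` provider is added to the module, and the resource name `wait_for_nic_release` and the 60-second duration are hypothetical values chosen for illustration.

```hcl
# Assumption: the hashicorp/time provider is available in this module.
# One sleep instance per NIC; referencing the NIC resource in for_each
# also makes each sleep depend on the NICs.
resource "time_sleep" "wait_for_nic_release" {
  for_each = azurerm_network_interface.gitlab_runner_nic

  # Only delays on destroy; "60s" is an arbitrary illustrative value.
  destroy_duration = "60s"
}

# In azurerm_linux_virtual_machine.gitlab_runner_vm, the existing
# depends_on would then point at the sleep instead of the NIC:
#
#   depends_on = [time_sleep.wait_for_nic_release]
#
# Intended destroy order: VM, then the sleep (which waits), then the NIC.
```

The sleep only takes effect on destroy if it was already created by an earlier apply, and it does not address the underlying API behaviour, so a longer Azure-side delay could still produce the same error.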