Azurerm VM refresh with delete_os_disk_on_termination=true is failing with a cannot find storage account error.

hashibot commented 7 years ago

This issue was originally opened by @djsly as hashicorp/terraform#15228. It was migrated here as part of the provider split. The original body of the issue is below.

Terraform Version

0.9.8

Affected Resource(s)

azurerm_virtual_machine

Terraform Configuration Files


provider "azurerm" {
  subscription_id = "${var.subscription_id}"
  client_id       = "${var.client_id}"
  client_secret   = "${var.client_secret}"
  tenant_id       = "${var.tenant_id}"
}

# Create a resource group
resource "azurerm_resource_group" "test" {
  name     = "${var.resource_group}"
  location = "${var.azure_location}"
}

resource "azurerm_virtual_network" "test" {
    name = "slyvnet"
    address_space = "${var.vnet_address_space}"
    location = "${var.azure_location}"
    resource_group_name = "${azurerm_resource_group.test.name}"
}

resource "azurerm_subnet" "test" {
    name = "slyvnetsub"
    resource_group_name = "${azurerm_resource_group.test.name}"
    virtual_network_name = "${azurerm_virtual_network.test.name}"
    address_prefix = "10.1.0.0/24"
}

resource "azurerm_network_interface" "test" {
    count = "${var.counts}"
    name = "slyni${count.index}"
    location = "${var.azure_location}"
    resource_group_name = "${azurerm_resource_group.test.name}"

    ip_configuration {
        name = "testconfiguration1"
        subnet_id = "${azurerm_subnet.test.id}"
        private_ip_address_allocation = "dynamic"
    }
}

resource "azurerm_storage_account" "test" {
    count = "${var.counts}"
    name = "slysa${count.index}"
    resource_group_name = "${azurerm_resource_group.test.name}"
    location = "${var.azure_location}"
    account_type = "Standard_LRS"

}

resource "azurerm_storage_container" "test" {
    count = "${var.counts}"
    name = "vhds"
    resource_group_name = "${azurerm_resource_group.test.name}"
    storage_account_name = "${azurerm_storage_account.test.*.name[count.index]}"
    container_access_type = "private"
}

resource "azurerm_virtual_machine" "test" {
    count = "${var.counts}"
    name = "slyvm${count.index}"
    location = "${var.azure_location}"
    resource_group_name = "${azurerm_resource_group.test.name}"
    network_interface_ids = ["${azurerm_network_interface.test.*.id[count.index]}"]
    vm_size = "Standard_A0"
    delete_os_disk_on_termination = "true"

    storage_image_reference {
        publisher = "Canonical"
        offer = "UbuntuServer"
        sku = "14.04.2-LTS"
        version = "latest"
    }

    storage_os_disk {
        name = "myosdisk1"
        vhd_uri = "${azurerm_storage_account.test.*.primary_blob_endpoint[count.index]}${azurerm_storage_container.test.*.name[count.index]}/myosdisk1.vhd"
        caching = "ReadWrite"
        create_option = "FromImage"
    }

    os_profile {
        computer_name = "hostnamee${count.index}"
        admin_username = "testadmin"
        admin_password = "Password1234!"
    }

    os_profile_linux_config {
        disable_password_authentication = false
    }
}

Debug Output

https://gist.github.com/djsly/11300a541a92432002a843509b1fb1ed

Expected Behavior

The VM refresh should delete the OS disk and then proceed with the deletion.

Actual Behavior

Terraform errors out while trying to delete the blob.

Steps to Reproduce

terraform apply
change the hostname of the VM
terraform apply

tombuildsstuff commented 7 years ago

Hey @djsly

Thanks for opening this issue :)

I've spent some time looking into this, but I'm struggling to reproduce it. The error message being returned states that the Storage Account doesn't exist (or there's an eventual consistency bug in the API); however, I'd expect to be able to reproduce that, and so far I've been unsuccessful.

So that we can investigate this further - would you be able to answer the following:

It appears you're using a script to invoke Terraform. Out of interest, is it possible that this was run twice?
Is it possible that the Storage Account was deleted via another means (e.g. in the portal)?

Thanks!

bpoland commented 7 years ago

I am also seeing this issue pretty frequently. We are using an outside script to invoke Terraform but definitely not running it twice. And I have confirmed that the storage account was not deleted outside Terraform before running the destroy (the OS disk for the VM being destroyed is in that storage account so I wouldn't have been able to delete it before the VM was destroyed).

Interestingly enough, a second attempt to destroy seems to succeed every time, so this does seem like some sort of consistency/timing issue.

I am going to see if I can reproduce when running terraform manually to destroy.

djsly commented 7 years ago

I can still reproduce this using the exact same config file:

Error applying plan:

1 error(s) occurred:

* azurerm_virtual_machine.test (destroy): 1 error(s) occurred:

* azurerm_virtual_machine.test: Error deleting OS Disk VHD: Error finding resource group for storage account slysa0: Wrong number of results making resource request for query name eq 'slysa0' and resourceType eq 'Microsoft.Storage/storageAccounts': 0

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

djsly commented 7 years ago

The configuration files provided in the original post are missing the following:

variable "subscription_id" {}
variable "client_id" {}
variable "client_secret" {}
variable "tenant_id" {}

variable "counts" {
 default = "1"
}
variable "resource_group" {
 default = "test"
}
variable "azure_location" {
 default = "eastus2"
}
variable "vnet_address_space" {
 default = ["10.1.0.0/24"]
}

provider "azurerm" {
  subscription_id = "${var.subscription_id}"
  client_id       = "${var.client_id}"
  client_secret   = "${var.client_secret}"
  tenant_id       = "${var.tenant_id}"
}

djsly commented 7 years ago

What I did was run terraform apply and, once it completed, edit azure.tf to update the name of the VM:

resource "azurerm_virtual_machine" "test" {
    count = "${var.counts}"
    name = "<NEWNAME>${count.index}"

and reran terraform apply

bpoland commented 7 years ago

In my case, there are no changes to the VM name or anything. I've just deployed a VM and then later want to destroy it, and that's when I see the error. Then when I try destroying again, it succeeds.

bpoland commented 7 years ago

Actually I realized we are detaching a secondary disk right before we delete the VM (this is done using the azure CLI). I wonder if Azure is still propagating that change when the delete comes in?

I am going to try a delete without the secondary disk involved, and one with the secondary disk, and see if that seems to be related.

@djsly did you make any changes to your VM or its configuration outside of terraform?

bpoland commented 7 years ago

I was able to reproduce even without a secondary disk, so that doesn't seem to be it. I created a VM with terraform, waited for a few minutes and then ran "terraform destroy" and saw the issue. I am using a custom VHD file for my VMs; could that be it? It looks like @djsly is also using a custom VHD file.

One other interesting thing -- I noticed that when this error happens, even after I run terraform again to destroy, my VM's OS disk still remains in the storage account (I have delete_os_disk_on_termination set to true). Is this error happening when terraform tries to delete the OS disk after terminating the VM? It seems like the second time through, during the refresh it doesn't find the VM in Azure and so it doesn't try again to destroy the OS disk?

djsly commented 7 years ago

@bpoland

> @djsly did you make any changes to your VM or its configuration outside of terraform?

No, I only use Terraform CLI and never log on to the portal.

> It looks like @djsly is also using a custom VHD file.

I'm using the official Ubuntu Image as my Base Image for the sake of this example. So no custom VHD

bpoland commented 7 years ago

Ah sorry I see the Ubuntu image in your terraform config above.

@tombuildsstuff did you have delete_os_disk_on_termination = "true" when you were trying it out?

I am struggling to find any common "weird stuff" between @djsly's config and mine that could explain why we are the only ones seeing this.

I started to see this issue maybe 3 weeks ago or so, and it didn't seem to be triggered by any changes to my configs (or a new version of terraform). So I was thinking maybe something changed on the Azure side. I just added a retry since that seemed to work (and hoped that Azure would fix things). It would still be nice to know for sure.

djsly commented 7 years ago

FYI: I simply used the official Azure example from terraform's website and I added delete_os_disk_on_termination = "true"

bpoland commented 7 years ago

I've pasted some debug output that I'm getting here: https://gist.github.com/bpoland/dd300ccc387a1671b060d01adb4734e6

A colleague noticed that the response from Azure includes no results but does include a "nextLink" -- is it possible the results are paginated and terraform needs to get the next "page" of results to find the storage account? The Azure subscription I'm working in has a lot of resources so maybe others don't see this if they have fewer resources. @djsly are there a lot of resources in the Azure subscription you're using?

djsly commented 7 years ago

I'm not sure what "a lot" refers to :) but we have a total of 975 items and 156 storage accounts.

I guess it could be identified as a lot hehe

bpoland commented 7 years ago

Haha yeah hard to say what "a lot" is :)

@tombuildsstuff when you were trying to reproduce, how many resources did you have in your azure subscription? Any thoughts about the pagination? Thanks!

JunyiYi commented 6 years ago

Hi @djsly , I used your tf files and the following steps:

run terraform apply
Change VM name
run terraform apply

But the issue did not reproduce; the run ended with: Apply complete! Resources: 1 added, 0 changed, 1 destroyed. Can you confirm whether the issue still exists?

djsly commented 6 years ago

Hi @JunyiYi, we moved to managed disks, so we haven't exercised this code path for a while. I do not mind closing it as it was probably fixed by now :)

bpoland commented 6 years ago

Has anyone made any changes that they think should fix this problem? I think you need to be using a subscription with a lot of storage accounts in order to see the problem, because some results coming back from Azure are paginated and that causes terraform to not be able to find the storage account.

JunyiYi commented 6 years ago

Thanks @djsly, let me close this issue now. @bpoland, my subscription contains 46 storage accounts. Could you please create a new issue with your Terraform HCL and reproduction steps? Let's track only one issue in this thread. Thanks.

bpoland commented 6 years ago

@JunyiYi the issue I experienced is the exact one that @djsly reported in this issue. It seems that in order to reproduce you need to have a large number of storage accounts in the subscription. Could you try creating 50 or 100 more storage accounts temporarily in your subscription and then see if you can reproduce it?

djsly commented 6 years ago

@bpoland is correct, we used to have over 200 storage accounts (one per VM)

ljfranklin commented 6 years ago

We're still seeing this as well under Terraform v0.11.3; not sure what the provider version was. Same boat as everyone else: destroy fails when delete_os_disk_on_termination=true and we have a large number of storage accounts (>100).

bpoland commented 6 years ago

@JunyiYi @tombuildsstuff would you be able to reopen this issue since it was never actually fixed?

markround commented 6 years ago

I know it's bad form to "bump" or add a "me too", but I just ran into this bug. Please could it be re-opened as it's not fixed?

I have some 60-odd Azure storage accounts holding disk images and delete_os_disk_on_termination=true in the VM config, and I am seeing this every time I try to delete a VM which has its disk image in a storage account whose name starts with an x; others (e.g. starting with d or whatever) work fine.

It therefore looks to be exactly the same issue with Terraform not paginating results returned from the Azure API so it assumes the storage account does not exist.

bpoland commented 6 years ago

I ended up moving to managed disks but we still had problems with it before we switched. My workaround was to add a separate azurerm_storage_blob resource for the OS disk:

resource "azurerm_storage_blob" "vm_os" {
    name = "${var.vm_name}-os.vhd"
    resource_group_name = "${var.azure_resource_group}"
    storage_account_name = "${var.storage_account_name}"
    storage_container_name = "vhds"
}

Then, in the VM itself, turn delete_os_disk_on_termination off and add depends_on = [ "azurerm_storage_blob.vm_os" ].
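
For illustration, the VM side of this workaround might look roughly like the sketch below. It is not taken from a tested configuration. var.vm_name, var.azure_resource_group and var.storage_account_name come from the blob resource above; var.azure_location and var.network_interface_id are hypothetical variables added for the example, and the vhd_uri assumes the public Azure blob endpoint format. The image and OS profile settings are copied from the configuration in the original report.

```hcl
resource "azurerm_virtual_machine" "vm" {
    name = "${var.vm_name}"
    location = "${var.azure_location}"                       # hypothetical variable
    resource_group_name = "${var.azure_resource_group}"
    network_interface_ids = ["${var.network_interface_id}"]  # hypothetical variable
    vm_size = "Standard_A0"

    # Leave OS disk cleanup to azurerm_storage_blob.vm_os instead of the
    # VM's own delete-on-termination path, which is where the error occurs.
    delete_os_disk_on_termination = false

    storage_image_reference {
        publisher = "Canonical"
        offer = "UbuntuServer"
        sku = "14.04.2-LTS"
        version = "latest"
    }

    storage_os_disk {
        name = "${var.vm_name}-os"
        # Assumption: this must resolve to the same blob that
        # azurerm_storage_blob.vm_os manages (container "vhds").
        vhd_uri = "https://${var.storage_account_name}.blob.core.windows.net/vhds/${var.vm_name}-os.vhd"
        caching = "ReadWrite"
        create_option = "FromImage"
    }

    os_profile {
        computer_name = "${var.vm_name}"
        admin_username = "testadmin"
        admin_password = "Password1234!"
    }

    os_profile_linux_config {
        disable_password_authentication = false
    }

    # Terraform destroys a resource before the resources it depends on,
    # so the VM is removed first and the blob afterwards.
    depends_on = ["azurerm_storage_blob.vm_os"]
}
```

Because the blob resource is given resource_group_name explicitly, its deletion presumably does not need the subscription-wide storage account lookup that fails in the error above.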

But this is absolutely still an issue with the provider.

markround commented 6 years ago

@JunyiYi Could this be re-opened, please? Or would you prefer that I create a new (duplicate) issue? As mentioned above, we're seeing this exact same issue, and for various reasons cannot move to managed disks to work around the problem, or switch to a separate blob store resource.

ghost commented 5 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!
