Virtual machine creation fails with RetryableError

dbergel commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Terraform Version

1.1.9

AzureRM Provider Version

3.0.2

Affected Resource(s)/Data Source(s)

azurerm_linux_virtual_machine, azurerm_windows_virtual_machine

Terraform Configuration Files

https://github.com/teradici/Azure_Deployments/tree/master/terraform-deployments/deployments/cas-mgr-load-balancer-one-ip-nat

Debug Output/Panic Output

2022-05-18T15:17:05.635Z [DEBUG] provider.terraform-provider-azurerm_v3.0.2_x5: AzureRM Response for https://management.azure.com/subscriptions/<redacted>/providers/Microsoft.Compute/locations/centralus/operations/90793864-6ba7-488d-b142-0b8128735630?p=7b61c3cd-cc9b-4d18-8ca3-a9c1b12efefd&api-version=2021-11-01: 
HTTP/2.0 200 OK
Cache-Control: no-cache
Content-Type: application/json; charset=utf-8
Date: Wed, 18 May 2022 15:17:05 GMT
Expires: -1
Pragma: no-cache
Server: Microsoft-HTTPAPI/2.0
Server: Microsoft-HTTPAPI/2.0
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
X-Ms-Correlation-Request-Id: 83831b17-0254-51a9-b61d-b1a32674682d
X-Ms-Ratelimit-Remaining-Resource: Microsoft.Compute/GetOperation3Min;14975,Microsoft.Compute/GetOperation30Min;29895
X-Ms-Ratelimit-Remaining-Subscription-Reads: 11998
X-Ms-Request-Id: 903c4659-5b75-4522-a813-d9787a578144
X-Ms-Routing-Request-Id: WESTUS2:20220518T151705Z:fc57f1eb-6172-446a-8d0e-5976ae40a901

{
  "startTime": "2022-05-18T15:16:53.6761603+00:00",
  "endTime": "2022-05-18T15:16:57.7073471+00:00",
  "status": "Failed",
  "error": {
    "code": "RetryableError",
    "message": "A retryable error occurred."
  },
  "name": "90793864-6ba7-488d-b142-0b8128735630"
}: timestamp=2022-05-18T15:17:05.634Z

Expected Behaviour

Virtual machines provisioned successfully, retryable errors automatically retried.

Actual Behaviour

Workstation provisioning intermittently fails with a generic "retryable error" and no additional information. The failure does not reproduce on every run.

  Error: waiting for creation of Linux Virtual Machine: (Name "dbj2m-scent-0" / Resource Group "cas-mgr-load-balancer-one-ip-nat-dbj2m"): Code="RetryableError" Message="A retryable error occurred."

    with module.centos-std-vm.azurerm_linux_virtual_machine.centos-std-vm["linux_std_0"],
    on ../../modules/centos-std-vm/main.tf line 50, in resource "azurerm_linux_virtual_machine" "centos-std-vm":
    50: resource "azurerm_linux_virtual_machine" "centos-std-vm" {

  Error: waiting for creation of Windows Virtual Machine: (Name "dbj2m-swin-0" / Resource Group "cas-mgr-load-balancer-one-ip-nat-dbj2m"): Code="RetryableError" Message="A retryable error occurred."

    with module.windows-std-vm.azurerm_windows_virtual_machine.windows-std-vm["windows_std_0"],
    on ../../modules/windows-std-vm/main.tf line 47, in resource "azurerm_windows_virtual_machine" "windows-std-vm":
    47: resource "azurerm_windows_virtual_machine" "windows-std-vm" {

Steps to Reproduce

terraform apply

Important Factoids

No response

References

No response

myc2h6o commented 2 years ago

Hi @dbergel, thanks for opening the issue! From the config and the error, I'm not able to identify the root cause, but the additional information in #8052 may help with troubleshooting. That issue involved a problem creating the VM/VMSS while the load balancer was updating the VNet. Would you be able to find additional details in the Azure Portal related to the deployment failure?

ekristen commented 2 years ago

There are two issues at play here: why Azure is throwing an error, especially "A retryable error occurred", and how the Terraform provider is handling that error.

Since this error is specifically stated to be retryable, it should not be treated as fatal; instead, the provider should simply retry the API call.

I see this happen with the same resource, azurerm_linux_virtual_machine. It starts the creation, I get a "Still creating [10s elapsed]", then the error happens and Terraform exits. Since this specific error says it's retryable, I would suggest the provider simply retry whenever it encounters this error. We'd then get a "Still creating [20s elapsed]" and so on until the VM is created, the internal timer (10-20 minutes) is hit, or a non-retryable error is encountered.

One additional note: this is a pretty common theme throughout the provider and it makes it very frustrating to use, which is not entirely on the provider itself as the Azure API is just terribly inconsistent, but retrying retryable errors instead of treating them as fatal would go a long way in improving the user experience of this provider.

alok0310 commented 2 years ago

This is really frustrating, as we only see this issue when using Azure, not AWS. Does it have anything to do with the number of threads used by terraform apply?

ekristen commented 2 years ago

@alok0310 I don't believe so. From what I can tell, this is partly due to the Azure API just being junky. At the same time, in my opinion, the provider is not retrying errors that the Azure API marks as "retryable" and instead exits hard as if a fatal error had occurred.

mtin commented 1 year ago

I am also experiencing these errors. They are particularly annoying in pipelines, because the run exits with a hard error and fails the deployment even though the message suggests simply retrying (which does work, but means starting our pipelines again from the very beginning). Same resource as mentioned above, azurerm_linux_virtual_machine...

eh-michael commented 1 year ago

Hello, I am also experiencing this error when deploying a VM with the azurerm_windows_virtual_machine resource. I would appreciate any assistance or insight into this, and I'm happy to run any troubleshooting steps provided.

mathbab commented 1 year ago

Hello, is there any retry mechanism in place for this issue? I have observed the RetryableError, and the health history in the Azure portal says...

"Unavailable: Resource health event (Unplanned) At Saturday, June 10, 2023 at 5:06:18 AM XXX, the Azure monitoring system received the following information regarding your Virtual machine: Your virtual machine is unavailable at the moment. Please check back in a few minutes for any updates we find on the source of the unavailability of this VM. No additional action is required from you at this time."

and the VM was already up by the time it was checked in the portal... so a retry from the provider is much needed.

mpo-me commented 1 year ago

Hello,

we experience this problem occasionally on the provisioning of a simple VM at Azure and it makes the azurerm provider very unpredictable and unstable. I also agree that the API is not being used correctly because the error code "RetryableError" communicates to the API consumer that it could and (in my opinion) should be retried.

We have therefore developed a complex logic (many wrapper scripts) around Terraform to manually detect and handle such errors (deleting and re-provisioning), but the whole thing is very messy and makes the use of Terraform absurd.

Is there any news on this topic?

TCDooM commented 1 year ago

Same issue here. It seems the provider needs to implement a retry on retryable errors from Azure, or at least not fail without updating the state...

MvRoo commented 11 months ago

We were running into this as well while creating a VM using the azurerm_linux_virtual_machine resource. We found out through the Azure activity logs that, in our case, this was caused by parallel updates Terraform was making to the subnet that we were also deploying the VM into. We fixed this by explicitly waiting until the subnet changes were done, using depends_on in the VM resource.

Examples of the entries we found in the logs:

Cannot proceed with operation because resource /subscriptions/###/resourceGroups/test/providers/Microsoft.Network/virtualNetworks/test-vnet/subnets/main used by resource /subscriptions/####/resourceGroups/test/providers/Microsoft.Network/networkInterfaces/test-vm-nic is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation.

Cannot proceed with operation because resource /subscriptions/####/resourceGroups/test/providers/Microsoft.Network/virtualNetworks/test-vnet/subnets/main used by resource /subscriptions/####/resourceGroups/test/providers/Microsoft.Network/loadBalancers/test-lb is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation.
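
For readers who want to try the depends_on workaround described in this comment, it might look roughly like the sketch below. This is not the commenter's actual configuration. Every resource name is hypothetical, the choice of azurerm_subnet_network_security_group_association as the resource that updates the subnet is an assumption (any resource that triggers a subnet update would play the same role), and the resource group, subnet, network security group, and network interface it references are assumed to be defined elsewhere in the configuration.

```hcl
# Assumed example of a resource that updates the subnet after it is created.
# Attaching an NSG is one common source of a PutSubnetOperation.
resource "azurerm_subnet_network_security_group_association" "main" {
  subnet_id                 = azurerm_subnet.main.id
  network_security_group_id = azurerm_network_security_group.main.id
}

# Hypothetical VM; the size, image, and credentials are placeholders.
resource "azurerm_linux_virtual_machine" "test_vm" {
  name                  = "test-vm"
  resource_group_name   = azurerm_resource_group.test.name
  location              = azurerm_resource_group.test.location
  size                  = "Standard_B2s"
  admin_username        = "azureuser"
  network_interface_ids = [azurerm_network_interface.test_vm_nic.id]

  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }

  # Without this, Terraform only knows the VM depends on the NIC, so it may
  # create the VM while the subnet is still in the Updating state.
  depends_on = [
    azurerm_subnet_network_security_group_association.main,
  ]
}
```

The explicit dependency only orders operations inside a single Terraform run. It would not help when the RetryableError comes from a transient platform problem rather than a concurrent subnet update.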

damianvandoom commented 1 month ago

I want to add that this issue isn't unique to Terraform.

I deploy via Bicep and have encountered this issue several times.

hashicorp / terraform-provider-azurerm