Open mwardbopp opened 1 year ago
Hi @mwardbopp, thanks for opening the issue! The provider usually relies on the retry logic in the SDK itself and does not add retries of its own. The debug log does not seem to include the error detail; would you mind sharing the error code and error detail you see when it fails? We may be able to add some retry logic based on the use case.
azurerm_network_interface_backend_address_pool_association.f5vm01: Creating...
module.bigip2.azurerm_virtual_machine_extension.vmext: Still creating... [30s elapsed]
azurerm_network_interface_backend_address_pool_association.f5vm01: Creation complete after 5s [id=/subscriptions/c8bd4483-a1a2-47c4-acc8-4a49fbf180f3/resourceGroups/mydemo456-rg-6d99/providers/Microsoft.Network/networkInterfaces/mydemo456-6c9e-ext-nic-public-0/ipConfigurations/mydemo456-6c9e-secondary-ext-public-ip-0|/subscriptions/c8bd4483-a1a2-47c4-acc8-4a49fbf180f3/resourceGroups/mydemo456-rg-6d99/providers/Microsoft.Network/loadBalancers/mydemo456-lb-6d99/backendAddressPools/BackendPool1]
module.bigip.azurerm_virtual_machine_extension.vmext: Still creating... [10s elapsed]
module.bigip.azurerm_virtual_machine_extension.vmext: Still creating... [20s elapsed]
module.bigip.azurerm_virtual_machine_extension.vmext: Still creating... [30s elapsed]
╷
│ Error: Code="RetryableError" Message="A retryable error occurred."
│
│ with module.bigip.azurerm_virtual_machine_extension.vmext,
│ on .terraform/modules/bigip/main.tf line 490, in resource "azurerm_virtual_machine_extension" "vmext":
│ 490: resource "azurerm_virtual_machine_extension" "vmext" {
│
╵
╷
│ Error: Code="RetryableError" Message="A retryable error occurred."
│
│ with module.bigip2.azurerm_virtual_machine_extension.vmext,
│ on .terraform/modules/bigip2/main.tf line 490, in resource "azurerm_virtual_machine_extension" "vmext":
│ 490: resource "azurerm_virtual_machine_extension" "vmext" {
│
The challenge is that the "extension resource" is created successfully, but Terraform only creates the state object if the extension's provisioning status is successful.
The resource should be created as soon as a 200/201 response is received from the REST API. If the wait for provisioning then ends in a Provisioning Failed status, the TF resource should be flagged as tainted and stored in the state file as such. On a subsequent apply, the failed extension would then be removed and redeployed.
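To make the proposal concrete, here is a minimal sketch of the decision it describes, written in Python for brevity rather than the provider's Go. The names `StateAction` and `state_action` are hypothetical and are not part of Terraform or the AzureRM provider; the status strings "Succeeded" and "Failed" are assumed from the API response shown in this thread.

```python
from enum import Enum


class StateAction(Enum):
    """Hypothetical outcomes for the Terraform state entry."""
    NOT_RECORDED = "not recorded"
    RECORDED = "recorded"
    RECORDED_TAINTED = "recorded as tainted"


def state_action(create_status_code: int, provisioning_status: str) -> StateAction:
    """Decide what to store in state once polling has finished.

    create_status_code: HTTP status of the initial create/update request.
    provisioning_status: final status reported by the polling call
        (assumed to be "Succeeded" or "Failed").
    """
    # The REST API did not accept the request, so nothing exists to track.
    if create_status_code not in (200, 201):
        return StateAction.NOT_RECORDED
    # The extension exists and provisioned cleanly.
    if provisioning_status == "Succeeded":
        return StateAction.RECORDED
    # The extension exists but provisioning failed: keep it in state,
    # marked tainted, so the next apply replaces it.
    return StateAction.RECORDED_TAINTED
```

With this behaviour, a 201 followed by a "Failed" poll result would leave a tainted entry in state instead of an orphaned extension that Terraform does not know about.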
We are still seeing this behaviour on provider version 3.116.0. Our module creates 3 VM extension resources, 1 of them passes while the other 2 throw similar error logs:
[2024-08-30T04:54:16.715Z] │ Error: creating/updating Extension (Subscription: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
[2024-08-30T04:54:16.715Z] │ Resource Group Name: "xxxxx"
[2024-08-30T04:54:16.715Z] │ Virtual Machine Name: "az-release-2-7-x-jumpbox"
[2024-08-30T04:54:16.715Z] │ Extension Name: "AzureMonitorLinuxAgent"): polling after CreateOrUpdate: polling failed: the Azure API returned the following error:
[2024-08-30T04:54:16.715Z] │
[2024-08-30T04:54:16.715Z] │ Status: "RetryableError"
[2024-08-30T04:54:16.715Z] │ Code: ""
[2024-08-30T04:54:16.715Z] │ Message: "A retryable error occurred."
[2024-08-30T04:54:16.715Z] │ Activity Id: ""
[2024-08-30T04:54:16.715Z] │
[2024-08-30T04:54:16.716Z] │ ---
[2024-08-30T04:54:16.716Z] │
[2024-08-30T04:54:16.716Z] │ API Response:
[2024-08-30T04:54:16.716Z] │
[2024-08-30T04:54:16.716Z] │ ----[start]----
[2024-08-30T04:54:16.716Z] │ {
[2024-08-30T04:54:16.716Z] │ "startTime": "2024-08-30T04:38:32.1052565+00:00",
[2024-08-30T04:54:16.716Z] │ "endTime": "2024-08-30T04:38:33.1989923+00:00",
[2024-08-30T04:54:16.716Z] │ "status": "Failed",
[2024-08-30T04:54:16.716Z] │ "error": {
[2024-08-30T04:54:16.716Z] │ "code": "RetryableError",
[2024-08-30T04:54:16.716Z] │ "message": "A retryable error occurred."
[2024-08-30T04:54:16.716Z] │ },
[2024-08-30T04:54:16.716Z] │ "name": "c2876690-c24d-4aa1-85f2-fed7f3387c76"
[2024-08-30T04:54:16.716Z] │ }
[2024-08-30T04:54:16.716Z] │ -----[end]-----
[2024-08-30T04:54:16.716Z] │
[2024-08-30T04:54:16.716Z] │
[2024-08-30T04:54:16.716Z] │ with module.jumpbox[0].azurerm_virtual_machine_extension.azure_monitor_linux_agent,
[2024-08-30T04:54:16.716Z] │ on .terraform/modules/jumpbox/jumpbox/monitoring.tf line 3, in resource "azurerm_virtual_machine_extension" "azure_monitor_linux_agent":
[2024-08-30T04:54:16.716Z] │ 3: resource "azurerm_virtual_machine_extension" "azure_monitor_linux_agent" {
Terraform Version
1.3.5
AzureRM Provider Version
3.38.0
Affected Resource(s)/Data Source(s)
azurerm_virtual_machine_extension.vmext
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
A retryable error should be handled by the provider.
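As a rough illustration of what "handled" could mean here, the sketch below retries an operation with exponential backoff when it fails with the `RetryableError` code seen in the logs. It is written in Python for brevity, not the provider's Go, and `AzureOperationError`, `with_retry`, and the default attempt count and delays are all hypothetical choices, not the provider's or the SDK's actual behaviour.

```python
import time
from typing import Callable

# Error code taken from the log output in this issue.
RETRYABLE_CODES = {"RetryableError"}


class AzureOperationError(Exception):
    """Hypothetical error carrying the 'code' field of a failed operation."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def with_retry(
    operation: Callable[[], dict],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run operation, retrying with exponential backoff on retryable codes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except AzureOperationError as err:
            # Give up on non-retryable codes or once attempts are exhausted.
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable only so the backoff can be exercised without waiting; a real implementation would also need to decide whether the failed extension must be deleted before the create is retried.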
Actual Behaviour
Terraform fails either to delete the resources or to report that creation succeeded.
Steps to Reproduce
Run terraform apply or terraform destroy in a busy region such as UKSouth. It happens far less often in WestEurope.