hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0
4.51k stars 4.6k forks source link

An error was thrown even azurerm_log_analytics_workspace was successfully deleted on the second try #19086

Open thllxb opened 1 year ago

thllxb commented 1 year ago

Is there an existing issue for this?

Community Note

Terraform Version

1.2.1

AzureRM Provider Version

3.28.0

Affected Resource(s)/Data Source(s)

azurerm_log_analytics_workspace

Terraform Configuration Files

resource "azurerm_log_analytics_workspace" "aks" {
  name                = "tf-azure-aks-a34d7170-ci-TestAzureRBAC-eastus-law"
  location            = "eastus"
  resource_group_name = "tf-azure-aks-a34d7170-ci-TestAzureRBAC"
}

Debug Output/Panic Output

https://gist.github.com/thllxb/d095dec20139a99b0fbbb1dce0765303

Expected Behaviour

azurerm_log_analytics_workspace gets deleted by a retry, after the initial 504 gateway error, without throwing an error.

Actual Behaviour

azurerm_log_analytics_workspace got deleted by a retry, after the initial 504 gateway error, and STILL threw the 504 error.

Error: issuing AzureRM delete request for Log Analytics Workspaces 'tf-azure-aks-a34d7170-ci-TestAzureRBAC-eastus-law': performing Delete: workspaces.WorkspacesClient#Delete: Failure sending request: StatusCode=0 -- Original Error: Code="Failed" Message="The async operation failed."

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

No response

References

No response

ziyeqf commented 1 year ago

Hi @thllxb, thanks for opening the issue.

Does "retry" mean you re-runed terraform destroy?

In the log file, it stopped when the delete operation of azurerm_log_analytics_workspace failed, and the error is from the Azure service backend. I assume a service fluctuation and retry later would be helpful.

thllxb commented 1 year ago

@ziyeqf What I meant by retry is terraform retry the delete automatically after the first 504 error. From the debug log, you can see the first delete happened on line1 with the 504 error respond on line152. Then a second delete request was sent on line199 with a 404 respond on line212 confirming the resource was already gone. That looks like a retry to me. By now terraform was performing well. What I think the issue is, after the second delete request got a 404 respond, it should just be done, but instead, terraform threw an error on line235 complaining The async operation failed which I assume was from the 504 error. That is what needs to be fixed.

ziyeqf commented 1 year ago

@thllxb Thank you for reply. And yes, you are right, it retied after the first request returned 504. Then, it failed because the 404 response. the DELETE request will return 200/202 normally. I assume the first request successfully deleted the resource although it returned 504, then the second request failed. In this case manually re-running terraform destroy works because it will refresh the state file.

thllxb commented 1 year ago

@ziyeqf I agree that a second 'tf destroy' would work fine, but actually we don't have to. The 404 response is already a confirmation of the deletion even though it is a error code. Can we consider 404 acceptable for a delete request?

ziyeqf commented 1 year ago

@thllxb Yes..I think for the delete operation in this scenario 404 is an acceptable response, but there is a little difference between 404 "resource not found" and 200 "delete successfully".

If we take 404 as success, then it might ignore the difference between the local state file and the real cloud state in other scenarios. I don't think the difference should be handled by a 404-returned delete operation.

thllxb commented 1 year ago

@ziyeqf I'm not following what you mean exactly. There is nothing in the cloud after the 404. Why do we still care about remote state?

manicminer commented 1 year ago

@thllxb Thanks for reporting this. It looks like the cause here is a broken API. Unfortunately there isn't much we can do about the API returning 504 whilst actually performing the delete request. We should correctly acknowledge a 404 but whenever a 5xx error is received we can't make assumptions about whether the request was processed or not.

thllxb commented 1 year ago

@manicminer

whenever a 5xx error is received we can't make assumptions about whether the request was processed or not.

You are correct. 5xx should not mean any assumption but be followed with a retry which what it is already doing.

We should correctly acknowledge a 404

You are correct. That's exactly my ask in this ticket.

manicminer commented 1 year ago

Thanks for clarifying. Given that the StatusCode embedded in that error is 0 (zero) and not 404, it's likely that go-autorest is not handling this correctly and we should be able to improve our handling here once we moved moved away from go-autorest onto a different transport SDK (in the near future).