Open luisfeliz79 opened 5 months ago
we recently updated provisionState for 2023-11-01 API. This likely caused the timeout. Can someone at hashicorp look into this?
@MasterRyd3l would you mind elaborating on how the ProvisioningState
has changed? Checking the API Definition (and the log above) the values for ProvisioningState
look pretty standard - however per the debug log above it appears the API is no longer returning a 404 once this resource has been deleted, which would be an issue?
because delete has become long running. you would need to poll the the location header that gets returned from DELETE call to get either a 200 or 404
@MasterRyd3l FWIW we're already polling on the LRO, so it appears there's potentially something else needed here, is the service using a custom value for the provisioningState
field?
I think our implementation is standard. azure portal is able to recognize our header and poll successfully. Do you have any logs? Any logs from the DeleteAndThenPoll method would help. Or can you point me to the source of that method?
@MasterRyd3l checking the HTTP responses, what's happening is we:
DELETE /subscriptions/XXX/resourceGroups/acctestrg-240516141437412099/providers/Microsoft.Chaos/experiments/acctestcse-mewns?api-version=2023-11-01
) which returns an LROGET :path: /subscriptions/XXX/providers/Microsoft.Chaos/locations/westeurope/operationStatuses/d8f9d5f8-f842-487e-ab02-d2be26a2e9c7?api-version=2023-11-01&t=XXX&c=XXX
) which returns both a Location header (for the original resource URI from 1 - and "status": "succeeded"
in the API response)GET /subscriptions/XXX/resourceGroups/acctestrg-240516141437412099/providers/Microsoft.Chaos/experiments/acctestcse-mewns?api-version=2023-11-01
) to ensure it's actually deleted - but the API returns a 200 OK (so we have to assume it's still around) - whereas other APIs return a 404. We do this because some APIs return that a resource is deleted before it's fully deleted, so this ensures we only mark the resource as deleted when it's actually deleted in Azure)(all of this is done automatically by DeleteThenPoll
)
We ultimately time out since the API continues returning a 200 OK for a deleted resource, rather than a 404 (like other APIs do) which we're polling for. As such this looks to be an API bug where the API is returning 200 rather than a 404 when the resource no longer exists? From our side the resource is gone at the highlighted request (the one above deletes the resource), so provided the API returns a 404 this should work as expected:
I see, thanks. I get the picture now. We'll need to follow up with ARM to understand the expected behavior. Beware though what other RPs behave is not necessarily always the correct way. ARM update their guidance all the time. These behavior changes frequently.
The resource provider can return 200 (OK) or 204 (NoContent) to indicate that the operation completed successfully. A 200 (OK) should be returned if the object exists and was deleted successfully; and a 204 (NoContent) should be used if the resource does not exist and the request is well formed. 202 (Accepted) can be returned to indicate that the operation will complete asynchronously If the resource group does not exist, 404 (NotFound) will be returned by the proxy layer and will not reach the resource provider. 412 (PreconditionFailed) and other normal REST codes are acceptable as long as they match the REST guidelines.
The API flow should be to Respond to the initial DELETE request with a 202 Accepted The response headers MUST include a Location header that points to a URL where the ongoing operation can be monitored Optional: The response headers may include an Azure-AsyncOperation header pointing to an Operation resource If a provisioningState property is used for the resource, it MUST transition to a non-terminal state like "Deleting" If the DELETE completes successfully, the URL that was returned in the Location header MUST now return a 200 OK or 204 NoContent to indicate success and the resource MUST disappear.
I am pretty sure we are implementing to the new standard. I think on your side you need to start anticipating both pattern as more and more team will likely migrate to the new pattern in newer version of their API as well
@MasterRyd3l thanks for the updated information, AFAIK we do support both of those scenarios, but per the last component there:
the resource MUST disappear
Per my comment above the resource is deleted (as confirmed by polling on the LRO) at the highlighted (red) line, however this is continuing to kick around - as such I don't believe the API is correctly implementing this behaviour?
Based on the guidance in your comment, once a resource has been deleted (that is either via a LRO/polling on the Location header, or as a regular delete) - doing a GET on the resource that's been deleted should return a 404, whereas right now the API is returning a 200 OK (with no body).
ah I see what you mean. I misread your previous post. I verified the bug. We'll work on a fix. I'll post back when it's resolved
Awesome, thanks @MasterRyd3l :)
Hi @tombuildsstuff, the fix is expected to complete on Thursday 6/13
Hi @tombuildsstuff, the fix had been rolled out. Can you please confirm if this scenario in question is fixed on your end or not?
Is there an existing issue for this?
Community Note
Terraform Version
1.75
AzureRM Provider Version
3.97.1
Affected Resource(s)/Data Source(s)
azurerm_chaos_studio_experiment
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
The experiment is destroyed and the terraform destroy operation completes successfully.
Actual Behaviour
the terraform destroy operation runs for 30 minutes, then throws this error:
azurerm_chaos_studio_experiment.example: Still destroying... [id=/subscriptions/e31e07c8-2d2c-4c74-9886-...ts/chaos-azurem-test-simple-experiment, 29m51s elapsed] ╷ │ Error: deleting Experiment (Subscription: "xxxxxxxxxxxxxxxxx" │ Resource Group Name: "chaos-azurem-test-rg" │ Experiment Name: "chaos-azurem-test-simple-experiment"): polling after Delete: context deadline exceeded │ │ deleting Experiment (Subscription: "xxxxxxxxxxxxxxxxx" │ Resource Group Name: "chaos-azurem-test-rg" │ Experiment Name: "chaos-azurem-test-simple-experiment"): polling after Delete: context deadline exceeded ╵
Steps to Reproduce
terraform init terraform plan -out my.plan terraform apply my.plan
terraform destroy --auto-approve
Important Factoids
No response
References
If I use the Azure Portal or make an Azure REST API call directly to delete the experiment, using the specific API version below, it experiment deletes properly.
Example:
powershell
$experiment_id='/subscriptions/xxxx-2d2c-xxxx-xxxx-e6fxxxx646/resourceGroups/chaos-azurem-test-rg/providers/Microsoft.Chaos/experiments/chaos-azurem-test-simple-experiment' az rest --method DELETE --uri "https://management.azure.com$($experiment_id)?api-version=2023-04-15-preview"