hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0
4.52k stars 4.6k forks source link

azurerm_chaos_studio_experiment destroy operation times out #25462

Open luisfeliz79 opened 5 months ago

luisfeliz79 commented 5 months ago

Is there an existing issue for this?

Community Note

Terraform Version

1.75

AzureRM Provider Version

3.97.1

Affected Resource(s)/Data Source(s)

azurerm_chaos_studio_experiment

Terraform Configuration Files

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      version = ">=3.97.1"
    }
  }
}

provider "azurerm" {
  features {}
}

# Define some local variables
locals {
  prefix_name="chaos-azurem-test"
  location="eastus"
}

# Create a resource group 
resource "azurerm_resource_group" "rg" {
  name     = "${local.prefix_name}-rg"
  location = local.location
}

# Create an NSG
resource "azurerm_network_security_group" "test_nsg" { 
    name                        = "chaos-controlled-nsg"
    resource_group_name = azurerm_resource_group.rg.name
    location            = azurerm_resource_group.rg.location
}

# Onboard the NSG as a Chaos Studio target
resource "azurerm_chaos_studio_target" "nsg" {
  location           = azurerm_resource_group.rg.location
  target_resource_id = azurerm_network_security_group.test_nsg.id
  target_type        = "Microsoft-NetworkSecurityGroup"

}

# Create a very simple experiment
# It doesnt actually do anything other to run through a delay
# this is just for illustration purposes for the issue
resource "azurerm_chaos_studio_experiment" "example" {
  location            = azurerm_resource_group.rg.location
  name                = "${local.prefix_name}-simple-experiment"
  resource_group_name = azurerm_resource_group.rg.name

  selectors {
    name                    = "Selector1"    
    chaos_studio_target_ids = [azurerm_chaos_studio_target.nsg.id]
  }

 steps {
    name = "Step 2 - Wait 10 minutes"
    branch {
      name = "Branch 1"
      actions {
        urn           = "urn:csci:microsoft:chaosStudio:timedDelay/1.0"
        # Ref - https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library#sample-json

        action_type   = "delay"
        duration      = "PT10M"  # ISO 8601 time format
      }
    }
  }

}

# Some output information
output experiment_id {
  value = azurerm_chaos_studio_experiment.example.id
}

Debug Output/Panic Output

https://gist.github.com/luisfeliz79/bc85cd104b69810a1f95851c651d463a

Expected Behaviour

The experiment is destroyed and the terraform destroy operation completes successfully.

Actual Behaviour

the terraform destroy operation runs for 30 minutes, then throws this error:

azurerm_chaos_studio_experiment.example: Still destroying... [id=/subscriptions/e31e07c8-2d2c-4c74-9886-...ts/chaos-azurem-test-simple-experiment, 29m51s elapsed] ╷ │ Error: deleting Experiment (Subscription: "xxxxxxxxxxxxxxxxx" │ Resource Group Name: "chaos-azurem-test-rg" │ Experiment Name: "chaos-azurem-test-simple-experiment"): polling after Delete: context deadline exceeded │ │ deleting Experiment (Subscription: "xxxxxxxxxxxxxxxxx" │ Resource Group Name: "chaos-azurem-test-rg" │ Experiment Name: "chaos-azurem-test-simple-experiment"): polling after Delete: context deadline exceeded ╵

Steps to Reproduce

terraform init terraform plan -out my.plan terraform apply my.plan

terraform destroy --auto-approve

Important Factoids

No response

References

If I use the Azure Portal or make an Azure REST API call directly to delete the experiment, using the specific API version below, it experiment deletes properly.

Example:

powershell

$experiment_id='/subscriptions/xxxx-2d2c-xxxx-xxxx-e6fxxxx646/resourceGroups/chaos-azurem-test-rg/providers/Microsoft.Chaos/experiments/chaos-azurem-test-simple-experiment' az rest --method DELETE --uri "https://management.azure.com$($experiment_id)?api-version=2023-04-15-preview"

MasterRyd3l commented 5 months ago

we recently updated provisionState for 2023-11-01 API. This likely caused the timeout. Can someone at hashicorp look into this?

tombuildsstuff commented 5 months ago

@MasterRyd3l would you mind elaborating on how the ProvisioningState has changed? Checking the API Definition (and the log above) the values for ProvisioningState look pretty standard - however per the debug log above it appears the API is no longer returning a 404 once this resource has been deleted, which would be an issue?

MasterRyd3l commented 4 months ago

because delete has become long running. you would need to poll the the location header that gets returned from DELETE call to get either a 200 or 404

tombuildsstuff commented 4 months ago

@MasterRyd3l FWIW we're already polling on the LRO, so it appears there's potentially something else needed here, is the service using a custom value for the provisioningState field?

MasterRyd3l commented 4 months ago

I think our implementation is standard. azure portal is able to recognize our header and poll successfully. Do you have any logs? Any logs from the DeleteAndThenPoll method would help. Or can you point me to the source of that method?

tombuildsstuff commented 4 months ago

@MasterRyd3l checking the HTTP responses, what's happening is we:

  1. Call the DELETE endpoint (i.e. DELETE /subscriptions/XXX/resourceGroups/acctestrg-240516141437412099/providers/Microsoft.Chaos/experiments/acctestcse-mewns?api-version=2023-11-01) which returns an LRO
  2. We then query the LRO (i.e. GET :path: /subscriptions/XXX/providers/Microsoft.Chaos/locations/westeurope/operationStatuses/d8f9d5f8-f842-487e-ab02-d2be26a2e9c7?api-version=2023-11-01&t=XXX&c=XXX) which returns both a Location header (for the original resource URI from 1 - and "status": "succeeded" in the API response)
  3. We then query the original Resource URI (i.e. GET /subscriptions/XXX/resourceGroups/acctestrg-240516141437412099/providers/Microsoft.Chaos/experiments/acctestcse-mewns?api-version=2023-11-01) to ensure it's actually deleted - but the API returns a 200 OK (so we have to assume it's still around) - whereas other APIs return a 404. We do this because some APIs return that a resource is deleted before it's fully deleted, so this ensures we only mark the resource as deleted when it's actually deleted in Azure)

(all of this is done automatically by DeleteThenPoll)

We ultimately time out since the API continues returning a 200 OK for a deleted resource, rather than a 404 (like other APIs do) which we're polling for. As such this looks to be an API bug where the API is returning 200 rather than a 404 when the resource no longer exists? From our side the resource is gone at the highlighted request (the one above deletes the resource), so provided the API returns a 404 this should work as expected:

Screenshot 2024-05-16 at 14 27 34

MasterRyd3l commented 4 months ago

I see, thanks. I get the picture now. We'll need to follow up with ARM to understand the expected behavior. Beware though what other RPs behave is not necessarily always the correct way. ARM update their guidance all the time. These behavior changes frequently.

MasterRyd3l commented 4 months ago

here is the guidance for a regular delete (non-lro)

The resource provider can return 200 (OK) or 204 (NoContent) to indicate that the operation completed successfully. A 200 (OK) should be returned if the object exists and was deleted successfully; and a 204 (NoContent) should be used if the resource does not exist and the request is well formed. 202 (Accepted) can be returned to indicate that the operation will complete asynchronously If the resource group does not exist, 404 (NotFound) will be returned by the proxy layer and will not reach the resource provider. 412 (PreconditionFailed) and other normal REST codes are acceptable as long as they match the REST guidelines.

now here is the guidance for async delete (lro)

The API flow should be to Respond to the initial DELETE request with a 202 Accepted The response headers MUST include a Location header that points to a URL where the ongoing operation can be monitored Optional: The response headers may include an Azure-AsyncOperation header pointing to an Operation resource If a provisioningState property is used for the resource, it MUST transition to a non-terminal state like "Deleting" If the DELETE completes successfully, the URL that was returned in the Location header MUST now return a 200 OK or 204 NoContent to indicate success and the resource MUST disappear.

I am pretty sure we are implementing to the new standard. I think on your side you need to start anticipating both pattern as more and more team will likely migrate to the new pattern in newer version of their API as well

tombuildsstuff commented 4 months ago

@MasterRyd3l thanks for the updated information, AFAIK we do support both of those scenarios, but per the last component there:

the resource MUST disappear

Per my comment above the resource is deleted (as confirmed by polling on the LRO) at the highlighted (red) line, however this is continuing to kick around - as such I don't believe the API is correctly implementing this behaviour?

Based on the guidance in your comment, once a resource has been deleted (that is either via a LRO/polling on the Location header, or as a regular delete) - doing a GET on the resource that's been deleted should return a 404, whereas right now the API is returning a 200 OK (with no body).

MasterRyd3l commented 4 months ago

ah I see what you mean. I misread your previous post. I verified the bug. We'll work on a fix. I'll post back when it's resolved

tombuildsstuff commented 3 months ago

Awesome, thanks @MasterRyd3l :)

MasterRyd3l commented 3 months ago

Hi @tombuildsstuff, the fix is expected to complete on Thursday 6/13

MasterRyd3l commented 2 months ago

Hi @tombuildsstuff, the fix had been rolled out. Can you please confirm if this scenario in question is fixed on your end or not?