[Bug]: Timeout when destroying `aws_ecs_capacity_provider` from dependency on `aws_ecs_cluster_capacity_providers`

Terraform Core Version

1.3.6

AWS Provider Version

4.40.0

Affected Resource(s)

aws_ecs_capacity_provider
aws_ecs_cluster_capacity_providers

Expected Behavior

TL;DR: Resource should delete properly without a timeout.

The root of the problem seems to be the dependency between aws_ecs_cluster_capacity_providers and aws_ecs_capacity_provider. The following pseudo-code is the most intuitive configuration:

resource "aws_ecs_cluster" "test" {
  name = "test"
}

resource "aws_ecs_cluster_capacity_providers" "test" {
  cluster_name       = aws_ecs_cluster.test.name
  capacity_providers = [
    aws_ecs_capacity_provider.test.name
  ]
}

resource "aws_ecs_capacity_provider" "test" {
  name = "foo"
  ...
}

This configuration works 100% of the time when everything is created or destroyed together. However, it breaks if you ever try to delete only the aws_ecs_capacity_provider. This is particularly relevant if you have a for_each on the capacity provider, something like:

resource "aws_ecs_cluster_capacity_providers" "test" {
  capacity_providers = [ for key, value in aws_ecs_capacity_provider.test : value.name  ]
}

resource "aws_ecs_capacity_provider" "test" {
  for_each = var.set_of_asgs
  name = each.key
  ...
}

In this scenario, if the variables change and an aws_ecs_capacity_provider needs to be deleted during an apply, the run will not complete. It appears that an aws_ecs_capacity_provider cannot be deleted while it is still referenced by an existing aws_ecs_cluster_capacity_providers. However, the dependency graph requires the aws_ecs_capacity_provider deletion to occur first. This leads to deadlock and an eventual timeout.
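To make the trigger concrete, here is a minimal sketch of the kind of variable that would drive the for_each above. The report does not show the definition of var.set_of_asgs, so the type and the values "asg-a" and "asg-b" are assumptions for illustration only.

```hcl
# Hypothetical definition; the original configuration does not show this variable.
variable "set_of_asgs" {
  type = set(string)

  # First apply: two capacity providers are created, one per element.
  default = ["asg-a", "asg-b"]
}
```

Shrinking the value to ["asg-a"] on a later apply would plan a destroy of aws_ecs_capacity_provider.test["asg-b"] together with an in-place update of aws_ecs_cluster_capacity_providers.test, which is the combination that hangs as described above.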

One workaround that fixes the update issue is to remove the attribute reference in the capacity_providers argument of aws_ecs_cluster_capacity_providers and instead just "hardcode" the names. Something like:

resource "aws_ecs_cluster_capacity_providers" "test" {
  # Names come straight from the variable, so there is no reference to aws_ecs_capacity_provider.
  capacity_providers = [ for key, value in var.set_of_asgs : key ]
}

resource "aws_ecs_capacity_provider" "test" {
  for_each = var.set_of_asgs
  name = each.key
  ...
}

This removes the dependency, which means the update to aws_ecs_cluster_capacity_providers and the deletion of aws_ecs_capacity_provider occur "simultaneously". The aws_ecs_cluster_capacity_providers update will complete first, and once the aws_ecs_capacity_provider is no longer referenced, its deletion will complete as well. However, this configuration breaks during initial creation: aws_ecs_cluster_capacity_providers can't reference an aws_ecs_capacity_provider that doesn't exist yet, and without the dependency created by the name reference, the creation order can't be guaranteed.

Actual Behavior

Resource deletion times out after 20 minutes, and the resource still exists in the AWS console.

Relevant Error/Panic Output Snippet

{
  "@level": "error",
  "@message": "Error: error waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)",
  "@module": "terraform.ui",
  "@timestamp": "2023-01-29T00:12:42.663107Z",
  "diagnostic": {
    "severity": "error",
    "summary": "error waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)",
    "detail": ""
  },
  "type": "diagnostic"
}

Terraform Configuration Files

terraform {}
provider "aws" {
  region = "us-east-1"
}

resource "aws_ecs_cluster" "test" {
  name = "test"
}

resource "aws_ecs_cluster_capacity_providers" "test" {
  cluster_name       = aws_ecs_cluster.test.name
  capacity_providers = [
    "FARGATE",
    // Note the comment below. The reference (commented out) works on creation, the hardcoded name ("foo") works on update. Neither works consistently in both scenarios.
    "foo"//aws_ecs_capacity_provider.this.name
  ]
}

resource "aws_ecs_capacity_provider" "this" {
  name = "foo"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.foo.arn
  }
}

data "aws_ssm_parameter" "ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

resource "aws_launch_template" "foo" {
  name_prefix   = "foo"
  image_id      = data.aws_ssm_parameter.ami.value
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "foo" {
  availability_zones = ["us-east-1a"]
  desired_capacity   = 1
  max_size           = 1
  min_size           = 1

  launch_template {
    id      = aws_launch_template.foo.id
    version = "$Latest"
  }
}

Steps to Reproduce

1. Provision an aws_ecs_cluster resource.
2. Provision an aws_ecs_capacity_provider (with the necessary aws_autoscaling_group).
3. Provision an aws_ecs_cluster_capacity_providers resource linking the aws_ecs_capacity_provider from step 2 to the aws_ecs_cluster from step 1 using an attribute reference (capacity_providers = [aws_ecs_capacity_provider.step2.name]).
4. Attempt to delete the aws_ecs_capacity_provider (and, as a consequence, update the aws_ecs_cluster_capacity_providers).
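For the last step, this is roughly what the configuration from "Terraform Configuration Files" looks like once the capacity provider is removed. It is a sketch inferred from the plan summary in the debug output ("1 to change, 3 to destroy"), assuming the launch template and autoscaling group are removed along with the capacity provider.

```hcl
terraform {}
provider "aws" {
  region = "us-east-1"
}

resource "aws_ecs_cluster" "test" {
  name = "test"
}

# Updated in place: "foo" is dropped from the list.
resource "aws_ecs_cluster_capacity_providers" "test" {
  cluster_name       = aws_ecs_cluster.test.name
  capacity_providers = [
    "FARGATE"
  ]
}

# Removed from the configuration, and therefore planned for destruction:
#   aws_ecs_capacity_provider.this
#   aws_autoscaling_group.foo
#   aws_launch_template.foo
```

Applying this change from a state created with the attribute reference is what produces the 20 minute timeout shown in the debug output.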

Debug Output

On an update that tries to delete an `aws_ecs_capacity_provider`, when using the attribute reference:

Plan: 0 to add, 1 to change, 3 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_capacity_provider.this: Destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 10s elapsed]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 20s elapsed]
...
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 19m50s elapsed]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 20m0s elapsed]
╷
│ Error: waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)

When hardcoding:

Plan: 0 to add, 1 to change, 3 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_capacity_provider.this: Destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
aws_ecs_cluster_capacity_providers.test: Modifying... [id=test]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 10s elapsed]
aws_ecs_cluster_capacity_providers.test: Still modifying... [id=test, 10s elapsed]
aws_ecs_cluster_capacity_providers.test: Modifications complete after 12s [id=test]
aws_ecs_capacity_provider.this: Destruction complete after 15s
...
aws_launch_template.foo: Destruction complete after 0s

Apply complete! Resources: 0 added, 1 changed, 3 destroyed.

Notice how the aws_ecs_capacity_provider deletion doesn't complete until after the aws_ecs_cluster_capacity_providers modification completes.

The problem with this second configuration is that on creation there is a race condition between aws_ecs_cluster_capacity_providers and aws_ecs_capacity_provider, and the aws_ecs_capacity_provider must win or the apply errors:

Plan: 3 to add, 1 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_cluster_capacity_providers.test: Modifying... [id=test]
aws_launch_template.foo: Creating...
aws_launch_template.foo: Creation complete after 2s [id=lt-059633ca56306dd70]
aws_autoscaling_group.foo: Creating...
aws_autoscaling_group.foo: Still creating... [10s elapsed]
aws_autoscaling_group.foo: Still creating... [20s elapsed]
aws_autoscaling_group.foo: Still creating... [30s elapsed]
aws_autoscaling_group.foo: Creation complete after 37s [id=terraform-20230129222618772800000003]
aws_ecs_capacity_provider.this: Creating...
aws_ecs_capacity_provider.this: Creation complete after 2s [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
╷
│ Error: error updating ECS Cluster (test) Capacity Providers: InvalidParameterException: The specified capacity provider 'foo' is not in an ACTIVE state. Specify a valid capacity provider and try again.
│
│   with aws_ecs_cluster_capacity_providers.test,
│   on main.tf line 10, in resource "aws_ecs_cluster_capacity_providers" "test":
│   10: resource "aws_ecs_cluster_capacity_providers" "test" {

Notice that it's effectively guaranteed that this race will be lost if the aws_ecs_capacity_provider depends on the ASG and the aws_ecs_cluster_capacity_providers has no dependencies.

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None

hashicorp / terraform-provider-aws