TL;DR: Resource should delete properly without a timeout.
The root of the problem seems to stem from the dependency between the aws_ecs_cluster_capacity_providers and the aws_ecs_capacity_provider resources. The following pseudo-code is the most intuitive configuration:
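A minimal sketch of that intuitive configuration, using an attribute reference (resource and variable names are illustrative):

```hcl
resource "aws_ecs_cluster_capacity_providers" "test" {
  # Attribute reference creates the dependency edge.
  capacity_providers = [aws_ecs_capacity_provider.test.name]
}

resource "aws_ecs_capacity_provider" "test" {
  name = "test"
  ...
}
```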
This configuration works 100% of the time on creation and deletion. However, it breaks if you ever try to only delete the aws_ecs_capacity_provider. This is particularly relevant if you have a for_each on the capacity provider - something like:
```hcl
resource "aws_ecs_cluster_capacity_providers" "test" {
  capacity_providers = [for key, value in aws_ecs_capacity_provider.test : value.name]
}

resource "aws_ecs_capacity_provider" "test" {
  for_each = var.set_of_asgs
  name     = each.key
  ...
}
```
In this scenario, if the variables change and an aws_ecs_capacity_provider needs to be deleted during an apply, the run will not complete. An aws_ecs_capacity_provider cannot be deleted while it is still referenced by an existing aws_ecs_cluster_capacity_providers, yet the dependency graph requires the aws_ecs_capacity_provider deletion to occur first. This leads to a deadlock and an eventual timeout.
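The ordering the ECS API requires (detach first, then delete) can be sketched against a boto3-style ECS client; this is a sketch, the function name is illustrative, and any stub exposing the same two methods works for demonstration:

```python
def detach_then_delete(ecs, cluster_name, provider_name, remaining_providers):
    """Remove the capacity provider from the cluster's attachment first,
    then delete it. `ecs` is a boto3 ECS client (or a stub with the same
    two methods)."""
    # 1. Update the cluster so the capacity provider is no longer referenced.
    ecs.put_cluster_capacity_providers(
        cluster=cluster_name,
        capacityProviders=remaining_providers,
        defaultCapacityProviderStrategy=[],  # assumption: no default strategy in use
    )
    # 2. Only now will ECS accept deletion of the provider itself.
    ecs.delete_capacity_provider(capacityProvider=provider_name)
```

This is exactly the order Terraform's dependency graph inverts in the scenario above, which is why the delete hangs until the 20-minute timeout.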
One workaround that fixes the update issue is to remove the attribute reference in the capacity_providers block of aws_ecs_cluster_capacity_providers and instead just "hardcode" the names. Something like:
```hcl
resource "aws_ecs_cluster_capacity_providers" "test" {
  capacity_providers = [for key, value in var.set_of_asgs : value.name]
}

resource "aws_ecs_capacity_provider" "test" {
  for_each = var.set_of_asgs
  name     = each.key
  ...
}
```
This removes the dependency, which means the update to aws_ecs_cluster_capacity_providers and the deletion of aws_ecs_capacity_provider occur "simultaneously". The aws_ecs_cluster_capacity_providers update completes first, and once the aws_ecs_capacity_provider is no longer referenced, its deletion completes as well. However, this configuration breaks during initial creation: aws_ecs_cluster_capacity_providers can't reference an aws_ecs_capacity_provider that doesn't exist yet, and without the dependency from the name reference, the creation order can't be guaranteed.
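An explicit depends_on could restore the create ordering while keeping the hardcoded names (a sketch, untested), but since depends_on adds back the same graph edge, the destroy-time deadlock presumably returns, so it is not a way out of the dilemma either:

```hcl
resource "aws_ecs_cluster_capacity_providers" "test" {
  capacity_providers = [for key, value in var.set_of_asgs : value.name]

  # Restores create ordering, but likely re-adds the destroy-order edge.
  depends_on = [aws_ecs_capacity_provider.test]
}
```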
Actual Behavior
Resource deletion times out after 20 minutes, and the resource still exists in the AWS console.
Relevant Error/Panic Output Snippet
```json
{
  "@level": "error",
  "@message": "Error: error waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)",
  "@module": "terraform.ui",
  "@timestamp": "2023-01-29T00:12:42.663107Z",
  "diagnostic": {
    "severity": "error",
    "summary": "error waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)",
    "detail": ""
  },
  "type": "diagnostic"
}
```
Terraform Configuration Files
```hcl
terraform {}

provider "aws" {
  region = "us-east-1"
}

resource "aws_ecs_cluster" "test" {
  name = "test"
}

resource "aws_ecs_cluster_capacity_providers" "test" {
  cluster_name = aws_ecs_cluster.test.name
  capacity_providers = [
    "FARGATE",
    // Note the comment below. The reference (commented out) works on creation, the hardcoded name ("foo") works on update. Neither works consistently in both scenarios.
    "foo" //aws_ecs_capacity_provider.this.name
  ]
}

resource "aws_ecs_capacity_provider" "this" {
  name = "foo"
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.foo.arn
  }
}

data "aws_ssm_parameter" "ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

resource "aws_launch_template" "foo" {
  name_prefix   = "foo"
  image_id      = data.aws_ssm_parameter.ami.value
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "foo" {
  availability_zones = ["us-east-1a"]
  desired_capacity   = 1
  max_size           = 1
  min_size           = 1

  launch_template {
    id      = aws_launch_template.foo.id
    version = "$Latest"
  }
}
```
Steps to Reproduce
1. Provision an aws_ecs_cluster resource.
2. Provision an aws_ecs_capacity_provider (with the necessary aws_autoscaling_group).
3. Provision an aws_ecs_cluster_capacity_providers resource linking the aws_ecs_capacity_provider in step 2 to the aws_ecs_cluster in step 1 using an attribute reference (capacity_providers = [aws_ecs_capacity_provider.step2.name]).
4. Attempt to delete the aws_ecs_capacity_provider (and subsequently update the aws_ecs_cluster_capacity_providers).
Debug Output
On an update, when trying to delete an aws_ecs_capacity_provider:
When using the attribute reference:
```
Plan: 0 to add, 1 to change, 3 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_capacity_provider.this: Destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 10s elapsed]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 20s elapsed]
...
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 19m50s elapsed]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 20m0s elapsed]
╷
│ Error: waiting for ECS Capacity Provider (arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)
```
When hardcoding:
```
Plan: 0 to add, 1 to change, 3 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_capacity_provider.this: Destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
aws_ecs_cluster_capacity_providers.test: Modifying... [id=test]
aws_ecs_capacity_provider.this: Still destroying... [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo, 10s elapsed]
aws_ecs_cluster_capacity_providers.test: Still modifying... [id=test, 10s elapsed]
aws_ecs_cluster_capacity_providers.test: Modifications complete after 12s [id=test]
aws_ecs_capacity_provider.this: Destruction complete after 15s
...
aws_launch_template.foo: Destruction complete after 0s

Apply complete! Resources: 0 added, 1 changed, 3 destroyed.
```
Notice how the aws_ecs_capacity_provider deletion doesn't complete until after the aws_ecs_cluster_capacity_providers modification completes.
The problem with this second config is that on creation there is a race condition between aws_ecs_cluster_capacity_providers and aws_ecs_capacity_provider, and the aws_ecs_capacity_provider must win, or it errors:
```
Plan: 3 to add, 1 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_ecs_cluster_capacity_providers.test: Modifying... [id=test]
aws_launch_template.foo: Creating...
aws_launch_template.foo: Creation complete after 2s [id=lt-059633ca56306dd70]
aws_autoscaling_group.foo: Creating...
aws_autoscaling_group.foo: Still creating... [10s elapsed]
aws_autoscaling_group.foo: Still creating... [20s elapsed]
aws_autoscaling_group.foo: Still creating... [30s elapsed]
aws_autoscaling_group.foo: Creation complete after 37s [id=terraform-20230129222618772800000003]
aws_ecs_capacity_provider.this: Creating...
aws_ecs_capacity_provider.this: Creation complete after 2s [id=arn:aws:ecs:us-east-1:000000000000:capacity-provider/foo]
╷
│ Error: error updating ECS Cluster (test) Capacity Providers: InvalidParameterException: The specified capacity provider 'foo' is not in an ACTIVE state. Specify a valid capacity provider and try again.
│
│   with aws_ecs_cluster_capacity_providers.test,
│   on main.tf line 10, in resource "aws_ecs_cluster_capacity_providers" "test":
│   10: resource "aws_ecs_cluster_capacity_providers" "test" {
```
Notice that this race condition is effectively guaranteed to fail, since the aws_ecs_capacity_provider depends on the ASG while the aws_ecs_cluster_capacity_providers has no dependencies.
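On the API side, one mitigation for this race would be polling until the capacity provider reports ACTIVE before attaching it. A minimal sketch against a boto3-style ECS client (function name and polling parameters are illustrative; any stub with the same method works):

```python
import time

def wait_until_active(ecs, provider_name, timeout=300, interval=5):
    """Poll describe_capacity_providers until the named provider reports
    ACTIVE. `ecs` is a boto3 ECS client (or a stub with the same method)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = ecs.describe_capacity_providers(capacityProviders=[provider_name])
        providers = resp.get("capacityProviders", [])
        if providers and providers[0].get("status") == "ACTIVE":
            return True
        time.sleep(interval)
    raise TimeoutError(f"capacity provider {provider_name} never became ACTIVE")
```

Calling put_cluster_capacity_providers only after this returns would avoid the InvalidParameterException shown above.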
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
Volunteering to Work on This Issue
If you are interested in working on this issue, please leave a comment.
If this would be your first contribution, please review the contribution guide.
Terraform Core Version
1.3.6
AWS Provider Version
4.40.0
Affected Resource(s)
aws_ecs_capacity_provider
aws_ecs_cluster_capacity_providers
Expected Behavior
TL;DR: Resource should delete properly without a timeout.
Panic Output
No response
Important Factoids
No response
References
No response
Would you like to implement a fix?
None