jameswilsongrant opened this issue 3 years ago
@jameswilsongrant Sorry this isn't working right. To be clear: when you say in step 1 to create a cluster in Terraform, is step 2 performed in the same state as step 1? If they are in the same state, Terraform's dependency graph should make sure that the deletes for the ASG, capacity provider, and template happen in the right order.
There is an acceptance test that mirrors your scenario, called `TestAccAWSEcsCluster_SingleCapacityProvider`. I've pieced together the configuration for the test below. The test passes, meaning Terraform was able to create and destroy everything without timing out. If you see differences between the test configuration and yours that change how this behaves, let us know.
data "aws_ami" "test" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn-ami-hvm-*-x86_64-gp2"]
}
}
data "aws_availability_zones" "available" {
state = "available"
filter {
name = "opt-in-status"
values = ["opt-in-not-required"]
}
}
resource "aws_launch_template" "test" {
image_id = data.aws_ami.test.id
instance_type = "t3.micro"
name = "yakluster"
}
resource "aws_autoscaling_group" "test" {
availability_zones = data.aws_availability_zones.available.names
desired_capacity = 0
max_size = 0
min_size = 0
name = "yakluster"
launch_template {
id = aws_launch_template.test.id
}
tags = [
{
key = "foo"
value = "bar"
propagate_at_launch = true
},
]
}
resource "aws_ecs_capacity_provider" "test" {
name = "yakluster"
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.test.arn
}
}
resource "aws_ecs_cluster" "test" {
name = "yakluster"
capacity_providers = [aws_ecs_capacity_provider.test.name]
default_capacity_provider_strategy {
base = 1
capacity_provider = aws_ecs_capacity_provider.test.name
weight = 1
}
}
Hello!

I'm facing the same issue and cannot complete a `terraform destroy` without manually terminating the EC2 instances that are still present in the ASG. In my setup, running `terraform destroy` 3 times in a row leads to a state in which the ASG tries to reach a desired count of 0 without ever terminating its instances (on the first and second attempts it does not even try that). At that point, manually terminating the EC2 instances makes the destroy succeed, more or less: it still leaves behind the capacity provider, which can then be destroyed in a 4th attempt.
In my case, I also have `managed_termination_protection` enabled in the `aws_ecs_capacity_provider`, which requires turning on `protect_from_scale_in` in the `aws_autoscaling_group`. I guess this is what prevents the ASG from reaching the desired count of 0 when trying to destroy. I checked, and during the third destruction attempt, at a point where the cluster was empty (no more tasks or services), the EC2 instance still had `ProtectedFromScaleIn` set to `true`. So the ASG behaviour makes sense.
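To illustrate the coupling, here's a minimal sketch (not my exact configuration; the resource names follow my logs, and it reuses the data sources from the test configuration above):

```hcl
resource "aws_launch_configuration" "perftracker_launch_configuration" {
  image_id      = data.aws_ami.test.id
  instance_type = "t3.micro"
}

resource "aws_autoscaling_group" "asg" {
  availability_zones   = data.aws_availability_zones.available.names
  launch_configuration = aws_launch_configuration.perftracker_launch_configuration.name
  min_size             = 0
  max_size             = 1
  desired_capacity     = 1

  # Required by ECS when the capacity provider below enables managed
  # termination protection.
  protect_from_scale_in = true
}

resource "aws_ecs_capacity_provider" "perftracker" {
  name = "capacity_provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.asg.arn

    # Managed termination protection also requires managed scaling.
    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100
    }

    # With this enabled, instances keep ProtectedFromScaleIn = true until ECS
    # itself lifts the protection, so the ASG cannot scale in to 0 on its own
    # during a destroy.
    managed_termination_protection = "ENABLED"
  }
}
```

(You can check the flag on a live instance with `aws autoscaling describe-auto-scaling-instances`.)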
I can provide trace logs of four consecutive attempts to destroy my setup if needed. For the time being, here are the relevant extracts.

From the first `terraform destroy`, after the last `ecs_service` is destroyed:
```
2021/04/21 14:05:57 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_ecs_capacity_provider.perftracker (destroy)" is waiting for "aws_ecs_cluster.ecs_cluster (destroy)"
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_launch_configuration.perftracker_launch_configuration (destroy)" is waiting for "aws_autoscaling_group.asg (destroy)"
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_autoscaling_group.asg (destroy)" is waiting for "aws_ecs_capacity_provider.perftracker (destroy)"
```
There is no "applying the planned Delete change" for the capacity provider or the ASG.
On the second attempt, it applied the planned Delete change to the `ecs_cluster` and the `ecs_capacity_provider`:
```
2021/04/21 14:17:25 [DEBUG] aws_ecs_capacity_provider.perftracker: applying the planned Delete change
2021/04/21 14:17:25 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_capacity_provider.perftracker: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:capacity-provider/capacity_provider_dev-odormond]
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 14:17:27 [TRACE] dag/walk: vertex "module.vpc.aws_subnet.private[1] (destroy)" is waiting for "aws_autoscaling_group.asg (destroy)"
2021/04/21 14:17:27 [TRACE] dag/walk: vertex "aws_autoscaling_group.asg (destroy)" is waiting for "aws_ecs_capacity_provider.perftracker (destroy)"
```
On the third attempt, it applied the planned Delete change to the cluster, the capacity provider, and the ASG:
```
2021/04/21 16:06:38 [DEBUG] aws_ecs_capacity_provider.perftracker: applying the planned Delete change
2021/04/21 16:06:38 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_capacity_provider.perftracker: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:capacity-provider/capacity_provider_dev-odormond]
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 16:06:38 [DEBUG] aws_autoscaling_group.asg: applying the planned Delete change
aws_autoscaling_group.asg: Destroying... [id=tf-asg-20210421114236113800000011]
```
Again, it only succeeded because I manually terminated the EC2 instances that were left.
Regarding the acceptance test shown above, IMHO it's too naive to expose the problem: with `min_size`, `max_size`, and `desired_capacity` all set to 0, the ASG never launches a single instance, and it doesn't use `managed_termination_protection` either.
What about the possibility of a `force_destroy` flag that would set the desired capacity to 0 on any capacity provider ASGs? This is similar to how `aws_s3_bucket` has a corresponding flag, with which the provider deletes any objects in the bucket before issuing the delete-bucket API call.
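Hypothetically, usage could look like this (`force_destroy` does not exist on this resource today; this is only a sketch of the proposal, reusing the resource names from the test configuration above):

```hcl
resource "aws_ecs_capacity_provider" "test" {
  name = "yakluster"

  # Proposed flag, not implemented: on destroy, the provider would scale the
  # backing ASG to 0/0/0 and wait for its instances to terminate, much like
  # aws_s3_bucket's force_destroy empties a bucket before deleting it.
  force_destroy = true

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.test.arn
  }
}
```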
We've had to put together an on-destroy provisioner (Stack Overflow is your friend) made of shell code and AWS CLI calls to find our way from the cluster, to the capacity provider, to the instances. We set the capacity of the ASG to 0/0/0 and terminate the instances (among other things), as sketched below.
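For anyone who needs a starting point, here's a minimal sketch of that kind of cleanup using a `null_resource` with a `local-exec` destroy provisioner (the ASG name comes from the test configuration above; the exact CLI steps will differ per setup):

```hcl
resource "null_resource" "asg_cleanup" {
  # Destroy-time provisioners may only reference self, so capture the ASG
  # name in a trigger while the ASG still exists.
  triggers = {
    asg_name = aws_autoscaling_group.test.name
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      # Scale the ASG to zero so no replacement instances get launched.
      aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "${self.triggers.asg_name}" \
        --min-size 0 --max-size 0 --desired-capacity 0

      # Remove scale-in protection from any remaining instances, then
      # terminate them so the capacity provider delete can proceed.
      ids=$(aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "${self.triggers.asg_name}" \
        --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)
      if [ -n "$ids" ] && [ "$ids" != "None" ]; then
        aws autoscaling set-instance-protection \
          --auto-scaling-group-name "${self.triggers.asg_name}" \
          --instance-ids $ids --no-protected-from-scale-in
        aws ec2 terminate-instances --instance-ids $ids
      fi
    EOT
  }
}
```

Because the trigger references the ASG, Terraform destroys this `null_resource` before the ASG, so the cleanup runs first.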
+1
This issue has been present since 2019: https://github.com/hashicorp/terraform-provider-aws/issues/4852
It was then moved to https://github.com/hashicorp/terraform-provider-aws/issues/11409, which is another issue in its own right and hasn't solved the problem.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.
If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!
There's an ordering issue with resource destruction when ECS has EC2 capacity:

1. Create an ECS cluster with EC2 capacity (launch template, ASG, capacity provider) in Terraform.
2. Run `terraform destroy`.

You'll end up stuck: all the tasks/services are deleted, but the capacity provider and ASG will still be there, along with whatever EC2 instances the ASG is maintaining attached to the cluster. This will then sit until the timeout. Deleting the ASG at any point will move it along.
I'm guessing there has to be inspection from ecs -> capacity provider -> asg to destroy this in the right order (destroying the asg will terminate the instances as well). Seems like a bit of an edge case though.