hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

aws_ecs_cluster deletion will not complete if there is registered ec2 capacity #18849

Status: Open

jameswilsongrant commented 3 years ago

There's an ordering issue with resource destruction when an ECS cluster has registered EC2 capacity:

  1. Create an aws_ecs_cluster in terraform
  2. Create the launch template, asg, and capacity provider to connect them to the existing ecs cluster
  3. terraform apply
  4. terraform destroy

You'll end up in the state below: all tasks/services are deleted, but the capacity provider and ASG are still there, and whatever EC2 instances the ASG created remain registered with the cluster. The cluster destroy then sits until it times out. Deleting the ASG at any point moves it along.

aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 6m40s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 6m50s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m0s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m10s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m20s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m30s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m40s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 7m50s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 8m0s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 8m10s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 8m20s elapsed]
aws_ecs_cluster.ecs: Still destroying... [id=arn:aws:ecs:us-west-2:snipped:cluster/demo, 8m30s elapsed]
* I deleted the capacity provider and ASG providing ec2 capacity here manually *
aws_ecs_cluster.ecs: Destruction complete after 8m30s

I'm guessing the provider would have to walk from the ECS cluster to the capacity provider to the ASG to destroy these in the right order (destroying the ASG terminates the instances as well). Seems like a bit of an edge case though.

YakDriver commented 3 years ago

@jameswilsongrant Sorry this isn't working right. To be clear: when you create the cluster in step 1, are the step 2 resources managed in the same Terraform state? If they are, Terraform's dependency graph should ensure that delete for the ASG, capacity provider, and launch template is called first.

There is an acceptance test that mirrors your scenario, TestAccAWSEcsCluster_SingleCapacityProvider. I've pieced together the configuration for the test below. The test passes, meaning Terraform was able to create and destroy everything without timing out. If you see differences between the test configuration and yours that would change this behavior, let us know.

data "aws_ami" "test" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn-ami-hvm-*-x86_64-gp2"]
  }
}

data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "aws_launch_template" "test" {
  image_id      = data.aws_ami.test.id
  instance_type = "t3.micro"
  name          = "yakluster"
}

resource "aws_autoscaling_group" "test" {
  availability_zones = data.aws_availability_zones.available.names
  desired_capacity   = 0
  max_size           = 0
  min_size           = 0
  name               = "yakluster"

  launch_template {
    id = aws_launch_template.test.id
  }

  tags = [
    {
      key                 = "foo"
      value               = "bar"
      propagate_at_launch = true
    },
  ]
}

resource "aws_ecs_capacity_provider" "test" {
  name = "yakluster"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.test.arn
  }
}

resource "aws_ecs_cluster" "test" {
  name = "yakluster"

  capacity_providers = [aws_ecs_capacity_provider.test.name]

  default_capacity_provider_strategy {
    base              = 1
    capacity_provider = aws_ecs_capacity_provider.test.name
    weight            = 1
  }
}

odormond commented 3 years ago

Hello!

I'm facing the same issue and cannot complete a terraform destroy without manually terminating the EC2 instances still present in the ASG. In my setup, running terraform destroy three times in a row leads to a state in which the ASG tries to reach a desired count of 0 without ever terminating its instances; on the first and second attempts it does not even try. At that point, terminating the EC2 instance manually lets the destroy succeed (more or less, as it still leaves behind the capacity provider, which can be destroyed in a fourth attempt).

In my case, I also have managed_termination_protection enabled in the aws_ecs_capacity_provider, which requires turning on protect_from_scale_in in the aws_autoscaling_group. I guess this is what prevents the ASG from reaching the desired count of 0 during destroy. I checked: during the third destruction attempt, at a point where the cluster was empty (no more tasks or services), the EC2 instance still had ProtectedFromScaleIn set to true, so the ASG behaviour makes sense.
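For reference, the failing combination looks roughly like this (a sketch only; resource names are taken from the trace logs below, the AZ and sizes are assumptions, and other required attributes are elided):

```hcl
resource "aws_autoscaling_group" "asg" {
  name                 = "perftracker-asg" # assumed name
  availability_zones   = ["eu-west-1a"]
  min_size             = 0
  max_size             = 1
  desired_capacity     = 1
  launch_configuration = aws_launch_configuration.perftracker_launch_configuration.name

  # Required when the capacity provider enables managed termination protection;
  # this is what keeps the instances alive while the ASG scales in during destroy.
  protect_from_scale_in = true
}

resource "aws_ecs_capacity_provider" "perftracker" {
  name = "capacity_provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.asg.arn
    managed_termination_protection = "ENABLED"

    # Managed scaling must be enabled for managed termination protection to apply.
    managed_scaling {
      status = "ENABLED"
    }
  }
}
```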

I can provide trace logs of four consecutive destroy attempts if needed. For now, here are the relevant extracts.

From the first terraform destroy, after the last ecs_service is destroyed:

2021/04/21 14:05:57 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_ecs_capacity_provider.perftracker (destroy)" is waiting for "aws_ecs_cluster.ecs_cluster (destroy)"
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_launch_configuration.perftracker_launch_configuration (destroy)" is waiting for "aws_autoscaling_group.asg (destroy)"
2021/04/21 14:05:57 [TRACE] dag/walk: vertex "aws_autoscaling_group.asg (destroy)" is waiting for "aws_ecs_capacity_provider.perftracker (destroy)"

There is no "applying the planned Delete change" for the capacity provider or the ASG.

On the second attempt, it applied the planned Delete change to the ecs_cluster and the ecs_capacity_provider:

2021/04/21 14:17:25 [DEBUG] aws_ecs_capacity_provider.perftracker: applying the planned Delete change
2021/04/21 14:17:25 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_capacity_provider.perftracker: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:capacity-provider/capacity_provider_dev-odormond]
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 14:17:27 [TRACE] dag/walk: vertex "module.vpc.aws_subnet.private[1] (destroy)" is waiting for "aws_autoscaling_group.asg (destroy)"
2021/04/21 14:17:27 [TRACE] dag/walk: vertex "aws_autoscaling_group.asg (destroy)" is waiting for "aws_ecs_capacity_provider.perftracker (destroy)"

On the third attempt, it applied the planned Delete change to the cluster, the capacity provider, and the ASG:

2021/04/21 16:06:38 [DEBUG] aws_ecs_capacity_provider.perftracker: applying the planned Delete change
2021/04/21 16:06:38 [DEBUG] aws_ecs_cluster.ecs_cluster: applying the planned Delete change
aws_ecs_capacity_provider.perftracker: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:capacity-provider/capacity_provider_dev-odormond]
aws_ecs_cluster.ecs_cluster: Destroying... [id=arn:aws:ecs:eu-west-1:094585523650:cluster/ecs_perftracker_dev-odormond]
2021/04/21 16:06:38 [DEBUG] aws_autoscaling_group.asg: applying the planned Delete change
aws_autoscaling_group.asg: Destroying... [id=tf-asg-20210421114236113800000011]

Again, it only succeeded because I manually terminated the remaining EC2 instances.

Regarding the acceptance test shown above: IMHO it's too naive to expose the problem, since the ASG is configured with min, max, and desired capacity of 0, so no EC2 instances are ever registered with the cluster.

richardgavel commented 3 years ago

What about adding a "force_destroy" flag that would set the desired capacity to 0 on any capacity provider ASGs? This would be similar to aws_s3_bucket, where the provider deletes any objects in the bucket before issuing the delete-bucket API call when the corresponding flag is set.
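If such a flag existed, usage might look like this (purely hypothetical: force_destroy is not an argument of aws_ecs_capacity_provider today, unlike the real force_destroy on aws_s3_bucket):

```hcl
resource "aws_ecs_capacity_provider" "test" {
  name = "yakluster"

  # Hypothetical flag, mirroring aws_s3_bucket's force_destroy: on delete, the
  # provider would first scale the backing ASG to 0 and wait for its instances
  # to deregister from the cluster before deleting the capacity provider.
  force_destroy = true

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.test.arn
  }
}
```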

jaffel-lc commented 2 years ago

We've had to build a destroy-time workaround (Stack Overflow is your friend) made of shell code and the AWS CLI that walks from the cluster to the capacity provider to the instances. We set the ASG capacity to 0/0/0 and terminate the instances (among other things).
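One possible shape for such a workaround is a destroy-time local-exec provisioner that drains the ASG via the AWS CLI before the cluster delete stalls (an untested sketch; the resource wiring is an assumption, and destroy-time provisioners may only reference self, hence the triggers map):

```hcl
resource "null_resource" "drain_ecs_capacity" {
  # Destroy-time provisioners can only reference self, so stash the ASG name here.
  triggers = {
    asg_name = aws_autoscaling_group.asg.name
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      # Scale the ASG to 0/0/0 so its ECS capacity drains...
      aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "${self.triggers.asg_name}" \
        --min-size 0 --max-size 0 --desired-capacity 0
      # ...then terminate any instances that scale-in protection kept alive.
      ids=$(aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "${self.triggers.asg_name}" \
        --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)
      [ -n "$ids" ] && aws ec2 terminate-instances --instance-ids $ids || true
    EOT
  }
}
```

For the drain to run before the cluster delete, the null_resource would need to depend on the aws_ecs_cluster (Terraform destroys dependents first); getting that ordering right is the fiddly part.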

mavin-audacia commented 2 years ago

+1

This issue has been present since 2019: https://github.com/hashicorp/terraform-provider-aws/issues/4852

And it was moved to https://github.com/hashicorp/terraform-provider-aws/issues/11409, which is a separate issue in its own right and has not solved the problem.

github-actions[bot] commented 1 week ago

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!