hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Destroy aws_ecs_service.service on Fargate gets stuck #3414

Open varas opened 6 years ago

varas commented 6 years ago

Destroy gets stuck on resource aws_ecs_service on Fargate until you manually stop all the tasks.

Terraform Version

Terraform v0.11.3

Affected Resource(s)

aws_ecs_service

Terraform Configuration Files

resource "aws_ecs_service" "service" {
  name            = "..."
  cluster         = "${aws_ecs_cluster.cluster.id}"
  task_definition = "${aws_ecs_task_definition.task.arn}"
  desired_count   = 1
  health_check_grace_period_seconds = 1

  load_balancer = {
    target_group_arn = "${aws_alb_target_group.main.arn}"
    container_name   = "..."
    container_port   = 5555
  }

  launch_type = "FARGATE"

  network_configuration {
    security_groups = ["${aws_security_group.awsvpc_sg.id}"]
    subnets         = ["${module.vpc.private_subnets}"]
  }

  depends_on = ["aws_alb_listener.main"]
}

Debug Output

aws_ecs_service.service: Still destroying... (ID: arn:aws:ecs:us-east-1:218277271359:service/blink, 10s elapsed)
aws_ecs_service.service: Still destroying... (ID: arn:aws:ecs:us-east-1:218277271359:service/blink, 20s elapsed)
...

Expected Behavior

In order to destroy the Fargate ECS service, Terraform should first stop all of the service's tasks.

Actual Behavior

It gets stuck trying to destroy the resource.

Steps to Reproduce

Simply launch a Fargate cluster using launch_type = "FARGATE", then run:

  1. terraform apply
  2. terraform destroy
rnemec-ng commented 5 years ago

Is this going to be looked into? Are there any workarounds (preferably without manual intervention)? Thanks

marcotesch commented 5 years ago

The ecs_service resource's delete operation already drains the tasks within a service.

So this might not be an open issue anymore? @bflad ?

nwade615 commented 5 years ago

This is happening to me, as well. The CLI gets stuck on aws_ecs_service.api: Still destroying.... In the AWS console, the ECS service appears destroyed, but the running tasks remain. Strangely, it only happens with one of my Fargate services, not all of them. I must manually stop the tasks in the console for the destroy to continue.

Terraform v0.11.13 provider.aws v2.2.0
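
For anyone hitting the same thing, the manual step can also be scripted with the AWS CLI instead of clicking through the console. A rough sketch, with placeholder cluster ("my-cluster") and service ("api") names:

# Stop every running task of the stuck service so the destroy can finish.
# "my-cluster" and "api" are placeholders for the affected cluster and service.
for task in $(aws ecs list-tasks --cluster my-cluster --service-name api --query 'taskArns[]' --output text); do
  aws ecs stop-task --cluster my-cluster --task "$task"
done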

bavibm commented 5 years ago

Hello all, I'm also getting this issue with Terraform v0.12.6 and AWS provider v2.23.0.

This is my ECS configuration, excluding load balancer and other network-related resources (replacing details with "X"):

# ecs.tf

resource "aws_ecs_cluster" "X" {
  name = var.name_prefix
}

resource "aws_ecs_task_definition" "X" {
  family                   = "${var.name_prefix}-X"
  execution_role_arn       = "arn:aws:iam::X:role/ecsTaskExecutionRole"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]

  cpu                   = 1024
  memory                = 2048
  container_definitions = file("${path.module}/task-definitions/X.json")

}

resource "aws_ecs_service" "X" {
  name            = "${var.name_prefix}-X"
  cluster         = aws_ecs_cluster.X.id
  task_definition = aws_ecs_task_definition.X.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.X.id]
    subnets         = [var.service_subnet_id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.X.id
    container_name   = "X"
    container_port   = var.X
  }

  depends_on = [aws_lb_listener.X]
}

Once it starts to destroy my ecs resources, it hangs at aws_ecs_service.X still destroying... I have to manually go into the ECS management console and stop the running tasks in this service, cancel my Terraform destroy, and re-issue the command for it to work.

I am currently looking into using the local-exec provisioner to execute AWS CLI commands on the destroy stage for the service resource in order to automatically stop all tasks running in it as a workaround.

bavibm commented 5 years ago

So I managed to get the aforementioned workaround working for my specific case. I created a shell script that gets executed by Terraform to stop the ECS tasks before destroying the service so that it won't get stuck. This requires the AWS CLI to be installed and configured on the same machine.

Here is what it looks like:

#!/usr/bin/env bash

if [ -z ${REGION} ] || [ -z ${CLUSTER} ] || [ -z ${SERVICE} ]; then
  echo "Please specify a region, cluster name, and service name..."
  exit 1
fi

if ! [ -x "$(command -v aws)" ]; then
  echo "The AWS CLI is not installed..." >&2
  exit 1
else
  echo "AWS CLI found!"
fi

aws ecs list-tasks \
  --region ${REGION} \
  --cluster ${CLUSTER} \
  --service-name ${SERVICE} \
  --output text \
  >$(dirname $0)/.tasks

IFS=$'\n'
arns=($(awk '/TASKARNS/ {print $2}' $(dirname $0)/.tasks))

rm $(dirname $0)/.tasks

# Stop every task belonging to the service so the destroy can proceed
for arn in "${arns[@]}"; do
  echo "Stopping task ${arn}..."
  aws ecs stop-task --region ${REGION} --cluster ${CLUSTER} --task ${arn}
done

Copy and paste the script somewhere in your module or root, such as {module}/scripts/stop-tasks.sh, and add the local-exec provisioner inside your ECS service resource so it looks something like this:

resource "aws_ecs_service" "X" {

  ...

  provisioner "local-exec" {
    when = "destroy"
    command = "${path.module}/scripts/stop-tasks.sh > ${path.module}/scripts/stop-tasks.out"
    environment = {
      REGION = var.region,
      CLUSTER = aws_ecs_cluster.X.name,
      SERVICE = aws_ecs_service.X.name
    }
  }
}

I haven't tested it in other situations, but feel free to use and modify at your leisure! I hope this issue gets fixed soon

sethhochberg commented 3 years ago

Following up with another possible workaround, for any who need it. We took inspiration from @bavibm's solution and implemented a destroy provisioner on the cluster resource which stops all tasks, idles the service, and waits for things to reach a state where the cluster itself can be destroyed.

The important part of your script:

SERVICES="$(aws ecs list-services --cluster "${CLUSTER}" | { grep "${CLUSTER}" || true; } | sed -e 's/"//g' -e 's/,//')"
for SERVICE in $SERVICES ; do
  # Idle the service that spawns tasks
  aws ecs update-service --cluster "${CLUSTER}" --service "${SERVICE}" --desired-count 0

  # Stop running tasks
  TASKS="$(aws ecs list-tasks --cluster "${CLUSTER}" --service "${SERVICE}" | { grep "${CLUSTER}" || true; } | sed -e 's/"//g' -e 's/,//')"
  for TASK in $TASKS; do
    aws ecs stop-task --cluster "${CLUSTER}" --task "$TASK"
  done

  # Delete the service after it becomes inactive
  aws ecs wait services-inactive --cluster "${CLUSTER}" --service "${SERVICE}"
  aws ecs delete-service --cluster "${CLUSTER}" --service "${SERVICE}"
done

Your cluster definition:

resource "aws_ecs_cluster" "whatevername" {
  name = "whatever_cluster_name"

  provisioner "local-exec" {
    when = destroy
    command = "${path.module}/scripts/stop-tasks.sh"
    environment = {
      CLUSTER = self.name
    }
  }
}

Because of https://github.com/hashicorp/terraform/issues/23679, we rely only on self references to pass data into the cleanup script, and discover the rest from the cluster data available via the AWS CLI. Our AWS profile and region are set via other configuration on the host that executes the script.

bclabs-kylian commented 2 years ago

This happened to me as well. I had to manually remove the ECS security group from the RDS security group's rules.

moazzamk commented 1 year ago

This is happening to me. If I try to delete the security group manually (through the AWS console), it says it is being used by a network interface. If I try to delete the network interface, it says it is being used by the security group.
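
In case it helps with debugging that deadlock: the network interfaces still referencing the security group can be listed with the CLI, and for Fargate (awsvpc) tasks those ENIs are released automatically once the owning tasks are stopped. The security group ID below is a placeholder:

# List the ENIs that still reference the security group (placeholder sg ID).
aws ec2 describe-network-interfaces \
  --filters Name=group-id,Values=sg-0123456789abcdef0 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,Status]' \
  --output table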

davidbudnick commented 7 months ago

Still happening, any solution?

module.ecs.aws_ecs_service.keep_ui_service_staging: Still destroying... [id=arn:aws:ecs:us-east-1:905418292571:serv...luster-staging/keep-ui-service-staging, 1m0s elapsed]
module.ecs.aws_ecs_service.keep_ui_service_staging: Still destroying... [id=arn:aws:ecs:us-east-1:905418292571:serv...luster-staging/keep-ui-service-staging, 1m10s elapsed]

(It hit almost 6 mins before I manually killed the job)

I was required to manually run: terraform state rm module.ecs.aws_ecs_service.keep_ui_service_staging

davidbudnick commented 7 months ago

Update:

It seems the team is aware of the issue and has suggested adding a depends_on for the policy. REF: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ecs_service

I was able to get it working here without having to add any extra scripts. Finished in around 2m40s. (Could be related to the timeout of the container while draining)
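
Roughly what that looks like, as a sketch; the policy name (aws_iam_role_policy.ecs_service) and the other arguments are placeholders, the docs only say to declare the dependency explicitly:

resource "aws_ecs_service" "keep_ui_service_staging" {
  # ... existing arguments ...

  # Keep the IAM policy alive until the service is fully destroyed;
  # otherwise ECS can no longer drain and stop the tasks.
  depends_on = [aws_iam_role_policy.ecs_service]
}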

Edit: As per the docs:

The following target group attributes are supported. You can modify these attributes only if the target group type is instance or ip. If the target group type is alb, these attributes always use their default values. RE: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html

Therefore it will be 300 seconds by default, but the bonus is that the resource deletes without having to manually stop the job 🥳
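
If the default 300-second drain is too slow, deregistration_delay can be lowered on the target group (only for instance/ip target types, per the docs above). A sketch with placeholder names and values:

resource "aws_lb_target_group" "keep_ui" {
  name        = "keep-ui-staging"
  port        = 3000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = aws_vpc.main.id

  # Default is 300 seconds; a lower value shortens the drain during destroy.
  deregistration_delay = 30
}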

Overall, if you don't add the depends_on for the policy, it will never finish.

Most likely the issue can be closed 📕

javierguzman commented 5 months ago

I "fixed" this by setting the desired count to zero, similar to what others as previously done:

provisioner "local-exec" {
    when = destroy
    command = <<EOF
    echo "Update service desired count to 0 before destroy."
    REGION=${split(":", self.cluster)[3]}
    aws ecs update-service --region $REGION --cluster ${self.cluster} --service ${self.name} --desired-count 0 --force-new-deployment
    echo "Update service command executed successfully."
    EOF
  }

  timeouts {
    delete = "5m"
  }

I guess this should be done automatically by the provider.

julianevanneeleman commented 2 months ago

If your service has a static desired_count, an alternative work-around could be to use an aws_appautoscaling_target:

resource "aws_appautoscaling_target" "static_capacity" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.my_cluster.name}/${aws_ecs_service.my_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1
  max_capacity       = 1
}

Set min_capacity and max_capacity to whatever value you want desired_count to be. Make sure you remove the desired_count value from your aws_ecs_service (or set it to 0), and add the following lifecycle block to the aws_ecs_service to prevent configuration drift:

lifecycle {
  ignore_changes = [desired_count]
}

This should make terraform destroy succeed in a single pass.