hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

aws_ecs_service no longer retries when target group is not attached to load balancer #3495

Closed tomelliff closed 6 years ago

tomelliff commented 6 years ago

Terraform Version

1.9.0 AWS provider, all TF versions

Affected Resource(s)

  * aws_ecs_service

Terraform Configuration Files

Taken from testAccAWSEcsService_healthCheckGracePeriodSeconds, but removing the depends_on on the aws_lb_listener resource:

data "aws_availability_zones" "available" {}

resource "aws_vpc" "main" {
  cidr_block = "10.10.0.0/16"
  tags {
    Name = "foo"
  }
}

resource "aws_subnet" "main" {
  count = 2
  cidr_block = "${cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)}"
  availability_zone = "${data.aws_availability_zones.available.names[count.index]}"
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_ecs_cluster" "main" {
  name = "foo"
}

resource "aws_ecs_task_definition" "with_lb_changes" {
  family = "foo"
  container_definitions = <<DEFINITION
[
  {
    "cpu": 256,
    "essential": true,
    "image": "ghost:latest",
    "memory": 512,
    "name": "ghost",
    "portMappings": [
      {
        "containerPort": 2368,
        "hostPort": 8080
      }
    ]
  }
]
DEFINITION
}

resource "aws_iam_role" "ecs_service" {
  name = "foo"
  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "ecs_service" {
  name = "foo"
  role = "${aws_iam_role.ecs_service.name}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_lb_target_group" "test" {
  name = "foo"
  port = 80
  protocol = "HTTP"
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_lb" "main" {
  name            = "foo"
  internal        = true
  subnets         = ["${aws_subnet.main.*.id}"]
}

resource "aws_lb_listener" "front_end" {
  load_balancer_arn = "${aws_lb.main.id}"
  port = "80"
  protocol = "HTTP"

  default_action {
    target_group_arn = "${aws_lb_target_group.test.id}"
    type = "forward"
  }
}

resource "aws_ecs_service" "with_alb" {
  name = "foo"
  cluster = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.with_lb_changes.arn}"
  desired_count = 1
  health_check_grace_period_seconds = %d
  iam_role = "${aws_iam_role.ecs_service.name}"

  load_balancer {
    target_group_arn = "${aws_lb_target_group.test.id}"
    container_name = "ghost"
    container_port = "2368"
  }

  depends_on = [
    "aws_iam_role_policy.ecs_service",
  ]
}

Expected Behavior

The ECS service should be created at the same time as the LB listener, because both depend only on the LB target group. At that point the target group may not yet be attached to the load balancer, because the aws_lb_listener resource hasn't finished being created. This throws an InvalidParameterException which, before @bflad's change in https://github.com/terraform-providers/terraform-provider-aws/pull/3240, was retried.

Actual Behavior

Now it just throws the error and doesn't retry.

Steps to Reproduce

  1. terraform apply

References

Looks like we just need to re-add the InvalidParameterException retry, but I'm wary of doing so without understanding why @bflad removed it in the first place. We should probably also remove that depends_on in the acceptance tests, although I think it's still needed for the IAM role policy.

Note that I'm unable to add a depends_on to the listener rule because I have nested modules: one module creates an ECS service (potentially a worker-based, non-load-balanced service), and another creates the load balancer, sets up security groups, etc., and uses the ECS service module, telling it to use the load-balanced service resource. I can provide the config if necessary, but ultimately I don't think we should force people to add a depends_on for a race condition that would resolve itself if we simply retried for as little as a couple of seconds.

bflad commented 6 years ago

@tomelliff can you please provide the full InvalidParameterException error message?

Looks like we just need to re-add the InvalidParameterException retry but I'm wary of doing without understanding why @bflad removed it in the first place.

We removed the blanket InvalidParameterException handling because there were plenty of cases where it retried on errors that would never be fixed by waiting, such as pure misconfigurations. In those cases it's a poor user experience to have operators wait multiple minutes to discover their configuration errors. Since AWS uses the same error code for scenarios that might eventually succeed with retrying and for misconfigurations, I think we want to strike a good balance by matching on specific messages and conditions only.

I would personally say that banking on retries for Terraform to eventually create the appropriate resource is not an ideal scenario; many factors could contribute to this error still occurring even if retries are put in place.

Adding the retries for the load balancer listener might provide some convenience and help sometimes, but it does not fix the underlying requirement for ECS wanting the explicit ordering of resources before it.

We should probably remove that depends_on as well in the acceptance tests although I think it's still needed for the IAM role policy.

I would personally disagree here, as the explicit depends_on consistently orders the Terraform resource creation in the way that AWS expects. It is also how we would recommend operators set up their configurations, since it prevents possible errors. If it's not in the documentation, we should probably add it, even if the retries are added.

Until Terraform core supports us configuring something like waiting for all child resources of a parent resource to complete, the depends_on here is a necessary evil 🙈. We would recommend these explicit depends_on in various other places where AWS requires other layered dependencies (such as depending on an Internet Gateway in a public ECS service configuration, since AWS requires the subnets configured in the ECS service to have an Internet Gateway attached).
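In the 0.11-era syntax used in this issue, the explicit ordering described above might look like the following sketch, reusing the resource names from the reproduction config at the top of this issue:

```hcl
resource "aws_ecs_service" "with_alb" {
  name            = "foo"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.with_lb_changes.arn}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.ecs_service.name}"

  load_balancer {
    target_group_arn = "${aws_lb_target_group.test.id}"
    container_name   = "ghost"
    container_port   = "2368"
  }

  # Explicit ordering: wait for the listener (which attaches the target
  # group to the load balancer) and the IAM policy before creating the
  # service, so ECS never sees an unattached target group.
  depends_on = [
    "aws_lb_listener.front_end",
    "aws_iam_role_policy.ecs_service",
  ]
}
```

With the listener in depends_on, the InvalidParameterException race cannot occur, at the cost of serializing listener and service creation.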

I hope this makes sense!

tomelliff commented 6 years ago

Full error was:

InvalidParameterException: The target group with targetGroupArn arn:aws:elasticloadbalancing:eu-west-1:123456789:targetgroup/targetGroupId does not have an associated load balancer

I think I saw something in the code base a while back that matched on the error message as well as the status code, so maybe that could be used here if you want to minimise how wide that retry logic is. I'm not sure if there's a good way to have different retry timeouts in Terraform either, but listener creation is barely slower than instant, so dropping that timeout to a tiny amount would work for me.

The explanation makes sense, but I'm stuck right now because I can't see a way of linking the listener from the parent module to the child module, even in a hacky way. So I can't force a dependency in Terraform without core enabling modules to depend on things and hold off doing anything in the module until the depends_on is complete.
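For 0.11, one hacky workaround (a sketch, not something proposed in this thread; listener_dependency is an invented variable name) is to thread a value computed from the listener into the child module, so a resource inside it picks up an implicit dependency on the listener:

```hcl
# Parent module: pass something computed from the listener into the child.
module "service" {
  source              = "./service"
  listener_dependency = "${aws_lb_listener.front_end.arn}"
}

# Child module (./service/main.tf): interpolating the variable anywhere in
# the resource forces Terraform to create the listener first.
variable "listener_dependency" {
  default = ""
}

resource "aws_ecs_service" "this" {
  # replace() strips the ARN back out, so the name is unchanged but the
  # implicit dependency on the listener remains.
  name = "foo${replace(var.listener_dependency, "/.*/", "")}"

  # ... rest of the service configuration ...
}
```

This is fragile and obscure, but until core supports module-level depends_on it does give the ordering guarantee.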

Right now 1.9.0 breaks any time we deploy a new environment/service and we need to retry. I could probably force Gitlab CI to retry the job automatically but I'd rather not have that there long term.

Longer term the plan is to move to a single ALB per ECS cluster and add listener rules for each service and environment once the auto priority PR is merged but I'm not sure if I'm going to have the same race condition there.

bflad commented 6 years ago

Sorry for the delay here. 😅 The pull request to allow some retries for this condition has been merged into master and will release with version 1.33.0 of the AWS provider, likely middle of next week.

Please note: we'll continue to recommend the usage of depends_on in the aws_ecs_service resource pointing to the aws_lb_target_group, where possible. 👍

bflad commented 6 years ago

This has been released in version 1.33.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

jritsema commented 6 years ago

FWIW @bflad, I just tested this and still got the error...

* aws_ecs_service.app: InvalidParameterException: The target group with targetGroupArn arn:aws:elasticloadbalancing:us-east-1:367555685970:targetgroup/depends-on-test-dev/b96ad7836a2fe8e7 does not have an associated load balancer.

resource "aws_ecs_service" "app" {
  ...
  load_balancer {
    target_group_arn = "${aws_alb_target_group.main.id}"
    container_name   = "${var.container_name}"
    container_port   = "${var.container_port}"
  }

  depends_on = [
    "aws_alb_target_group.main",
  ]
}

Tested with:

Terraform v0.11.8
+ provider.aws v1.33.0

HughWarrington commented 5 years ago

I believe this issue is not fixed. I get the above error on every run.

I'm using

Terraform v0.11.10
+ provider.aws v1.41.0

I can't add an explicit depends_on since my ALB is defined in a different module from the ECS service.

bleggett commented 5 years ago

Can confirm this is not fixed, or the retries are not working.

tomelliff commented 5 years ago

Can you share a minimal, complete example of Terraform code that errors out with that error? I've just run the TestAccAWSEcsService_withAlb acceptance test, which has no depends_on on the ALB/listener, and it seems to work fine.

I have since refactored how we deploy ECS services behind ALBs, so I no longer have this issue, but it was definitely working fine and no longer erroring for us as of up to a month ago, when previously it would error regularly.

Test output:

=== RUN   TestAccAWSEcsService_withAlb
=== PAUSE TestAccAWSEcsService_withAlb
=== CONT  TestAccAWSEcsService_withAlb
--- PASS: TestAccAWSEcsService_withAlb (317.78s)
PASS

Process finished with exit code 0

HughWarrington commented 5 years ago

Thanks for your work on this so far.

Here's a much reduced version of my setup.

I'm afraid it's still fairly long since it uses a 3rd party module to create a VPC to isolate it from other resources.

Also I've kept the ECS service in a separate module myapp, since after reading related tickets it seems that could be a relevant detail.

./vars.tf

variable "aws_region" {
  default = "eu-central-1"
}

variable "azs" {
  default = [
    "eu-central-1a",
    "eu-central-1b",
  ]
}

variable "cidr_private" {
  default = [
    "10.0.1.0/24",
    "10.0.2.0/24",
  ]
}

variable "cidr_public" {
  default = [
    "10.0.101.0/24",
    "10.0.102.0/24",
  ]
}

./main.tf

provider "aws" {
  region  = "${var.aws_region}"
}

module "base_vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "1.46.0"

  name = "tf-vpc"
  cidr = "10.0.0.0/16"

  azs = [
    "${var.azs}",
  ]

  private_subnets = [
    "${var.cidr_private}",
  ]

  public_subnets = [
    "${var.cidr_public}",
  ]

  enable_nat_gateway = true
  single_nat_gateway = true
}

resource "aws_alb" "alb_for_fargate" {
  internal = false

  subnets = [
    "${module.base_vpc.public_subnets}",
  ]
}

resource "aws_ecs_cluster" "my_cluster" {
  name = "mycluster"
}

module "myapp_instance" {
  source = "./myapp"

  alb_arn          = "${aws_alb.alb_for_fargate.arn}"
  ecs_cluster_arn  = "${aws_ecs_cluster.my_cluster.arn}"
  private_subnets  = ["${module.base_vpc.private_subnets}"]
  vpc_arn          = "${module.base_vpc.vpc_id}"
}

./myapp/vars.tf

variable "alb_arn" {}

variable "ecs_cluster_arn" {}

variable "private_subnets" {
  type = "list"
}

variable "vpc_arn" {}

./myapp/main.tf

resource "aws_alb_target_group" "my_tg" {
  protocol    = "HTTP"
  port        = "80"
  vpc_id      = "${var.vpc_arn}"
  target_type = "ip"
}

resource "aws_ecr_repository" "my_repo" {
  name = "myapp"
}

resource "aws_ecs_task_definition" "my_td" {
  family                   = "myapp"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = <<EOF
[
  {
    "name": "myapp",
    "image": "myapp:latest",
    "networkMode": "awsvpc",
    "portMappings": [
      {
        "containerPort": 80,
        "protocol": "tcp"
      }
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ]
  }
]
EOF
}

resource "aws_ecs_service" "my_service" {
  name            = "myapp"
  cluster         = "${var.ecs_cluster_arn}"
  launch_type     = "FARGATE"
  task_definition = "${aws_ecs_task_definition.my_td.arn}"

  network_configuration = {
    subnets = ["${var.private_subnets}"]
  }

  load_balancer {
    target_group_arn = "${aws_alb_target_group.my_tg.arn}"
    container_name   = "myapp"
    container_port   = 80
  }
}
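A sketch of how this layout could express the required ordering: the reduced config above passes alb_arn into the myapp module but doesn't show a listener, so assuming the listener is created inside myapp, the service can then use a direct depends_on (the listener resource here is an assumption, not part of the original config):

```hcl
# ./myapp/main.tf (addition): a listener inside the module can attach the
# target group to the ALB passed in via alb_arn.
resource "aws_alb_listener" "front_end" {
  load_balancer_arn = "${var.alb_arn}"
  port              = "80"
  protocol          = "HTTP"

  default_action {
    target_group_arn = "${aws_alb_target_group.my_tg.arn}"
    type             = "forward"
  }
}

resource "aws_ecs_service" "my_service" {
  # ... configuration as above ...

  # Because the listener lives in the same module as the service, a direct
  # depends_on works, avoiding the cross-module dependency problem.
  depends_on = ["aws_alb_listener.front_end"]
}
```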

ezequielarevalo-natgeo commented 5 years ago

This is not fixed! Using the load_balancer block for aws_ecs_service fails with: InvalidParameterException: The target group with targetGroupArn arn:aws:elasticloadbalancing:us-east-1:367555685970:targetgroup/depends-on-test-dev/b96ad7836a2fe8e7 does not have an associated load balancer. The problem is the load_balancer option on aws_ecs_service: it doesn't matter whether it depends on the target group's creation, it still fails trying to attach the service to the target group. If you run this twice, the second run works because the target group was already attached during the first run.

hakro commented 5 years ago

I still have this issue too. It fails the first time, but the second time, it works. Terraform v0.12.8

ghost commented 4 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!