terraform-aws-modules / terraform-aws-ecs

Terraform module to create AWS ECS resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/ecs/aws
Apache License 2.0

The recommended workaround for ignoring task definition changes causes the service's container_definitions to be overwritten on every Terraform apply, even ones that don't touch the service. #165

Open andrewedstrom opened 9 months ago

andrewedstrom commented 9 months ago

Description

My team triggers deploys to ECS via GitHub Actions, and we did not want to have to update Terraform whenever we do a deploy.

We could not use ignore_task_definition_changes, because that also (surprisingly) causes changes to the load balancer to be ignored (see https://github.com/terraform-aws-modules/terraform-aws-ecs/issues/154 and https://github.com/terraform-aws-modules/terraform-aws-ecs/issues/152).

According to the design doc, there is an officially supported workaround:

As an alternative, this module does provide a workaround that would support an external party making changes to something like the image of the container definition. In a scenario where the service, task definition, and container definition are all managed by Terraform, the following configuration could be used to allow an external party to change the image of the container definition without conflicting with Terraform, provided that the external party is also updating the image tag in a shared location that can be retrieved by Terraform (here, we are using SSM Parameter Store): [example follows]

We ended up going with this workaround; however, it has an undesirable bug. Now, after a deploy via GitHub Actions, whenever we make any change to our Terraform (even if it doesn't touch the ECS service in any way), the container_definitions of the service are recreated.

This does not cause downtime, but it does add extra noise whenever we make a change to our Terraform, and it makes it hard to tell during a terraform plan whether the container definition has actually changed or is just being recreated without anything changing.

Versions

Reproduction Code [Required]

Unfortunately, I do not have time to make a fully reproducible example. However, the official docs don't provide a full example of how to implement the workaround either, so I think it's fair to provide an example that is much more complete than the official docs and still shows the general shape of how this works.

Based on the comments in https://github.com/terraform-aws-modules/terraform-aws-ecs/issues/152, I get the impression that many teams want to use a similar setup, deploying with GitHub Actions, so it may be helpful to add something like this to the README or examples/ directory. Feel free to borrow from the code I've shared below as needed.

ECS Terraform:

# Cluster
module "ecs_cluster" {
  source       = "terraform-aws-modules/ecs/aws//modules/cluster"
  version      = "5.8.0"
  cluster_name = local.name

  default_capacity_provider_use_fargate = true
  fargate_capacity_providers            = {
    FARGATE = {
      default_capacity_provider_strategy = {
        weight = 100
        base   = 1
      }
    }
  }

  tags = local.tags
}

# Service
data "aws_ssm_parameter" "api_image_tag" {
  name = local.api_image_tag_parameter_name
}

module "api_ecs_service" {
  source      = "terraform-aws-modules/ecs/aws//modules/service"
  version     = "5.8.0"
  name        = "${local.name}-api-service"
  cluster_arn = module.ecs_cluster.arn

  cpu    = 512
  memory = 1024

  subnet_ids = local.private_subnets

  enable_execute_command = false

  # At least 2 tasks should be running at all times.
  # autoscaling_min_capacity is what actually ensures that at least 2 tasks are running;
  # desired_count is ignored by the module and setting it is a no-op
  # (https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/docs/README.md#service-1),
  # but we set it anyway out of superstition.
  desired_count            = 2
  autoscaling_min_capacity = 2

  load_balancer = {
    service = {
      target_group_arn = module.alb.target_groups["ecs_api_server"].arn
      container_name   = local.container_name
      container_port   = local.container_port
    }
  }
  security_group_rules = {
    alb_ingress_3000 = {
      type                     = "ingress"
      from_port                = local.container_port
      to_port                  = local.container_port
      protocol                 = "tcp"
      description              = "Service port"
      source_security_group_id = module.alb.security_group_id
    }
    egress_all = {
      type        = "egress"
      from_port   = 0
      to_port     = 0
      protocol    = "-1"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
  health_check_grace_period_seconds = 30

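  # Left false because enabling ignore_task_definition_changes would also cause
  # load balancer changes to be ignored (see #154 and #152 above)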
  ignore_task_definition_changes = false

  container_definitions = {
    (local.container_name) = {
      enable_cloudwatch_logging = true
      name                      = local.container_name
      image                     = "${local.ecr_repository}:${data.aws_ssm_parameter.api_image_tag.value}"
      cpu                       = 512
      memory                    = 1024
      privileged                = false
      essential                 = true

      port_mappings = [
        {
          name          = local.container_name
          containerPort = local.container_port
          hostPort      = local.container_port
          protocol      = "tcp"
        }
      ]
      environment = []
      secrets     = []
    }
  }
}

# Load Balancer
module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "~> 9.0"

  name = "api-load-balancer"

  load_balancer_type = "application"

  vpc_id  = local.vpc_id
  subnets = local.public_subnets

  enable_deletion_protection   = false
  security_group_ingress_rules = {
    all_http = {
      from_port   = 80
      to_port     = 80
      ip_protocol = "tcp"
      cidr_ipv4   = "0.0.0.0/0"
    }
    all_https = {
      from_port   = 443
      to_port     = 443
      ip_protocol = "tcp"
      description = "HTTPS web traffic"
      cidr_ipv4   = "0.0.0.0/0"
    }
  }
  security_group_egress_rules = {
    all = {
      ip_protocol = "-1"
      cidr_ipv4   = local.vpc_cidr_block
    }
  }

  listeners = {
    api_server_http_https_redirect = {
      port     = 80
      protocol = "HTTP"
      redirect = {
        port        = "443"
        protocol    = "HTTPS"
        status_code = "HTTP_301"
      }
    }
    api_server_https = {
      port            = 443
      protocol        = "HTTPS"
      certificate_arn = aws_acm_certificate.ecs_api.arn

      forward = {
        target_group_key = "ecs_api_server"
      }
    }
  }

  web_acl_arn = local.api_acl_arn

  target_groups = {
    ecs_api_server = {
      backend_protocol                  = "HTTP"
      backend_port                      = local.container_port
      target_type                       = "ip"
      deregistration_delay              = 5
      load_balancing_cross_zone_enabled = true

      health_check = {
        enabled             = true
        healthy_threshold   = 5
        interval            = 30
        matcher             = "200"
        path                = "/health"
        port                = "traffic-port"
        protocol            = "HTTP"
        timeout             = 5
        unhealthy_threshold = 4
      }

      # There's nothing to attach here in this definition. Instead,
      # ECS will attach the IPs of the tasks to this target group
      create_attachment = false
    }
  }
}
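The Terraform above reads the image tag from SSM but omits how that parameter is created. One possible companion definition (hypothetical, not part of the code I've shared, shown only to illustrate the shape of the workaround): Terraform seeds the parameter once and then ignores its value, so the GitHub Actions job can overwrite it on every deploy.

resource "aws_ssm_parameter" "api_image_tag" {
  name  = local.api_image_tag_parameter_name
  type  = "String"
  value = "bootstrap" # overwritten by the deploy pipeline; Terraform ignores later changes

  lifecycle {
    ignore_changes = [value]
  }
}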

GitHub Actions workflow to deploy:

name: Staging Deploy
on:
  push:
    branches:
      - main

concurrency:
  group: staging-deploy
  cancel-in-progress: false

permissions:
  id-token: write
  contents: read

env:
  ENV_NAME: ...
  ECS_CLUSTER: ...
  ECS_SERVICE: ...
  AWS_REGION: ...
  ROLE_TO_ASSUME: ...
  ECR_IMAGE: ...
  IMAGE_TAG_PARAMETER_NAME: ...
  CONTAINER_NAME: ... # from the container_definitions section of the task definition
  TASK_DEFINITION_FILE: task-definition.json # Not committed to source. Actually downloaded from ECS during the job

jobs:
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.build-image.outputs.image }}
      tag: ${{ steps.build-image.outputs.tag }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Build image
        id: build-image
        run: |
          TAG="staging-$(git rev-parse --short HEAD)"
          FULL_IMAGE_NAME="${ECR_IMAGE}:${TAG}"
          docker build -t "${FULL_IMAGE_NAME}" .
          echo "image=${FULL_IMAGE_NAME}" >>  "${GITHUB_OUTPUT}"
          echo "tag=${TAG}" >>  "${GITHUB_OUTPUT}"

      - name: Configure AWS credentials for image push
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Push image to Amazon ECR
        run: |
          docker push "${ECR_IMAGE}" --all-tags

  deploy-backend:
    name: Deploy API Server to ECS
    runs-on: ubuntu-latest
    needs: build
    env:
      AWS_REGION: us-east-2

    steps:
      - name: Configure AWS credentials for deployment
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Download existing task definition
        run: |
          aws ecs describe-task-definition --task-definition "${ECS_SERVICE}" --query taskDefinition > "${TASK_DEFINITION_FILE}"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.TASK_DEFINITION_FILE }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ needs.build.outputs.image }}

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true

      - name: Update SSM parameter of image
        run: |
          aws ssm put-parameter --name "${{ env.IMAGE_TAG_PARAMETER_NAME }}" --value "${{ needs.build.outputs.tag }}" --type String --overwrite
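
For debugging drift, a quick check of what each side currently sees (a sketch reusing the same env var names as the workflow, run locally):

# What the service is actually running right now...
aws ecs describe-services --cluster "$ECS_CLUSTER" --services "$ECS_SERVICE" \
  --query 'services[0].taskDefinition' --output text

# ...versus the tag Terraform will read from SSM on the next plan
aws ssm get-parameter --name "$IMAGE_TAG_PARAMETER_NAME" \
  --query 'Parameter.Value' --output text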

Expected behavior

After applying this Terraform and then deploying from GitHub Actions, if I change an unrelated .tf file in the same Terraform project and then run terraform plan, no changes should be shown for my container definition.

Actual behavior

The container definition gets updated and the service is recreated.

I am 100% sure this is not due to task definition drift outside of GitHub Actions and Terraform, as the task definition has never been updated except by GitHub Actions and Terraform. We deploy to multiple environments using the same scripts, and this bug appears in all of those environments.

Terminal Output Screenshot(s)

I cannot share a screenshot, but when I run terraform plan, it shows these two resources getting changed.

Additional context

bolshakoff commented 9 months ago

I can relate to you. 🙂 I ended up simply running this before working with terraform:

terraform apply -refresh-only

This reconciles the state with reality a bit, and so the subsequent terraform apply does not cause re-deploy.
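
Roughly, assuming a standard CLI workflow:

# Pull the externally registered task definition / updated SSM value into state,
# then plan as usual; the out-of-band deploy should no longer show up as a diff
terraform apply -refresh-only
terraform plan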

bolshakoff commented 9 months ago

Another option is to keep an eye on #171.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

jakub-pietrzak commented 7 months ago

I can relate to you. 🙂 I ended up simply running this before working with terraform:

terraform apply -refresh-only

This reconciles the state with reality a bit, and so the subsequent terraform apply does not cause re-deploy.

This workaround does not seem to work for me.

Here is the output of the refresh-only command:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # module.app_env.module.ecs.module.service_images["xxx"].aws_ssm_parameter.ignore_value[0] has changed
  ~ resource "aws_ssm_parameter" "ignore_value" {
        id             = "xxx"
      ~ insecure_value = "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx:version1" -> "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx:version2"
        name           = "xxx"
        tags           = {}
      ~ version        = 4 -> 5
        # (5 unchanged attributes hidden)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] has changed
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:18" -> "xxx:19"
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

When I run it again, Terraform detects no changes; however, a regular plan/apply still tries to update the service and task definition:

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                  = (known after apply)
      + arn_without_revision = (known after apply)
      + execution_role_arn   = (known after apply)
      + family               = (known after apply)
      + id                   = (known after apply)
      + network_mode         = (known after apply)
      + revision             = (known after apply)
      + status               = (known after apply)
      + task_definition      = "xxx"
      + task_role_arn        = (known after apply)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] will be updated in-place
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:19" -> (known after apply)
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:eu-central-1:xxx:task-definition/xxx:18" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:eu-central-1:xxx:task-definition/xxx" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image                  = "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx-xxx:version1" -> "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx-xxx:version2"
                    name                   = "xxx"
                    # (17 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ id                       = "xxx" -> (known after apply)
      ~ revision                 = 18 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)

        # (2 unchanged blocks hidden)
    }

Even if I manually register a new task definition in the AWS console, the following change would still cause the service to be redeployed:

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] will be updated in-place
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:20" -> "xxx:21"
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

FYI, I also apply the example with the latest image version stored in SSM.

pgusa commented 3 months ago

I am working on a similar task where our CI pipeline updates the image and creates a new revision in the console. Since Terraform is not aware of this change and still has a state file with some old revision, it tries to force-replace the task definition.

I am currently trying to implement logic to ignore these changes in my task definition module, but I cannot implement it cleanly, as I want the user to choose whether these changes should be ignored. The lifecycle block is useless here, and I am looking for suggestions if anyone knows of a case where a solution like this has been implemented.
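
For illustration (a hypothetical, stripped-down resource, not the module's code), the core constraint is that lifecycle arguments must be static, so the ignore cannot be toggled by a variable:

resource "aws_ecs_task_definition" "example" {
  family                   = "example"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "nginx:latest"
      essential = true
    }
  ])

  lifecycle {
    # Must be a literal list known at parse time; an expression such as
    #   ignore_changes = var.ignore_image_changes ? [container_definitions] : []
    # is rejected by Terraform, which is why a user-selectable toggle is hard
    ignore_changes = [container_definitions]
  }
}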

tetienne commented 3 months ago

@andrewedstrom Did you find a workaround since February? I am facing exactly the same situation.

guidodobboletta commented 2 months ago

I'm also facing the same issue as OP. I have a centralized place where Terraform and my CI/CD grab the same image value, yet this reports that there are changes all the time.

I checked the task definition generated by CI/CD and the one generated by Terraform, and they are exactly the same; the only differences are registeredAt, registeredBy, and revision. I tried adding tofu refresh before the actual terraform plan and we still see the same problem.
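
One way to verify that (a sketch; the family name and revision numbers are placeholders) is to diff two revisions with the always-different metadata stripped:

# Fetch both revisions and drop the fields that change on every registration
aws ecs describe-task-definition --task-definition my-api:18 --query taskDefinition \
  | jq 'del(.revision, .registeredAt, .registeredBy)' > td-18.json
aws ecs describe-task-definition --task-definition my-api:19 --query taskDefinition \
  | jq 'del(.revision, .registeredAt, .registeredBy)' > td-19.json
diff td-18.json td-19.json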

tetienne commented 2 months ago

@bryantbiggs Sorry to ping you, but do you have any idea how we can deal with this issue? Is there any workaround to this workaround? Being able to deploy through CI while managing the task through Terraform would be really perfect.

bryantbiggs commented 2 months ago

Unfortunately, I think we have exhausted the various hacks and avenues for trying to deal with this at the module/resource level. The solution ultimately needs to come from the ECS API; I don't know if there is anything that could be done at the Terraform AWS provider level.

tetienne commented 2 months ago

Thanks @bryantbiggs for your feedback. Is there any place where we can ask for such an improvement? Out of curiosity, was this workaround working at some point?

bryantbiggs commented 2 months ago

I would highly encourage you to reach out to your AWS account team and give them feedback on what you are trying to do with ECS and the challenges you are facing. This is how Amazon works (working backwards from the customer), and it's how we will drive actual change from the service side, instead of pursuing hacks through the limited set of functionality we have at our disposal at the resource/module level.

tetienne commented 2 months ago

Thank you for the recommendation, I've sent a message to our AWS support team. Regarding this issue, here's another discussion thread where it is being discussed extensively: https://github.com/hashicorp/terraform-provider-aws/issues/632

tetienne commented 2 months ago

@bryantbiggs So AWS support asked me to create an issue in the ECS roadmap repository. The problem is that I don't know what to ask for. I understand there is a gap in the ECS API, but which one exactly? May I ask you to create it? Perhaps the fact that you work at AWS could help increase the priority or help us find a workaround.

bryantbiggs commented 2 months ago

There are already various existing requests on the containers roadmap; I've listed a few below that I believe are related/relevant:

bryantbiggs commented 9 hours ago

We will attempt a change in v6.0, but that attempt is to remove hacks and move the discussion further upstream. That could be with the AWS provider or with the service itself.