hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

aws_ecs_task_definition and continuous delivery to ecs #632

Closed · hashibot closed this issue 3 years ago

hashibot commented 7 years ago

This issue was originally opened by @dennari as hashicorp/terraform#13005. It was migrated here as part of the provider split. The original body of the issue is below.


With the task and container definition data sources I'm almost able to get our continuous delivery setup to play nicely with Terraform. We rebuild the docker image with a unique tag at every deployment. This means that after the CI service redeploys a service, the corresponding task definition's revision is incremented and the image field in a container definition changes.

I don't seem to be able to create a setup where the task definition can be managed by Terraform in this scenario.

Terraform Version

v0.9.1

Affected Resource(s)

# Simply specify the family to find the latest ACTIVE revision in that family.
data "aws_ecs_task_definition" "mongo" {
  task_definition = "${aws_ecs_task_definition.mongo.family}"
}
data "aws_ecs_container_definition" "mongo" {
  task_definition = "${data.aws_ecs_task_definition.mongo.id}"
  container_name  = "mongodb"
}

resource "aws_ecs_cluster" "foo" {
  name = "foo"
}

resource "aws_ecs_task_definition" "mongo" {
  family = "mongodb"
  container_definitions = <<DEFINITION
[
  {
    "cpu": 128,
    "environment": [{
      "name": "SECRET",
      "value": "KEY"
    }],
    "essential": true,
    "image": "${aws_ecs_container_definition.mongo.image}",
    "memory": 128,
    "memoryReservation": 64,
    "name": "mongodb"
  }
]
DEFINITION
}

resource "aws_ecs_service" "mongo" {
  name          = "mongo"
  cluster       = "${aws_ecs_cluster.foo.id}"
  desired_count = 2
  # Track the latest ACTIVE revision
  task_definition = "${aws_ecs_task_definition.mongo.family}:${max("${aws_ecs_task_definition.mongo.revision}", "${data.aws_ecs_task_definition.mongo.revision}")}"
}

The problem then is that after a CI deployment, Terraform wants to create a new task definition: the task definition resource here points to an earlier revision, so the image field is considered changed.

With the deprecated template resources, I was able to ignore changes to variables, which solved this issue. One solution that comes to mind would be the ability to set the revision of the aws_ecs_task_definition resource.

I'd be grateful for any and all insights.

kurtwheeler commented 7 years ago

I have run into this issue as well. I think the solution I am going to go with is to not have the task definition be managed by terraform. Circle CI has a blog post about how to push a new task definition via a script they provide.

I agree that the ability to set the revision of the aws_ecs_task_definition would enable managing the task definition via Terraform. However, philosophically it does seem to break Terraform's model of having resource blocks correspond to resources within AWS. If that parameter were added, then a single aws_ecs_task_definition resource block would be responsible for creating multiple AWS resources.

JDiPierro commented 7 years ago

I've gotten around this by using terraform taint as part of the deploy process:
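The original snippet isn't quoted here; a minimal sketch of what such a deploy step might look like (the resource address aws_ecs_task_definition.app is a placeholder, not from the original comment):

# Force the task definition to be recreated on the next apply; the ECS service then
# rolls out new tasks built from the freshly pushed image.
terraform taint aws_ecs_task_definition.app
terraform apply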

naveenb29 commented 7 years ago

@JDiPierro While using the taint solution, does it not kill the current tasks and replace them, rather than deploying the new tasks and then draining the old ones?

JDiPierro commented 7 years ago

@naveenb29 Nope. I believe that would be the case if you were tainting the ECS service. Since just the task definition is being recreated, the ECS service is updated, causing the new tasks to deploy. ECS waits for them to become healthy and then kills the old containers.

tomelliff commented 7 years ago

Our CI process tags the image as it's pushed to ECR and then passes that tag to the task definition. This automatically leads to the task definition changing, so Terraform knows to recreate it; that in turn is linked to the ECS service, causing the service to be updated with the new task definition.

I think I'm missing the issue that others are having here.

That said, I'd like Terraform to (optionally) wait for the deployment to complete: the new task definition has running tasks equal to the desired count and, potentially, the other task definitions have been deregistered so no new traffic will reach them. I can't see a nice way to get at that information from the API, though; the events don't really expose enough detail, so you'd probably have to describe the old task definition, find the running tasks using it, find the ports they're running on, and check that the ALB they're registered with has all of those ports set to draining.

For now I'm simply shelling out and waiting until the PRIMARY service deployment has a running count equal to the desired count (which doesn't catch the short window between the PRIMARY tasks being registered and the old tasks being deregistered), or waiting until the deployment list has a length of 1 (all old task definitions have been completely drained, which is overkill, since new connections won't arrive there and the deployment can be considered complete before that).
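For reference, a rough sketch of that kind of wait loop (cluster and service names are placeholders; it polls the PRIMARY deployment until its running count matches its desired count):

#!/bin/bash
# Wait for the PRIMARY deployment of an ECS service to reach its desired count.
CLUSTER="my-cluster"   # placeholder
SERVICE="my-service"   # placeholder

until [ "$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
  | jq -r '.services[0].deployments[] | select(.status == "PRIMARY") | .runningCount == .desiredCount')" = "true" ]; do
  echo "Waiting for PRIMARY deployment to reach desired count..."
  sleep 10
done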

radeksimko commented 7 years ago

Hi folks, see more detailed explanation of why this is happening at https://github.com/terraform-providers/terraform-provider-aws/issues/13#issuecomment-318700677

Labelling appropriately...

dmikalova commented 7 years ago

I was able to solve the inactive task definition issue with the example in the ECS task definition data source. You set up the ECS service resource to use the max revision of either what your Terraform resource has created or what is currently in AWS, which the data source retrieves.

The one downside is that if someone changes the task definition, Terraform will not realign it to what's defined in code.

Esya commented 6 years ago

What do you guys think about having a remote backend (it's S3 in my case) and having your CI pipeline create the new task definition and change the .tfstate file directly to match it?

For example, mine looks like this:

"aws_ecs_task_definition.backend_service": {
    "type": "aws_ecs_task_definition",
    "depends_on": [
        "data.template_file.task_definition"
    ],
    "primary": {
        "id": "backend-service",
        "attributes": {
            "arn": "arn:aws:ecs:eu-west-1:REDACTED:task-definition/backend-service:8", // This could be changed manually
            "container_definitions": "REDACTED",
            "family": "backend-service",
            "id": "backend-service",
            "network_mode": "",
            "placement_constraints.#": "0",
            "revision": "8", // This could be increased manually
            "task_role_arn": ""
        },
        "meta": {},
        "tainted": false
    },
    "deposed": [],
    "provider": ""
},

Couldn't we just change the arn and the revision, so that the next time Terraform runs, it still thinks it has the "latest version" of the task definition in its state?
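For what it's worth, a heavily hedged sketch of that idea using terraform state pull/push rather than editing the S3 object in place (the resource address matches the snippet above; NEW_REVISION is a placeholder, the jq paths assume a 0.11-era state layout, and hand-editing state like this is easy to get wrong):

# Pull the state, point the task definition at the revision the CI job just registered,
# bump the serial so the backend accepts the push, and push it back.
terraform state pull > state.json
jq --arg rev "$NEW_REVISION" '
  .serial += 1 |
  .modules[0].resources["aws_ecs_task_definition.backend_service"].primary.attributes.revision = $rev |
  .modules[0].resources["aws_ecs_task_definition.backend_service"].primary.attributes.arn |= sub(":[0-9]+$"; ":" + $rev)
' state.json > state.updated.json
terraform state push state.updated.json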

chriswhelix commented 6 years ago

I'm not sure I understand the problem y'all are trying to solve here. Why not just use terraform to create the new task definition in the first place, and then your tf state is always consistent? Our setup is similar to what @tomelliff describes.

Esya commented 6 years ago

@chriswhelix Well in my particular case, I have two separate repositories. One that holds the terraform project, and it creates my ECS cluster, my services, and the initial task definition.

The other one is for a specific service, and I'd like to have some CI/continuous delivery flow in place (Using gitlab pipelines in my case) to "containerize" the project, push it to ECR, and trigger a service update on my ECS cluster. (Edit: as a reminder, currently, if we use the aws cli to do this as part of our CI workflow, then the next terraform run will overwrite the task def.)

So, when you say "use terraform to create the new task definition in the first place", are you implying that on our CI system, when pushing our service's code, we should also clone our terraform repo, change the variable that holds the image tag for that service, do a terraform apply, and commit + push to the TF repository?

tl;dr: Need a way to trigger service updates from any of our projects' build pipeline, without any user interaction with terraform.

chriswhelix commented 6 years ago

@Esya what we do is that each project has in its build config the version of the terraform repo it is expecting to be deployed with. When the CI pipeline is ready to deploy, it pulls down the terraform repo using the git tag specified in the project build config, then runs terraform against that, providing the image tag it just wrote to the ECR repo as an input variable.

We don't write down the ECR image tag in the terraform repo; it must be provided each time terraform is run. That way, simple code updates to projects don't require any change to the terraform repo.
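A condensed sketch of that kind of pipeline step (the repo URL, tag, and variable names are illustrative, not from the original comment):

# Check out the terraform repo at the tag pinned in the project's build config,
# then apply it with the image tag that was just pushed to ECR.
git clone --branch "$TERRAFORM_REPO_TAG" "$TERRAFORM_REPO_URL" infra
cd infra
terraform init
terraform apply -var "image_tag=${IMAGE_TAG}"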

schmod commented 6 years ago

I'm using ecs-deploy in my deployment pipeline, and a terraform config that looks something like this:


# Gets the CURRENT task definition from AWS, reflecting anything that's been deployed
# outside of Terraform (ie. CI builds).
data "aws_ecs_task_definition" "task" {
  task_definition = "${aws_ecs_task_definition.main.family}"
}

# "Dummy" application for initial deployment
data "aws_ecr_repository" "sample" {
  name = "sample"
}

# ECR Repo for our actual app
data "aws_ecr_repository" "main" {
  name = "${var.ecr_name}"
}

resource "aws_ecs_task_definition" "main" {
  family = "${var.name}"
  task_role_arn = "${module.iam_roles.ecs_service_deployment_role_arn}"

  container_definitions = <<DEFINITION
[
  {
    "name": "${var.name}",
    "image": "${data.aws_ecr_repository.sample.repository_url}:latest",
    "essential": true,
    "portMappings": [{
      "containerPort": ${var.container_port},
      "hostPort": 0
    }]
  }
]
DEFINITION
}

resource "aws_ecs_service" "main" {
  name = "${var.name}"
  cluster = "${var.cluster}"
  desired_count = 2
  task_definition = "${aws_ecs_task_definition.main.family}:${max("${aws_ecs_task_definition.main.revision}", "${data.aws_ecs_task_definition.task.revision}")}"
  iam_role = "${module.iam_roles.ecs_service_deployment_role_arn}"

}

During the initial deployment, Terraform deploys an "empty" container. When the CI pipeline runs, ecs-deploy creates a new task definition revision with the newly-built image/tag, and updates the service accordingly.

Terraform recognizes these new deployments via data.aws_ecs_task_definition.task, and doesn't attempt to overwrite them.

HOWEVER, if other parts of the task definition change, Terraform will redeploy the sample application, as it'll try to create a new revision of the task definition (using the config containing the sample application). Hypothetically, data.aws_ecs_container_definition could be used to pull the image of the currently-active task definition. However, I haven't been able to figure out a way to use this that doesn't create a circular dependency or result in a chicken/egg problem during the initial deployment (ie. the data source is looking for a task definition that hasn't been created yet):

data "aws_ecs_task_definition" "task" {
  task_definition = "${aws_ecs_task_definition.main.family}"
}

data "aws_ecs_container_definition" "task" {
  task_definition = "${data.aws_ecs_task_definition.task.id}"
  container_name  = "${var.name}"
}

resource "aws_ecs_task_definition" "main" {
  family = "${var.name}"
  task_role_arn = "${module.iam_roles.ecs_service_deployment_role_arn}"

  container_definitions = <<DEFINITION
[
  {
    "name": "${var.name}",
    "image": "${data.aws_ecs_container_definition.task.image}",
  }
]
DEFINITION
}

This creates a cycle, and won't work during the initial deployment.

This is very close to my ideal setup. If Terraform somehow supported a "get or create" data/resource hybrid, I'd be able to do almost exactly what I'm looking for.

chriswhelix commented 6 years ago

@schmod you could possibly use a var to create a special bootstrapping mode, i.e. "count = var.bootstrapping ? 0 : 1" to turn on/off the data sources, and coalesce(data.aws_ecs_container_definition.task.*.image, "fake_image") on the task def.
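A rough sketch of that suggestion in 0.11-style syntax (all names are illustrative; the fallback uses element/concat since coalesce can't take a splat list directly):

variable "bootstrapping" {
  default = false
}

# Skip the lookups entirely on the very first apply
data "aws_ecs_task_definition" "task" {
  count           = "${var.bootstrapping ? 0 : 1}"
  task_definition = "${var.name}"
}

data "aws_ecs_container_definition" "task" {
  count           = "${var.bootstrapping ? 0 : 1}"
  task_definition = "${element(concat(data.aws_ecs_task_definition.task.*.id, list("")), 0)}"
  container_name  = "${var.name}"
}

# ...and inside container_definitions, fall back to a placeholder image:
#   "image": "${element(concat(data.aws_ecs_container_definition.task.*.image, list("fake_image")), 0)}"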

I feel like if you're going to manage a particular resource with terraform, it's really best to make all modifications to it using terraform, though. If you solve this issue for container images, you're just going to have it again for service scaling, and again for environment changes, and again for anything else ecs-deploy does behind terraform's back.

What we really need is good deployment tools that work with terraform instead of around it.

schmod commented 6 years ago

Given the scope of what Terraform is allowed to do to my AWS resources, I'm rather apprehensive about running it in an automated/unmonitored environment. On the other hand, I can control exactly what ecs-deploy is going to do.

Infrastructure deployments and application deployments are very different in my mind. There's a fairly large and mature ecosystem around the latter, and I don't think that Terraform should need to reinvent that wheel. It should merely provide a configuration interface to tell it the exact set of changes that I expect those external tools to make.

We already have a version of that in the form of the ignore_changes lifecycle hook. My problem could also be solved if we supported container_definition as a 1st-class citizen (similar to aws_iam_policy_document) and allowed something like ignore_changes=["container_definition.image"].

chriswhelix commented 6 years ago

@schmod isn't the real issue what your build agent is permissioned to do? If your build agent has least privileges for the changes you actually want it to make, it shouldn't matter which tool makes them.

I agree that the interface between terraform and existing deployment tools seems like a generally awkward area. We've dealt with that mostly by just writing our own deployment scripts, in conjunction with what ECS provides out of the box. I'm not sure it's a problem that's solvable solely by changes to terraform, though; in this case, the fundamental problem is that there's no clean divide between the "infrastructure" part of ECS and the "application" part of ECS. That's really Amazon's fault, not terraform's.

There is a clean boundary at the cluster level -- i.e. it would be easy to have terraform manage all the backing instances for an ECS cluster, and another tool manage all the services and tasks running on the cluster. If your basic philosophy is a strong divide between "infrastructure" and "applications", it seems like drawing that line right through the middle of a task definition creates much too complicated a boundary to easily manage.

schmod commented 6 years ago

Right. The problem is that (in my use-case, and probably most others) an application deployment should change exactly one parameter on the task definition (image).

It's difficult to draw a line around the task definition, however, because it contains a lot of other configuration that I'd really prefer to remain static (and managed by Terraform). This makes it unattractive to draw a clean boundary at the cluster level (and also leaves both your Service and Task Definition completely unmanaged by Terraform).

As I mentioned earlier, the ignore_changes flag has been used elsewhere to help accommodate similar use-cases, and there's probably room to build out support for that in a way that shouldn't require fundamentally changing how Terraform works.

dev-head commented 6 years ago

we share the same use case as most people are reporting here.

Our deployments are uniquely tagged, which requires a new task definition to update the ECS service on each deployment. This happens outside of Terraform's control for a variety of reasons that aren't important to the issue at hand.

Seems like we need the ability to ignore changes on the aws_ecs_service resource; we can't do that right now because TF doesn't support interpolations in lifecycle blocks and this resource is part of a shared module.
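For context, the static form of that (which works when you can hard-code it rather than interpolate it) would look something like this sketch:

resource "aws_ecs_service" "app" {
  # ...

  lifecycle {
    # Let the external deploy tool move the service to new task definition revisions
    # without Terraform trying to put the old revision back.
    ignore_changes = ["task_definition"]
  }
}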

damscott commented 6 years ago

I worked around this by using a bash script in an External Data Source to return the current image for the container definition. If the script gets an error looking up the task definition then it assumes this is the initial infrastructure deployment and it uses a default value.

resource "aws_ecs_task_definition" "task" {
  family = "${var.app}-${var.env}"
  task_role_arn = "${aws_iam_role.app_role.arn}"
  container_definitions = <<JSON
[
  {
    "name": "${var.app}",
    "image": "${aws_ecr_repository.app_repo.repository_url}:${data.external.current_image.result["image_tag"]}"
  }
]
JSON
}

data "external" "current_image" {
  program = ["bash", "${path.module}/ecs-get-image.sh"]
  query = {
    app = "${var.app}"
    cluster = "${var.cluster_id}"
  }
}

ecs-get-image.sh:

#!/bin/bash

# This script retrieves the container image running in the current <app>-<env>
# If it can't get the image tag from AWS, assume this is the initial
# infrastructure deployment and default to "latest"

# Exit if any of the intermediate steps fail
set -e

# Get parameters from stdin
eval "$(jq -r '@sh "app=\(.app) cluster=\(.cluster)"')"

taskDefinitionID="$(aws ecs describe-services --service $app --cluster $cluster | jq -r .services[0].taskDefinition)"

# Default to "latest" if taskDefinition doesn't exist
if [[ -n "$taskDefinitionID" && "$taskDefinitionID" != "null" ]]; then
  taskDefinition="$(aws ecs describe-task-definition --task-definition $taskDefinitionID)"
  containerImage="$(echo "$taskDefinition" | jq -r .taskDefinition.containerDefinitions[0].image)"
  imageTag="$(echo "$containerImage" | awk -F':' '{print $2}')"
else
  imageTag="latest"
fi

# Generate a JSON object containing the image tag
jq -n --arg imageTag "$imageTag" '{"image_tag":$imageTag}'

exit 0

It triggers a new task definition in Terraform when anything in the container_definition besides the image is changed, so we can still manage memory, CPU, etc. from Terraform, and it plays nicely with our CI (Jenkins), which pushes new images to ECR and creates new task definitions pointing to those images.

It may need some reworking to support running multiple containers in a single task.

Edit:

If you are using the same image tag for every deployment (e.g. "latest", "stage") then this will revert to whatever task definition is in the state file. It doesn't break anything, but it is confusing. A workaround is to create an external data source, similar to this one, that returns the task definition currently running in AWS to the aws_ecs_service if the image tag hasn't changed.

Edit 2: I updated the script and tf file to also return the task definition revision number. This lets us use a ternary on aws_ecs_service.task_definition to always use the most current revision, eliminating the issue where it rolled back the task definition if you always use the same image tag. I put the updated code in a gist: https://gist.github.com/damscott/9da8f2e623cac61423bb6a05839b10a9

This still does not support multiple containers in a single task definition.

I also want to say thanks to endofcake; I looked at your python version and took a stab at rewriting my code in python. I learned a lot, but ultimately stuck with bash because it's less likely to introduce dependency issues.

endofcake commented 6 years ago

I've also used an external data source as a workaround. The main difference is that it's written in Python, supports multiple containers in the task definition, and does not fall back to latest (that's left to Terraform).

The script is here: https://gist.github.com/endofcake/4ea2ac5c030a37965b65c7591c83a047

Here's a snippet of Terraform configuration that uses it:

data "external" "active_image_versions" {
  program = ["python", "/scripts/get_image_tags.py"]

  query = {
    cluster_name = "${data.terraform_remote_state.ecs.ecs_cluster_id}"
    service_name = "${var.app_name}"
  }
}

<...>
data "template_file" "task_definition" {
  template = "${file("task_definitions/sample.tpl")}"

  vars {
    # lookup the image in the external data source output and default to 'latest' if not found
    app_image              = "${aws_ecr_repository.sample.repository_url}:${lookup(data.external.active_image_versions.result, var.app_name, "latest")}"
    proxy_image            = "${aws_ecr_repository.sample.repository_url}:${lookup(data.external.active_image_versions.result, var.proxy_name, "latest")}"
  }
}

This solved the problem of Terraform trying to reset the image in the task definition to the one it knew about. However, after an app deployment that happens outside of Terraform, it still detects changes in the task definition the next time it runs. It then creates a new task revision, which triggers a bounce of the ECS service - essentially a redeployment of the same image version. I could find no way to prevent it from doing this so far.

endofcake commented 6 years ago

After some investigation it looks to me like the problem is caused by Terraform not knowing about the new task revision.

Say, the last revision it knows about is 36. This is the revision stored in its remote state; it's also the revision used by the ECS service as far as Terraform is concerned. The currently active revision is meanwhile 38, and it uses a new Docker image. With workarounds like the above, Terraform is able to grab this image version by describing the current ECS task definition, but it then tries to create a new task revision with it, which in turn triggers a redeployment of the ECS service.

This lack of clear separation between infrastructure and application deployments turns out to be rather problematic, and I'm not sure how we can work around that.

mboudreau commented 6 years ago

How has this not been resolved yet? ECS has been around for a while and CI deployments outside of Terraform seem like standard operating procedure, and yet here I am still trying to get this new deployment working...

endofcake commented 6 years ago

See also this approach, which looks more promising: https://github.com/terraform-providers/terraform-provider-aws/pull/3485

codergolem commented 6 years ago

Hi everyone,

Sorry, but I am struggling to understand the problem most people are having, namely: why do you want to avoid a new task definition revision being created when the image has changed? Isn't that the standard way of deploying a new image to ECS, or how are you doing it otherwise?

endofcake commented 6 years ago

@codergolem, it's not about avoiding the new task definition; it's about making Terraform play nicely with changes that happen outside of Terraform, namely application deployments in ECS. Terraform is an infrastructure management tool and just doesn't cut it when it comes to application deployments, even if we bolt wait-conditions onto it. This really looks more like a problem with the AWS API than with Terraform, but so far I see no way to resolve this impedance mismatch cleanly.

mboudreau commented 6 years ago

@codergolem To put @endofcake's reply into context, let me provide our example:

We do it this way for several reasons:

  1. The code doesn't really have to know much about the infrastructure, just the ECR name, the cluster name and the task to update.
  2. The terraform template doesn't have to care about the code or which version is considered the "latest" Docker image in ECR.
  3. It's simpler to have a single process deploy a new version, instead of having one process create the new version and another process update the task with the latest version.
  4. Security: if you want to deploy this in a single process using Terraform, the AWS user in use on TravisCI (or any CI server) would need some fairly open permissions, which is a massive attack vector. I'd much rather leave the terraform apply step on a developer's computer, where human interaction is required for such high privileges, and give the CI server a user with very simple permissions (i.e. ECR push, task update; see the policy sketch after this comment).

For these reasons, it's very difficult to use Terraform with a CI server when you want to specify the task definition structure within Terraform, which I would argue is needed, since it needs references to the role created for the task and any other references used for the infrastructure.
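Purely as an illustration of point 4, a minimal CI policy along those lines might look roughly like this sketch (the action list is an assumption about what a push-and-update pipeline needs, and the resources would normally be scoped far more tightly than "*"):

data "aws_iam_policy_document" "ci_deploy" {
  statement {
    sid = "PushImageToEcr"
    actions = [
      "ecr:GetAuthorizationToken",
      "ecr:BatchCheckLayerAvailability",
      "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart",
      "ecr:CompleteLayerUpload",
      "ecr:PutImage",
    ]
    resources = ["*"]
  }

  statement {
    sid = "UpdateEcsService"
    actions = [
      "ecs:DescribeServices",
      "ecs:DescribeTaskDefinition",
      "ecs:RegisterTaskDefinition",
      "ecs:UpdateService",
      "iam:PassRole",
    ]
    resources = ["*"]
  }
}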

endofcake commented 6 years ago

We have a setup which is similar to what @mboudreau described (although we run Terraform on CI, not on dev machines - but the infrastructure pipeline is completely different to how we deploy the apps).

Here's a blog post about our use case https://devblog.xero.com/ci-cd-with-jenkins-pipelines-part-1-net-core-application-deployments-on-aws-ecs-987b8e032aa0.

gregharvey commented 6 years ago

Has anyone else tried the 'taint' approach described by @JDiPierro above? I've tested it a few times and observed it closely, and I'm sure it destroys all running tasks and then brings up new ones. He said 'ECS waits for them to become healthy and then kills the old containers', but that's not what I'm seeing. Is there some additional config needed?

endofcake commented 6 years ago

@gregharvey, this may depend on how you've configured deployment_maximum_percent and deployment_minimum_healthy_percent for your service, I'm guessing?

mboudreau commented 6 years ago

@gregharvey it doesn't matter, I don't want to have to taint multiple resources just for my terraform project to work every time I use it.

menego commented 6 years ago

This solved it for me: https://forums.aws.amazon.com/thread.jspa?threadID=257234. Basically I had some placement strategies and constraints that prevented the service from rolling in the new tasks gracefully (draining the old, installing the new), hence causing the [INACTIVE] issue you are all having. In the end, just adding the following to the service resource definition solved the problem:

deployment_maximum_percent         = 100
deployment_minimum_healthy_percent = 50

Esya commented 6 years ago

@menego This is not the issue that we're discussing here. We're discussing how we can have a nice workflow where something external to Terraform can create new task definitions without Terraform destroying them the next time it runs, basically.

rjurney commented 6 years ago

I agree that there is an unmet need here and something should be done that hasn't been, although I'm not sure what that should be. Full support for the task definition in Terraform would be nice.

DenisBY commented 6 years ago

I use the ecs-deploy script to force a re-deployment.

Esya commented 6 years ago

Well, on my end we've switched to Terraform Enterprise for a variety of reasons, and that actually fixed the issue for us, as we're now changing the Docker image through the TF Enterprise API and applying directly through the API as well.

dmitrye commented 6 years ago

Separation of responsibilities: Terraform is trying to manage the task revision of the ECS tasks that are running, while the application CI/CD pipeline is deploying multiple times, causing the task revision to increase (something that can't be avoided with Fargate). Terraform loses sight of the task revision and coughs up blood. There are a couple of examples where people have bolted on scripts (Python, etc.) to check the current task revision in ECS and then set it as a variable for TF to use. Each person seems to have experienced some side effects, but it's the closest thing to an answer I've seen in this thread so far.

Switching to Enterprise is great if you have tons of $$$$ but not everyone is on that path.

While ecs-deploy is a good tool and one that I actually use in my pipeline, it's not responsible for maintaining the state of the infrastructure so that's not really an answer either.

cwar commented 5 years ago

With the 1.51.0 enhancement (resource/aws_ecs_task_definition: Support resource import) I whipped up a little script that iterates through the ECS task definitions that TF knows about and checks whether there is a newer task revision via the AWS API. If there is, the script removes the task definition from the Terraform state file and imports the newer revision.

I want to reiterate that this script manipulates the state file via the TF CLI and is probably somewhat dangerous depending on your use case. I'm also not very good at bash scripting. But, it works and it simplifies my workflow quite a bit. Using this in conjunction with landscape helps me zero in on exactly what is changing when I apply the terraform plan. Use at your own risk.

#!/usr/bin/env bash

# get all the current TF-managed ECS task definitions for the cwd state file
TASKS=$(terraform state list | grep aws_ecs_task_definition.task)

for task in $TASKS; do
    # cut out the task name
    TASK_NAME=$(echo $task | cut -d'.' -f2)
    echo "checking $TASK_NAME"

    # grep the TF revision number
    TF_REVISION=$(terraform state show $task | grep revision | grep -Eo '[0-9]{1,}')

    # get the full ARN for the ecs task without the revision
    TASK_ARN=$(terraform state show $task | grep -w task-definition | cut -d'=' -f2 | cut -d':' -f1-6)

    # get the revision of the latest active task from the aws api
    AWS_REVISION=$(aws ecs describe-task-definition --task-definition $TASK_ARN | jq .taskDefinition.revision)

    if [ $TF_REVISION != $AWS_REVISION ]; then
        echo "$TASK_NAME:"
        echo "TF_REVISION is $TF_REVISION"
        echo "AWS_REVISION is $AWS_REVISION"
        echo "does not match, removing $task from the TF state file and importing revision $AWS_REVISION"
        terraform state rm $task
        terraform import $task $TASK_ARN:$AWS_REVISION
    fi
done

maartenvanderhoef commented 5 years ago

Hi everyone, I've created a Lambda to overcome the problem with bootstrapping. The Lambda is configured at bootstrap time and can be used as a data source directly. It's not perfect, but it works. I've explained the drift detection here.

daniel-krug commented 5 years ago

Here is a terraform-only solution similar to the scripts above. It lacks support for multiple images in one task definition and also has some issues with [INACTIVE] task definitions, but maybe it helps others as a starter.

Toggle bootstrapping if the service is new; afterwards, set this to false.

variable "bootstrapService" {
  description = "set to true if you're creating a new service"
  default     = false
}

Fetch the current container definition (image) and service definition (desired_count).

data "aws_ecs_container_definition" "container-definition" {
  count           = "${var.bootstrapService ? 0 : 1}"
  container_name  = "${var.serviceName}"
  task_definition = "${var.serviceName}"
}

data "aws_ecs_service" "ecs-service" {
  count        = "${var.bootstrapService ? 0 : 1}"
  cluster_arn  = "arn:aws:ecs:<aws_region_here>:<your_accountid_here>:cluster/${var.clusterName}"
  service_name = "${var.serviceName}"
}

Make the task definition's image and the desired_count accessible for later use.

locals {
  # override passed imagedefinition with currently deployed one if codepipeline has changed it in the meantime
  taskdefinitionImage = ["${var.ecrRepo}", "${data.aws_ecs_container_definition.container-definition.*.image}", "dummyentry"]

  # override passed desired_count with current desired_count if autoscaling has changed it in the meantime
  desired_count = ["${var.desired_count}", "${data.aws_ecs_service.ecs-service.*.desired_count}", "dummyentry"]
}

Template for the task-definition.

data "template_file" "task-definition" {
  template = "${file("${var.task_definition}")}"

  vars {
    image             = "${var.bootstrapService ? local.taskdefinitionImage[0] : local.taskdefinitionImage[1]}"
   <other_stuff_here>
  }
}

aws_ecs_service resource.

resource "aws_ecs_service" "ecs-service" {
  desired_count                      = "${var.bootstrapService ? local.desired_count[0] : local.desired_count[1]}"
  <other_stuff_here>
}

Best regards

bploetz commented 5 years ago

The new target sets feature (https://aws.amazon.com/about-aws/whats-new/2019/03/aws-fargate-and-amazon-ecs-support-external-deployment-controlle/ addressed in #8131 and #8133) helps solve this problem, no?

peterdeme commented 5 years ago

Unfortunately I join those who need a custom script to achieve this. Since the outcome turned out nearly perfect, I'm sharing mine. It's for Windows though (a PowerShell script).

I want to support 3 use cases (like 99% of people here, I guess):

Lookup-TaskDefinition.ps1

param ($taskDefName)

$taskDef = Invoke-Expression "aws ecs describe-task-definition --task-definition $taskDefName" | ConvertFrom-Json

if ($LASTEXITCODE -ne 0 -or $null -eq $taskDef)
{
    return "{ ""image"": """" }"
}

$image = $taskDef.taskDefinition.containerDefinitions[0].image

return "{""image"": """ + $image + """}"

main.tf

resource "aws_ecs_service" "svc" {
    name = "${var.name}-svc"
    cluster = "${var.clusterId}"
    desired_count = "${var.desired_count}"
    task_definition = "${aws_ecs_task_definition.taskdef.arn}"
}

resource "aws_ecs_task_definition" "taskdef" {
    family = "${local.familyName}",
    container_definitions = <<DEFINITION
[
  {
    "name": "${var.name}",
    "image": "${coalesce(local.imageForDeployment, data.external.lookupcurrent.result.image, "bootstrapped")}",
    "memory": ${var.mem_hardlimit},
    "memoryReservation": ${var.mem_softlimit},
    "environment": [{
      "name": "${var.environment_key}",
      "value": "${var.environment_value}"
    }],
    "essential": true,
    "portMappings": [
       {
          "protocol": "tcp",
          "containerPort": ${var.container_port}
        }
    ]
  }
]
DEFINITION
}

data "external" "lookupcurrent" {
  program = ["Powershell.exe", "${path.root}\\modules\\cluster\\scripts\\Lookup-TaskDefinition.ps1 -taskDefName ${local.familyName}"]
}

locals {
  familyName = "${var.name}-${terraform.workspace}"
  imageForDeployment = "${var.image_tag == "" ? "" : "${var.repository_url}:${var.image_tag}"}"
}

Important! data.external.lookupcurrent and aws_ecs_task_definition.taskdef cannot reference each other because there would be a cyclic dependency. The way I resolved it: I introduced a local variable called local.familyName, and both the task definition and my lookup script use it.

Running app deployment: terraform plan -var image_tag=[YourGitHash]

Running non-app deployment: terraform plan

Running app-deployment and infrastructural change at the same time: terraform plan -var image_tag=[YourGitHash]

coalesce is a great function that serves me well for our use case.

And by the way, if data sources could return default values, this hacky PowerShell stuff wouldn't be needed. I could imagine something like this:

data "aws_ecs_container_definition" "lookupcurrent" {
  task_definition = "${local.familyName}"
  container_name  = "${loca.containerName}"
  errorHandling = {
      suppressError = "true",
      fallbackToEmptyObject = "true"
  }
}

ian-axelrod commented 5 years ago

The new target sets feature (https://aws.amazon.com/about-aws/whats-new/2019/03/aws-fargate-and-amazon-ecs-support-external-deployment-controlle/ addressed in #8131 and #8133) helps solve this problem, no?

This feature looks wonderful, but I do not see how it solves the issue in question. If you create the service and the task set in Terraform, you will still need to create the task definition. If you have a non-Terraform process that updates that task definition, it will do so by creating a new task set, scaling the old task set down, and then updating the service's primary task set to that new task set. Terraform will see this difference, so you end up with the same issue as before.

I am going to theorize from this point onwards. Apologies in advance for the wall of text, and thank you to any that read it and comment. I want to get these thoughts out somewhere that isn't an electronic notepad or blog no one reads.

I do not believe there is a clean way to handle the issue in question without a clean separation of infrastructure and application code attributes, and that is difficult to do unless AWS alters the task definition schema. The approach that multiple folks in this thread take is to update only the image attribute, but I think that there are other task definition attributes that make sense to update outside of Terraform. What about environment variables that are meaningful only to the service, that is, variables that do not reference a Terraform-managed resource? How about docker labels? Command and entrypoint? Secrets?.... that's an interesting one.

I have trouble defining the point at which infrastructure ends for service and task definition attributes. I understand infrastructure as an ontology that does not include application code or things that are meaningful only to that code. adb.sdjfsijfzxc.us-east-2.rds.amazonaws.com is meaningful because it refers to a database. It may only be of use to one service, but it was not designed with that in mind. However, the environment mapping DATABASE_HOST=adb.sdjfsijfzxc.us-east-2.rds.amazonaws.com is only meaningful to the code that interprets it. I can delete or update the mapping and the explicit impact is limited to my own service. If I, as a developer, want to use the mapping MYAPP_DATABASE_HOST=adb.sdjfsijfzxc.us-east-2.rds.amazonaws.com instead, I should be able to do so without thinking about the impact on infrastructure. Maybe I misunderstand infrastructure itself. If so, everything I say from here onwards is invalid.

To me, a service and task as ECS-provided abstractions are infrastructure components, but the attributes of each blur the line. A service that maps to two tasks constrained to 3 EC2 instances of a certain type, with the tasks further constrained in their access to other resources (memory, databases, ...), is infrastructure. The task definition's cpu, memory, memoryReservation, and network setting attributes are all attributes of the task definition as an infrastructure component. These have an explicit relationship to other infrastructure components, such as the EC2 instance, though I don't think Terraform enforces these relationships. Same with the task (execution) role arn. It has explicit relationships with other infrastructure components. environment, however, is not an infrastructure attribute, even if it contains references to Terraform-managed resources. image does not strike me as an infrastructure attribute. As for services, I think their task definition revision attribute is not an attribute of an infrastructure component, because task definitions are 'hybrid' abstractions.

I think there should be a 'task constraints', 'task parameters', or 'task meta' concept (not task sets), and that is what services point to and Terraform manages. This entity would define resources available to individual containers and resources that all containers may share. It would place constraints on the containers, such as max memory consumption, and docker security options (e.g., no-new-privileges). What it would not do is specify what the containers are. During deployment, you create a task definition* (environment [which may reference resources that the task meta exposes] , image, command, anything meaningful only to your application) and connect it to the appropriate task constraints... like filling in available slots, or you just update an existing task definition without changing its attachments. This setup is more complex on the surface; however, if it correctly models my understanding of the infrastructure-service boundary better than what exists now, and it eliminates the issues in question, then it reduces complexity. At least for terraformers ;).

Anyway, I am rambling. I am a novice Terraformer, so perhaps I have completely misunderstood the tool. Thanks again to any that actually read this.

bploetz commented 5 years ago

I just implemented blue/green deployments via CodeDeploy for my ECS service managed by Terraform, and just wanted to note what worked for me in case it helps someone else.

  1. The aws_ecs_task_definition is set up to point at the image tagged "latest", such that if we ever need to destroy/re-create the resource, it's always pointed at the latest and greatest image.
  2. Our CI/CD pipeline triggers a CodeDeploy deployment using the AWS CLI tools.
  3. The aws_alb_listener resource is configured to ignore changes to default_action, for when the target group is switched between blue and green during a deployment.
resource "aws_alb_listener" "my_service" {
  lifecycle {
    ignore_changes = ["default_action"]
  }
}
  4. The aws_ecs_service resource is configured to ignore changes to load_balancer, task_definition, and service_registries (if you're using service discovery), as these all get touched during a CodeDeploy deployment.
resource "aws_ecs_service" "my_service" {
 deployment_controller {
    type = "CODE_DEPLOY"
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = ["load_balancer","task_definition","service_registries"]
  }
}

Once I made these changes on the Terraform side, if I run terraform plan after a CodeDeploy deployment, Terraform says there's nothing to do for the service in question.

HTH.

beephotography commented 5 years ago

@bploetz That really sounds like a good solution, but I see one pitfall: the terraform.tfstate file is not updated with the new revisions. You may run into trouble if (in some situation) Terraform needs to know the current revision. What do the others think, are my concerns reasonable? I'm really not sure :thinking:

bploetz commented 5 years ago

@beephotography Yup, that's basically the decision you're making with this approach: Terraform can create/destroy resources, but after creation they belong to someone else to manage. I can't think of any situations where the out-of-date state file would be a problem given the resources in question, but even if I needed to change something fundamental about the setup such that I needed an up-to-date state file, at that point I would likely just bring up new resources in parallel with the changes, and then deprecate/destroy the old resources.

I may be missing something obvious with this approach, and if so, I'm sure someone will set me straight. :)

jfirebaugh commented 5 years ago

@endofcake Thanks for outlining the central issue:

Say, the last revision it knows about is 36. This is the revision stored in its remote state; it's also the revision used by the ECS service as far as Terraform is concerned. The currently active revision is meanwhile 38, and it uses a new Docker image. With workarounds like the above, Terraform is able to grab this image version by describing the current ECS task definition, but it then tries to create a new task revision with it, which in turn triggers a redeployment of the ECS service.

Everything else can be worked around with strategic use of data "aws_ecs_task_definition" and data "aws_ecs_container_definition", but I cannot figure out how to get terraform not to force creation of useless new task revisions.

Your blog post doesn't seem to explain how this is solved either. Did you figure it out?

alevinetx commented 5 years ago

Has anyone found a viable solution to this, short of using external scripts to update the template in place? I've tried using the lifecycle ignore_changes "*" on ecs_task_definition, but either that's not far-reaching enough ("As an example, this can be used to ignore dynamic changes to the resource from external resources. Other meta-parameters cannot be ignored."), or it's not fully implemented; I wouldn't quite know where to look to see.

FWIW, we're still on 0.11. If moving to 0.12 will help us reach this goal, then that's what we'll prioritize.

Any insight is greatly appreciated.

jfirebaugh commented 5 years ago

@alevinetx We recently migrated from 0.11 to 0.12. Lots of nice improvements, but unfortunately this issue isn't one of them.

riwiki commented 4 years ago

After researching a lot, I found the following to be a promising approach within Terraform alone. The goal: if the Terraform scripts are executed locally, leave the image version/tag unchanged (thus not updating the ECS service's task definition), but allow a specific (new) image version to be specified for deployment, which then updates the task definition, e.g. for use in CI/CD or when rolling back to a specific version manually.

variable "app_version" {
  type = number
  default = 0
}

data "aws_ecs_task_definition" "active" {
  task_definition = local.task_family
}

data "aws_ecs_container_definition" "active" {
  task_definition = data.aws_ecs_task_definition.active.id
  container_name = "containername"
}

locals {
  task_family = "task name"
  # digest is the image's version/tag
  new_version = var.app_version > 0 ? var.app_version : data.aws_ecs_container_definition.active.image_digest
}

resource "aws_ecs_service" "apps-service" {
  task_definition = "${aws_ecs_task_definition.task.arn}"
  ...
}

resource "aws_ecs_task_definition" "task" {
  family                = local.task_family
  container_definitions = "${data.template_file.container_definitions.rendered}"
}

data "template_file" "container_definitions" {
  template = file("${path.module}/templates/task-definition.json")
  vars = {
    app_version = "${local.new_version}"
  }
}

This does not cover the case where no task definition is present initially -- as I have no stage to test that currently -- but maybe someone could add sufficient logic. Also, I am not sure what Terraform version this requires. If you have version >= 0.12.7 you can use the regex function to handle alphanumeric image digests. I hope this gives you all the necessary pointers; otherwise, please poke me and I will elaborate a bit.
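For example, something along these lines could pull the tag or digest portion out of the active image string (a sketch that assumes the image reference ends in ":<tag>" and reuses the data source defined above):

locals {
  # Everything after the last ":", e.g. "abc123" from ".../my-app:abc123"
  active_image_tag = regex("[^:]+$", data.aws_ecs_container_definition.active.image)
}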

Approaches that did not work, and why

Side note

I was somehow confused by this conversation. Some people seem to only want to make sure that Terraform updates the ECS service definition with the latest task definition revision when that task definition is changed. For me, Terraform 0.12 handled this just fine: whenever I changed the task definition, the resource needed to be replaced and the service needed to be updated, because the task resource's ARN would change. I performed deploys only via Terraform, so there was e.g. no manual deployment via the AWS console. Maybe that was the aspect I missed.

Sytten commented 4 years ago

The problematic code for us is really this line: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_ecs_task_definition.go#L417. I strongly feel that Terraform should fetch the latest revision of the task definition and then compare it, not just declare the resource destroyed.

EDIT: The more I look at it, the more it seems to me the current way is clearly wrong. If someone unregisters a revision, it doesn't mean the resource has been deleted or is not present anymore.

nickdani commented 4 years ago

I'm facing the same issue.

In short:

The problem: after a few deployments, the old task definitions created by Terraform become INACTIVE and Terraform tries to create a new task definition.

Based on all the comments, I don't see any elegant solution for this, except having a "bootstrap" flag which is disabled as soon as the initial task definition has been created; although you still have a problem if you need to update the task definition in Terraform (for example, to add env variables), and then it becomes messy.

I agree with @Sytten, and I think it would be great if it were possible to somehow mark the task definition to use the latest revision, or to mark the resource as "create_once".

As some people have pointed out, the current implementation simply makes it impossible to use Terraform + ECS + CI/CD.