aws_ecs_capacity_provider: Tag AmazonECSManaged doesn't get assigned to initial EC2 instances

Leonidimus commented 4 years ago

We started using AWS Capacity Providers and now see the following issue: the ASG gets created and linked with the Capacity Provider just fine, but it never scales down. Amazon support spotted that some EC2 instances are not tagged with the AmazonECSManaged tag, which is required for instances to properly register with a Capacity Provider. All the untagged instances are the ones launched at ASG creation time; subsequently launched EC2 instances are tagged properly.

I think this happens because the ASG is created and populated with EC2 instances first and only then linked with a Capacity Provider, which would leave the already launched instances untagged. The proper sequence would be to create the ASG with min_size=0, link it with the Capacity Provider, then set min_size=N.

The consequences are that the ASG never scales down and the CapacityProviderReservation CloudWatch metric is calculated incorrectly.
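
The sequence proposed above (min_size=0, link, then raise) could be sketched roughly as a two-step apply. This is only an illustration of the ordering and has not been tested against this bug; the variable `asg_min_size` is hypothetical and not part of the configuration in this report.

```hcl
# Hypothetical variable, not part of the original configuration.
# Step 1: apply with the default of 0, so no instances are launched
#         before the capacity provider is linked to the ASG.
# Step 2: apply again with the real minimum, for example:
#         terraform apply -var="asg_min_size=2"
variable "asg_min_size" {
  type    = number
  default = 0
}

resource "aws_autoscaling_group" "service_asg" {
  # ... remaining arguments as in the configuration in this report ...
  min_size = var.asg_min_size
  max_size = var.autoscale_max
}
```

The drawback is that every fresh environment needs two applies, which may not fit an automated pipeline.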

Terraform Version

0.12.19

Affected Resource(s)

aws_ecs_capacity_provider
aws_autoscaling_group

Terraform Configuration Files

resource "aws_ecs_service" "ecs_service" {
  name            = var.service_name
  cluster         = aws_ecs_cluster.service_ecs_cluster.id
  task_definition = aws_ecs_task_definition.service_ecs_task_definition.arn
  iam_role        = aws_iam_role.ecs_service_role.arn
  desired_count   = var.autoscale_ecs_desired
  depends_on = [
    aws_iam_role_policy.ecs_service_role_policy,
    aws_alb_listener.alb-https,
    aws_elasticache_replication_group.redis_rg_1,
  ]

  lifecycle {
    ignore_changes = [desired_count]
  }

  ordered_placement_strategy {
    type  = "binpack"
    field = "cpu"
  }

  capacity_provider_strategy {
    capacity_provider = aws_autoscaling_group.service_asg.name
    weight            = 1
    base              = 1
  }

  load_balancer {
    target_group_arn = aws_alb_target_group.api.arn
    container_name   = var.service_name
    container_port   = var.service_port
  }
}

resource "aws_ecs_capacity_provider" "service_cap_prov" {
  name = aws_autoscaling_group.service_asg.name

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.service_asg.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      maximum_scaling_step_size = 5
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 90
    }
  }
}

resource "aws_autoscaling_group" "service_asg" {
  # Name should be dynamic from LC because we want ASG to update when LC updates
  name                  = "${var.service_name}-${var.service_env}-${var.service_version}-${aws_launch_configuration.service_lc.name}"
  min_size              = var.autoscale_min
  max_size              = var.autoscale_max
  health_check_type     = "EC2"
  launch_configuration  = aws_launch_configuration.service_lc.name
  target_group_arns     = [aws_alb_target_group.api.arn]
  protect_from_scale_in = true

  vpc_zone_identifier = split(",", var.aws_internal_subnets)
  lifecycle {
    create_before_destroy = true
  }

  tag {
    key                 = "Name"
    value               = "${var.service_name}-${var.service_env}-${var.service_version}"
    propagate_at_launch = true
  }
  tag {
    key                 = "Department"
    value               = var.tag_department
    propagate_at_launch = true
  }
}

resource "aws_launch_configuration" "service_lc" {
  image_id             = data.aws_ami.service_ami.id # dynamic (latest) image
  instance_type        = var.aws_instance_type
  security_groups      = [aws_security_group.service_security_group.id]
  iam_instance_profile = aws_iam_instance_profile.ecs.name
  key_name             = var.key_name
  user_data            = data.template_file.service_ecs_user_data.rendered
  lifecycle {
    create_before_destroy = true
  }
}

Expected Behavior

ASG scales up and down with Capacity Provider linked

Actual Behavior

ASG scales up but never down because AmazonECSManaged is not assigned to EC2 instances launched when ASG was created.

Steps to Reproduce

Create ASG, ECS service and Capacity Provider with Terraform configuration snippets above

Important Factoids

Terminating untagged EC2 instances manually fixes the issue - ASG starts to scale down. However, it's not a feasible workaround due to a high number of deployments.

peter-boekelheide-ah commented 4 years ago

This seems to be because Terraform doesn't add AmazonECSManaged as a propagated tag to the ASG itself when it links the capacity provider. A workaround is to add the following to your ASG configuration:

resource "aws_autoscaling_group" "service_asg" {
  ...
  tag {
    key                 = "AmazonECSManaged"
    propagate_at_launch = true
  }
  ...
}

Making this change worked for me.

Leonidimus commented 4 years ago

@peter-boekelheide-ah did scale-in work for you? I tried that workaround a couple of weeks ago, and although the AmazonECSManaged tag was assigned to the EC2 instances, the ASG was stuck at Desired=2 with 3 instances actually running, all 3 still having the scale-in protection flag enabled. Maybe something is different in the magical process when the AmazonECSManaged tag is created by the Capacity Provider; we can only guess at the internal logic.

peter-boekelheide-ah commented 4 years ago

@Leonidimus I was able to get scale in and out working. Mind you, my ASG was set to a starting/minimum size of 0, so my use case may have been different to yours. But currently my ASG properly scales with the demands of my cap provider.

One thing that I had to do (not sure if it mattered or if it was just part of my magic chicken dance) was use the actual resource reference to my cap_provider.name as my ecs_cluster definition's capacity_provider. When I first encountered the cycle cluster->cap_provider->asg->launch_template->cluster, I used the literal string name of the cap provider in my ecs cluster resource, and this led to issues with the ASG's target tracking policy not being set up, among other things. I changed my launch template to use the ecs_cluster's string name in its user data instead, which avoids the cycle, and then referred to the cap provider directly in my ecs cluster resource. That seems to have fixed the issue.

Not sure if that helps. YMMV. But I was able to finally get it working after some fiddling and gnashing of teeth.
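
A rough sketch of the cycle-breaking arrangement described in this comment, for readers who hit the same dependency loop. All resource names, variables, and sizes here are hypothetical and not taken from the thread. It assumes a provider version in which `aws_ecs_cluster` still accepts the `capacity_providers` argument (as it did when this thread was written); later provider versions moved this to a separate resource.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "ecs_ami_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

locals {
  # The cluster name lives in a plain string so that the launch template
  # does not need to reference the aws_ecs_cluster resource.
  cluster_name = "example-cluster"
}

resource "aws_launch_template" "example" {
  name_prefix   = "example-"
  image_id      = var.ecs_ami_id
  instance_type = "t3.medium"

  # String name only: no dependency on aws_ecs_cluster, so no cycle.
  user_data = base64encode(<<-EOT
    #!/bin/bash
    echo "ECS_CLUSTER=${local.cluster_name}" >> /etc/ecs/ecs.config
  EOT
  )
}

resource "aws_autoscaling_group" "example" {
  min_size              = 0
  max_size              = 5
  vpc_zone_identifier   = var.subnet_ids
  protect_from_scale_in = true

  launch_template {
    id      = aws_launch_template.example.id
    version = "$Latest"
  }

  # ... tags, including AmazonECSManaged as discussed in this thread ...
}

resource "aws_ecs_capacity_provider" "example" {
  name = "example-cap-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.example.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 90
    }
  }
}

resource "aws_ecs_cluster" "example" {
  name = local.cluster_name

  # Resource reference rather than a literal string, so Terraform creates
  # the capacity provider before the cluster.
  capacity_providers = [aws_ecs_capacity_provider.example.name]
}
```

The dependency chain then runs launch template -> ASG -> capacity provider -> cluster, with nothing pointing back at the cluster.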

bflad commented 4 years ago

Hi @Leonidimus and other folks 👋 Thanks for raising this.

In general, Terraform and the Terraform AWS Provider do not make any presumptions about infrastructure provisioning beyond what is directly configured. Any inherent behaviors or configuration created by layering resources on top of others must usually be accounted for in the Terraform configuration. In this case, the ECS API automatically adds the AmazonECSManaged tag to the Auto Scaling Group when it is associated. The Auto Scaling Group configuration must therefore either include that tag's configuration, so it's available immediately to any initial EC2 Instances and Terraform does not try to remove it later on, or use a workaround such as ignore_changes to prevent Terraform from showing the tag removal as a difference. The latter can potentially cause issues similar to the original report here, so the small documentation note mentioning ignore_changes with the AmazonECSManaged tag will be replaced with the configuration inclusion recommendation for clarity.

The general preference in this case should be pre-configuring the AmazonECSManaged tag within the aws_autoscaling_group resource, so it's propagated automatically to initial EC2 Instances when min size is greater than 0 on creation (as mentioned above), e.g.

resource "aws_autoscaling_group" "example" {
  # ... other configuration, potentially including other tags ...

  tag {
    key                 = "AmazonECSManaged"
    propagate_at_launch = true
  }
}

Any EC2 Instances in the Auto Scaling Group that do not have the tag can, as mentioned above, behave unexpectedly with respect to scaling. The original issue should be resolvable with a configuration update, but we would like to add extra documentation on this matter to the aws_ecs_capacity_provider resource documentation, so I'm going to leave this issue open until those documentation changes are merged.
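
For readers applying the recommendation above, a slightly fuller sketch may help. To my understanding the `tag` block of `aws_autoscaling_group` also expects a `value` argument, which the abbreviated snippet omits. The empty string used here is an assumption, on the reading that the presence of the tag key is what matters. Resource names and sizes are illustrative only.

```hcl
resource "aws_autoscaling_group" "example" {
  # ... other configuration (launch configuration, subnets, etc.) ...
  min_size              = 2 # greater than 0, so initial instances launch at creation
  max_size              = 5
  protect_from_scale_in = true

  # Declared up front so the initial instances receive the tag at launch
  # and Terraform does not plan to remove it after ECS adds it.
  tag {
    key                 = "AmazonECSManaged"
    value               = "" # assumption: empty value; not confirmed in this thread
    propagate_at_launch = true
  }
}

resource "aws_ecs_capacity_provider" "example" {
  name = "example"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.example.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 90
    }
  }
}
```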

ghost commented 4 years ago

This has been released in version 3.0.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

hashicorp / terraform-provider-aws