terraform-aws-modules / terraform-aws-autoscaling

Terraform module to create AWS Auto Scaling resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/autoscaling/aws
Apache License 2.0

iam_role_policies are detached before the autoscaling group is deleted #268

Open chrisbecke opened 1 month ago

chrisbecke commented 1 month ago

Description

Given an autoscaling group that defines lifecycle hooks and attaches a role policy of AutoScalingFullAccess...

  initial_lifecycle_hooks = [
    {
      name                 = "StartupLifeCycleHook"
      lifecycle_transition = "autoscaling:EC2_INSTANCE_LAUNCHING"
    },
    {
      name                 = "TerminationLifeCycleHook"
      lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    }
  ]

  iam_role_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    AWSAutoScalingFullAccess     = "arn:aws:iam::aws:policy/AutoScalingFullAccess"
  }

If the Terraform configuration is destroyed, or if an update triggers a create-before-destroy replacement of the autoscaling group, Terraform destroys the aws_iam_role_policy_attachment resources before the aws_autoscaling_group has been destroyed.

The aws_autoscaling_group waits for the EC2_INSTANCE_TERMINATING lifecycle hook to finish on each instance before it proceeds. If the instances complete their own lifecycle hooks as part of shutting down, their instance role needs an attached policy that allows "autoscaling:CompleteLifecycleAction".

This introduces a race condition where instances have their instance role policies stripped before they can finish any kind of cleanup.
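
For reference, the permission the instances actually need here is narrow. Below is a minimal sketch of a policy that grants only that action instead of AutoScalingFullAccess. The resource names and the policy name are hypothetical, and the `"*"` resource is an assumption made for brevity. Narrowing the policy does not change the ordering problem: the attachment would still be removed before the autoscaling group is gone.

```hcl
# Hypothetical names; a sketch of the single permission the shutdown script relies on.
data "aws_iam_policy_document" "complete_lifecycle_action" {
  statement {
    effect  = "Allow"
    actions = ["autoscaling:CompleteLifecycleAction"]

    # Assumption: not scoped to a specific autoscaling group ARN, for brevity.
    resources = ["*"]
  }
}

resource "aws_iam_policy" "complete_lifecycle_action" {
  name   = "complete-lifecycle-action"
  policy = data.aws_iam_policy_document.complete_lifecycle_action.json
}
```

The resulting `aws_iam_policy.complete_lifecycle_action.arn` could then be passed as a value in `iam_role_policies` in place of the managed policy ARN.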

Versions

Reproduction Code [Required]

module "managers" {
  count  = var.create ? 1 : 0
  source = "terraform-aws-modules/autoscaling/aws"

  name            = local.manager_cluster_name
  use_name_prefix = false

  min_size         = 3
  max_size         = 5
  desired_capacity = 3
  #  wait_for_capacity_timeout = 0
  #  health_check_type         = "EC2"
  vpc_zone_identifier = var.subnet_ids
  security_groups     = var.security_groups
  target_group_arns   = var.target_group_arns

  initial_lifecycle_hooks = [
    {
      name                 = "StartupLifeCycleHook"
      default_result       = "CONTINUE"
      heartbeat_timeout    = 300
      lifecycle_transition = "autoscaling:EC2_INSTANCE_LAUNCHING"
    },
    {
      name                 = "TerminationLifeCycleHook"
      default_result       = "CONTINUE"
      heartbeat_timeout    = 300
      lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
    }
  ]

  # Launch template
  launch_template_name            = "${var.cluster_name}-manager"
  launch_template_use_name_prefix = false
  launch_template_description     = "Swarm Manager Launch Template"
  update_default_version          = true

  image_id          = var.control_plane_ami
  instance_type     = var.control_plane_instance_type
  ebs_optimized     = true
  enable_monitoring = true

  create_iam_instance_profile = true
  iam_role_name               = "${var.cluster_name}-manager"
  iam_role_use_name_prefix    = false
  iam_role_path               = "/ec2/"
  iam_role_description        = "Allow Node to register with SSM"
  #  iam_role_tags = {
  #    CustomIamRole = "Yes"
  #  }

  iam_role_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    AWSAutoScalingFullAccess     = "arn:aws:iam::aws:policy/AutoScalingFullAccess"
  }

  user_data = data.template_cloudinit_config.manager.rendered
}

Steps to reproduce the behavior:

  1. Deploy the autoscaling group
  2. Change the name to force a create_before_destroy

Expected behavior

The instances can react to the lifecycle hook (with a bash script, for example) and respond using the AWS CLI to CONTINUE the lifecycle hook.

Actual behavior

The instances cannot CONTINUE the lifecycle hook; the request fails with a role-or-policy does not exist error.

chrisbecke commented 1 month ago

Thinking about it some more:

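resource "aws_iam_role" "this" {
  name = "my-role"

  # Trust policy that lets EC2 instances assume the role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "this" {
  count      = 1
  policy_arn = "arn:aws:iam::aws:policy/AutoScalingFullAccess"
  role       = aws_iam_role.this.name
}

resource "aws_iam_instance_profile" "this" {
  name = "my-instance-profile"
  role = aws_iam_role.this.name
}

resource "aws_launch_template" "this" {
  # Variables borrowed from the reproduction code above.
  image_id      = var.control_plane_ami
  instance_type = var.control_plane_instance_type

  iam_instance_profile {
    name = aws_iam_instance_profile.this.name
  }

  # Needed so the policies are attached to the role before instances launched
  # from this template need them, and detached only after the template (and
  # the autoscaling group that uses it) has been destroyed.
  depends_on = [aws_iam_role_policy_attachment.this]
}

resource "aws_autoscaling_group" "this" {
  name                = "my-asg"
  max_size            = 1
  min_size            = 1
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id = aws_launch_template.this.id
  }
}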

Without the depends_on, there is a race condition: instances cannot rely on the policy attachments existing while the infrastructure is being created or destroyed.
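
Until the module carries such a dependency itself, one possible workaround is to manage the role, attachments, and instance profile outside the module and add a module-level depends_on. This is a sketch under assumptions, not a confirmed fix: the resource names are hypothetical, and it assumes the module accepts an existing profile through `iam_instance_profile_arn` when `create_iam_instance_profile = false`. Module-level depends_on requires Terraform 0.13 or later.

```hcl
# Hypothetical names throughout; variables are the ones from the reproduction code.
resource "aws_iam_role" "manager" {
  name = "my-manager-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "manager" {
  for_each = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    AWSAutoScalingFullAccess     = "arn:aws:iam::aws:policy/AutoScalingFullAccess"
  }

  role       = aws_iam_role.manager.name
  policy_arn = each.value
}

resource "aws_iam_instance_profile" "manager" {
  name = "my-manager-profile"
  role = aws_iam_role.manager.name
}

module "managers" {
  source = "terraform-aws-modules/autoscaling/aws"

  name                = "my-asg"
  min_size            = 3
  max_size            = 5
  desired_capacity    = 3
  vpc_zone_identifier = var.subnet_ids

  image_id      = var.control_plane_ami
  instance_type = var.control_plane_instance_type

  # Assumption: the module uses this profile instead of creating its own.
  create_iam_instance_profile = false
  iam_instance_profile_arn    = aws_iam_instance_profile.manager.arn

  # Everything in the module is created after, and destroyed before,
  # the policy attachments.
  depends_on = [aws_iam_role_policy_attachment.manager]
}
```

Because Terraform destroys resources in reverse dependency order, the autoscaling group (and its wait on the termination hooks) should finish before the attachments are removed.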

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days