terraform-aws-modules / terraform-aws-autoscaling

Terraform module to create AWS Auto Scaling resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/autoscaling/aws
Apache License 2.0

When many ASGs are created, their ec2 instances are mixed up #122

Closed · Dmitry1987 closed this issue 3 years ago

Dmitry1987 commented 4 years ago

We recently started using this module to create 50-60 ASGs, and the result is that many EC2 instances end up with the wrong tags: groups think they 'own' instances that actually belong to other ASGs, and because they consider someone else's instance "theirs", they never launch their own.

The result of a run is that, out of 64 ASGs with 'min=1/max=1/desired=1' capacity, only 45-50 EC2 instances are created, and which ones are missing is random. It's repeatable. I initially thought it was a race condition in AWS, so we tried to understand it with AWS support, but they pointed out that they saw Terraform API calls assigning tags to EC2 instances after the ASGs were created, and these 'assign tag' calls are the reason the servers and groups get mixed up.

Any idea how this could happen? Does the module run additional steps to tag an EC2 instance? (Why would that be required? As I understand it, the ASG assigns its own tags to the EC2 servers when it spawns them.)

The only solution was to chain the groups sequentially with "depends_on", but that makes the process slow and also makes updates difficult. What would you suggest? Maybe remove the 'tags' completely from these groups and apply them later with a custom script, or with an additional 'terraform apply' pass after all ASGs and servers are already up?

antonbabenko commented 4 years ago

Hi Dmitry!

Please show the code you are using to create at least a couple of the ASGs (specifically, how unique the name, tags, and tags_as_map values are).

Dmitry1987 commented 4 years ago

Hi Anton, thanks for helping each time with my issues and requests, I really appreciate it :)

It's autogenerated from a template, but here are 2 out of the many (they are launched with NLBs and CloudWatch alarms for autoscaling). The words 'something' and 'else' here replace internal algorithm names one-for-one, so the number of words in the tags is exactly as shown below. This run will end up with some ASGs "thinking" they have a server that belongs to them, when in reality it is someone else's server, and that server, when I click on it, doesn't even have the name of the group that points to it (so the ASG shows it "has" a server in its list with an instance ID, but when you click on it, the instance turns out to belong to a completely different ASG from that batch of Terraform resources).


module "algo24_asg" {
  source     = "terraform-aws-modules/autoscaling/aws"
  version    = "3.7.0"
  depends_on = [module.algo23_asg, module.algo24_nlb]
  name       = "service-something-${var.environment}"
  create_asg = var.ocr.min > 0 ? true : false
  # Launch configuration
  lc_name = "algo24-${var.environment}"
  # Do not trigger recreation of all ASG if launch config settings like user-data change
  recreate_asg_when_lc_changes = var.recreate_on_change
  image_id                     = data.aws_ami.ubuntu_algorithms.id
  instance_type                = "m5.large"
  key_name                     = "xxxxx"
  service_linked_role_arn      = "xxxxxx"
  iam_instance_profile         = local.iam_profile
  termination_policies         = ["OldestInstance"]
  health_check_grace_period    = 600
  suspended_processes          = ["ReplaceUnhealthy"]

  security_groups = var.use_feature_vpc ? [data.terraform_remote_state.feature-vpc.outputs.peering_sg_id, data.terraform_remote_state.feature-vpc.outputs.private_sg_id] : [aws_security_group.peering[0].id, aws_security_group.private[0].id]

  root_block_device = [
    {
      encrypted   = true
      volume_size = 100
      volume_type = "gp2"
    },
  ]

  asg_name                  = "algo24-${var.environment}"
  vpc_zone_identifier       = var.use_feature_vpc ? data.terraform_remote_state.feature-vpc.outputs.private_subnets : module.vpc.private_subnets
  health_check_type         = "EC2"
  min_size                  = 1
  max_size                  = 10
  desired_capacity          = null
  wait_for_capacity_timeout = 0
  target_group_arns         = module.algo24_nlb.target_group_arns

  user_data = templatefile("${path.module}/templates/user-data-algorithms.sh.tmpl", merge(
    local.facts_for_user_data_script,
    {
      host_type        = "algo-service-something",
      docker_image     = "algo-service-something-production:latest",
      git_repo         = "none",
      ansible_cicd_dir = "none"
    })
  )

  tags_as_map = merge(
    var.tags,
    map("Ansible_group", "group_service-something_${var.environment}"),
    map("Ansible_service", "service-something"),
    map("Ansible_env", var.environment)
  )
}

resource "aws_autoscaling_policy" "algo25_scale_up" {
  count                  = var.ocr.min > 0 ? 1 : 0
  name                   = "algo25-${var.environment}-scale-up-policy"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 120
  autoscaling_group_name = module.algo25_asg.this_autoscaling_group_name
}

resource "aws_autoscaling_policy" "algo25_scale_down" {
  count                  = var.ocr.min > 0 ? 1 : 0
  name                   = "algo25-${var.environment}-scale-down-policy"
  scaling_adjustment     = -1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 600
  autoscaling_group_name = module.algo25_asg.this_autoscaling_group_name
}

resource "aws_cloudwatch_metric_alarm" "algo25_scale_up" {
  count               = var.ocr.min > 0 ? 1 : 0
  alarm_name          = "algo25-${var.environment}-scale-up-cw-alarm"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "60"
  statistic           = "Average"
  threshold           = "60"

  dimensions = {
    AutoScalingGroupName = module.algo25_asg.this_autoscaling_group_name
  }

  alarm_description = "This metric monitors ec2 cpu utilization"
  alarm_actions     = [aws_autoscaling_policy.algo25_scale_up[0].arn]
}

resource "aws_cloudwatch_metric_alarm" "algo25_scale_down" {
  count               = var.ocr.min > 0 ? 1 : 0
  alarm_name          = "algo25-${var.environment}-scale-down-cw-alarm"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "15"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "60"
  statistic           = "Average"
  threshold           = "10"

  dimensions = {
    AutoScalingGroupName = module.algo25_asg.this_autoscaling_group_name
  }

  alarm_description = "This metric monitors ec2 cpu utilization"
  alarm_actions     = [aws_autoscaling_policy.algo25_scale_down[0].arn]
}

resource "aws_autoscaling_lifecycle_hook" "algo25" {
  count                  = var.ocr.min > 0 ? 1 : 0
  name                   = "algo25-${var.environment}-scale-down-lifecycle-hook"
  autoscaling_group_name = module.algo25_asg.this_autoscaling_group_name
  default_result         = "CONTINUE"
  heartbeat_timeout      = 600
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
}

module "algo25_asg" {
  source     = "terraform-aws-modules/autoscaling/aws"
  version    = "3.7.0"
  depends_on = [module.algo24_asg, module.algo25_nlb]
  name       = "service-something-else-${var.environment}"
  create_asg = var.ocr.min > 0 ? true : false
  # Launch configuration
  lc_name = "algo25-${var.environment}"
  # Do not trigger recreation of all ASG if launch config settings like user-data change
  recreate_asg_when_lc_changes = var.recreate_on_change
  image_id                     = data.aws_ami.ubuntu_algorithms.id
  instance_type                = "m5.large"
  key_name                     = "xxxxx"
  service_linked_role_arn      = "xxxxxx"
  iam_instance_profile         = local.iam_profile
  termination_policies         = ["OldestInstance"]
  health_check_grace_period    = 600
  suspended_processes          = ["ReplaceUnhealthy"]

  security_groups = var.use_feature_vpc ? [data.terraform_remote_state.feature-vpc.outputs.peering_sg_id, data.terraform_remote_state.feature-vpc.outputs.private_sg_id] : [aws_security_group.peering[0].id, aws_security_group.private[0].id]

  root_block_device = [
    {
      encrypted   = true
      volume_size = 100
      volume_type = "gp2"
    },
  ]

  # Auto scaling group
  asg_name                  = "algo25-${var.environment}"
  vpc_zone_identifier       = var.use_feature_vpc ? data.terraform_remote_state.feature-vpc.outputs.private_subnets : module.vpc.private_subnets
  health_check_type         = "EC2"
  min_size                  = 1
  max_size                  = 10
  desired_capacity          = null
  wait_for_capacity_timeout = 0
  target_group_arns         = module.algo25_nlb.target_group_arns

  user_data = templatefile("${path.module}/templates/user-data-algorithms.sh.tmpl", merge(
    local.facts_for_user_data_script,
    {
      host_type        = "algo-service-something-else",
      docker_image     = "algo-service-something-else-production:latest",
      git_repo         = "none",
      ansible_cicd_dir = "none"
    })
  )

  tags_as_map = merge(
    var.tags,
    map("Ansible_group", "group_service-something-else_${var.environment}"),
    map("Ansible_service", "service-something-else"),
    map("Ansible_env", var.environment)
  )
}
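
A side note on the tags_as_map blocks above: the map() function used there has been deprecated since Terraform 0.12 and is removed in later releases, so the same merge can be written with plain object literals, for example:

```hcl
tags_as_map = merge(
  var.tags,
  {
    Ansible_group   = "group_service-something-else_${var.environment}"
    Ansible_service = "service-something-else"
    Ansible_env     = var.environment
  }
)
```
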
antonbabenko commented 4 years ago

I don't see anything obviously wrong here, except that the names like something and something-else make me think both may match the same group of resources.

Can you set min_size = 0 and desired_capacity = null and verify that the launch configurations are created properly?
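
For example, a minimal sketch of that override against the algo25_asg call above (everything else left unchanged):

```hcl
module "algo25_asg" {
  source  = "terraform-aws-modules/autoscaling/aws"
  version = "3.7.0"

  # temporarily keep the group empty so only the launch configuration is exercised
  min_size         = 0
  max_size         = 10
  desired_capacity = null

  # ... all other arguments exactly as in the algo25_asg block above ...
}
```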

Dmitry1987 commented 4 years ago

Yes, the launch configurations are good and there's the correct number of them with proper tags, because if I click "refresh instances" to force an ASG to recreate its EC2 instance, they all recreate a correct instance. But that is still an issue for production, because I can only run such a destructive operation in our dev environment. "Rolling" the 'wrong' instances and creating new ones refreshes everything and every ASG ends up with a correct EC2 server. The concern is that after I run terraform apply one more time and add/remove groups massively (for example, I had 26 types of algorithms, we enabled 64 types, I regenerated all groups and applied the 64), the newly added ones came up mixed just like in the initial creation - 5-6 of them were messed up.

Adding more groups will happen about once a month for me in production, so I'm thinking about how to make sure I won't need to "roll" the servers each time we add groups. Maybe there's some race condition, but even after I chained them with depends_on, they still come up pointing at the wrong services.

Regarding the naming convention, the names of the mixed-up servers and their supposed 'owner' groups are completely different; for example, a "governing-law" ASG will point to a "termination-agreement" EC2 server (from a completely different group). I noticed that the first time it happened as well, because as far as I know ASGs use tags for ownership information ("aws:autoscaling:groupName"), so I paid attention to that too. This is a totally weird issue. I will bring in some screenshots after I run the experiment in dev a few more times :)

Dmitry1987 commented 4 years ago

By the way, is it possible to organize an ASG so that 2 instances are on the on-demand pricing tier as usual, and all other instances are autoscaled at spot pricing? I am trying to have a baseline of 2 servers for HA plus autoscaling up to N instances in each group. I remember doing that with the Spotinst service, but I wonder if it's possible to configure with vanilla ASGs and this module (or by forking the module and adding changes)?

It seems like the only way is to create 2 ASGs per service: one with on-demand instances that always has min=2/max=2/desired=2, and another with spot pricing, with the same AMI and all the same settings, just spot, and it will be the only one that autoscales. What do you think?

antonbabenko commented 4 years ago

You are right, this module does not support launch templates or the advanced spot scenarios you are describing, so you will need to have 2 ASGs per service.
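
For reference, outside this module the baseline-plus-spot setup can be expressed directly on the aws_autoscaling_group resource with a mixed_instances_policy. A minimal sketch, assuming a separate launch template (the resource names here are hypothetical, and the AMI/subnet references reuse the ones from your config above):

```hcl
resource "aws_launch_template" "service" {
  name_prefix   = "service-something-"
  image_id      = data.aws_ami.ubuntu_algorithms.id
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "service_mixed" {
  name                = "service-something-mixed-${var.environment}"
  min_size            = 2
  max_size            = 10
  vpc_zone_identifier = module.vpc.private_subnets

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2  # the first 2 instances are always on-demand
      on_demand_percentage_above_base_capacity = 0  # everything scaled above the base is spot
      spot_allocation_strategy                 = "lowest-price"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.service.id
        version            = "$Latest"
      }
    }
  }
}
```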

Dmitry1987 commented 3 years ago

I was able to solve it by naming the ASGs with hashes instead of ordered numbers, which makes creation and upgrades easier. The mix-up can still happen, but only during the first run when many groups are created, and rotating all instances when it happens puts everything back in the correct order. All future apply steps are smaller, with static names for existing groups, and that helps.
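
A rough sketch of what such hash-based naming could look like (the exact derivation and the module label here are hypothetical):

```hcl
locals {
  # stable per-service suffix derived from the service name instead of an ordered index
  service_id = substr(md5("service-something-${var.environment}"), 0, 8)
}

module "service_something_asg" {
  source  = "terraform-aws-modules/autoscaling/aws"
  version = "3.7.0"

  name     = "service-something-${local.service_id}"
  lc_name  = "service-something-${local.service_id}"
  asg_name = "service-something-${local.service_id}"

  # ... remaining arguments unchanged from the examples above ...
}
```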

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.