Autoscaling group already exists after failure due to AWS limits

hashibot commented 7 years ago

This issue was originally opened by @dmikalova as hashicorp/terraform#9332. It was migrated here as part of the provider split. The original body of the issue is below.

Terraform Version

0.7.4

Affected Resource(s)

aws_autoscaling_group

Terraform Configuration Files

resource "aws_launch_configuration" "mod" {
  name_prefix = "${var.tags["product"]}-${var.tags["env"]}-${var.tags["service"]}-${var.tags["component"]}-${var.lc_ami_timestamp}-"

  image_id             = "${var.lc_ami_id}"
  instance_type        = "${var.lc_instance_type}"
  iam_instance_profile = "${var.lc_iam_instance_profile_id}"
  key_name             = "${var.lc_key_name}"
  security_groups      = ["${split(",", var.lc_security_groups)}"]

  user_data         = "${var.lc_user_data_file}"
  enable_monitoring = false

  root_block_device {
    volume_size = "${var.lc_root_block_device_volume_size}"
    volume_type = "${var.lc_root_block_device_volume_type}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "mod" {
  # This causes LCs and ASGs to stay in sync.
  name = "${aws_launch_configuration.mod.name}"

  availability_zones   = ["${var.asg_availability_zones}"]
  vpc_zone_identifier  = ["${var.asg_subnet_ids}"]
  launch_configuration = "${aws_launch_configuration.mod.id}"
  min_size             = "${var.asg_min_size}"
  max_size             = "${var.asg_max_size}"

  termination_policies = ["${split(",", var.asg_termination_policies)}"]
  load_balancers       = ["${split(",", var.asg_elb_names)}"]

  lifecycle {
    create_before_destroy = true
  }
}

Debug Output

* aws_autoscaling_group.mod: Error creating AutoScaling Group: AlreadyExists: AutoScalingGroup by this name already exists - A group with the name once-dev-playback-nginxplus-20161005Z031807-20161011214752196348605jbs already exists

Important Factoids

Both resources use create_before_destroy, and the ASG name is set from the launch configuration name. The desired count is never reached because the AWS instance type limits for this region were being hit.

Expected Behavior

First run:

1. Terraform successfully creates LC.
2. Terraform creates ASG, and stores the fact that it was created.
3. Terraform waits for ASG to reach desired count.
4. Terraform fails because desired count is never reached.

Second run:

1. Terraform removes the ASG that was created if it still has not reached desired count.
2. Steps 2-3 of the first run are repeated.
3. If the limit was lifted, the run succeeds; if not, it fails again.

No manual intervention is necessary.

Actual Behavior

First run:

1. Terraform successfully creates LC.
2. Terraform creates the ASG.
3. Terraform waits for ASG to reach desired count.
4. Terraform fails because desired count is never reached.

Second run:

Terraform fails because it attempts to create another ASG with the same name as above.

Even after the limit is lifted, manual intervention is necessary to remove the old ASG: Terraform forgot about the ASG that it created and leaves it behind as cruft.

Steps to Reproduce

Run terraform apply to create a create_before_destroy ASG that has the same name as its LC, under conditions that prevent the desired count from being reached (for example, an exhausted instance type limit).
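
A stripped-down configuration that might reproduce this on the affected versions is sketched below. The resource name repro, the AMI ID, the availability zone, and the sizes are placeholders and not values from the original report. wait_for_capacity_timeout is set only so that the failing apply returns sooner.

```hcl
# Hypothetical minimal reproduction; every literal value is a placeholder.
resource "aws_launch_configuration" "repro" {
  name_prefix   = "repro-"
  image_id      = "ami-00000000" # placeholder, substitute a real AMI ID
  instance_type = "t2.micro"     # assumption: a type whose account limit can be exceeded

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "repro" {
  # Same pattern as the report: the ASG name follows the generated LC name.
  name                 = "${aws_launch_configuration.repro.name}"
  availability_zones   = ["us-east-1a"] # placeholder
  launch_configuration = "${aws_launch_configuration.repro.id}"

  # Assumption: sizes above the account's instance limit, so capacity is never reached.
  min_size = 100
  max_size = 100

  # Shorten the capacity wait (the provider default is "10m").
  wait_for_capacity_timeout = "2m"

  lifecycle {
    create_before_destroy = true
  }
}
```

Running terraform apply twice against a configuration like this should, if the behavior described above still applies, fail the second time with the AlreadyExists error.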

dvianello commented 6 years ago

We were just badly affected by this one when wait_for_elb_capacity took too long to report back healthy.

Is there any known workaround that we can implement to keep this from biting us while we wait for the fix?

Thanks!

tamsky commented 6 years ago

I experienced this as well and documented what I believe is the root cause in this issue's parent:

https://github.com/hashicorp/terraform/issues/9332#issuecomment-304717880

The rest of this answer assumes that lifecycle { create_before_destroy = true } is in use.

This issue stems from the fact that the ASG name does not change across execution cycles. (This holds true for folks that use name_prefix for their Launch Configs and name their ASG equal to the generated Launch Config name -- the LC name does not change, so their ASG name is also unchanged.)

ASG name values do not magically change after an apply failure. ASG name values do not change even if marked tainted. When the ASG name is declared as:

resource "aws_autoscaling_group" "mod" {
  # This causes LCs and ASGs to stay in sync.
  name = "${aws_launch_configuration.mod.name}"

then the thing to do is to cause the LC name to change.

There are at least 3 workarounds for manipulating the Launch Config state after failure. See [1].

Another workaround (not mentioned in [1]) is to manually force the name of the resource to change after a failure. One implementation of that could look like:

resource "aws_autoscaling_group" "main" {
    lifecycle { create_before_destroy = true }
    name                      = "${aws_launch_configuration.main.name}"
...
}
resource "aws_launch_configuration" "main" {
    lifecycle { create_before_destroy = true }
    name_prefix                 = "${var.lc_failed_deploy_counter == "" ? "" : "${var.lc_failed_deploy_counter}" }-"
...
}

To use the workaround, set lc_failed_deploy_counter to a non-empty value during your next plan. You should notice that terraform will create a new ASG+LC pair, and only after successfully creating that pair during apply will it destroy all previous pairs of ASG+LC.
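
The snippet above references var.lc_failed_deploy_counter without declaring it. A minimal sketch of the declaration, assuming an empty default so that ordinary runs are unaffected (the description text is illustrative wording, not from the original comment):

```hcl
# Assumed declaration for the variable used in name_prefix above.
variable "lc_failed_deploy_counter" {
  description = "Set to a new non-empty value after a failed apply to force a new LC (and ASG) name."
  default     = ""
}
```

The value could then be supplied for a single run with the standard -var flag, for example terraform plan -var 'lc_failed_deploy_counter=1', and changed again if a later apply also fails.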

[1] : https://github.com/terraform-providers/terraform-provider-aws/issues/2438#issuecomment-354368383 and https://github.com/terraform-providers/terraform-provider-aws/issues/2438#issuecomment-426391266

bflad commented 4 years ago

Hi folks 👋 There are a lot of potential factors in this type of issue, and there have been a lot of changes to the Terraform CLI and Terraform AWS Provider codebases since it was filed a while ago. If there are still lingering issues present on recent versions (Terraform CLI 0.12.28 and Terraform AWS Provider 2.69.0 are the latest as of this writing), please do file a new bug report and we can take a fresh look. Thanks so much.

ghost commented 4 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
