hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
9.74k stars 9.1k forks source link

[Bug]: ASG Instance Refresh Rollback Impaired behavior #34189

Open adeeb-ma opened 10 months ago

adeeb-ma commented 10 months ago

Terraform Core Version

1.0.5

AWS Provider Version

5.32.1

Affected Resource(s)

aws_autoscaling_group

Expected Behavior

Applying the Autoscaling group with Instance Refresh enabled (i.e. changing launch template version, for example) would trigger the instance refresh to deploy the new configurations, but shouldn't update the configuration beforehand.

In the AWS console, you would see under the "details" tab the old Launch Template Version, while in the Instance Refresh tag it would run with the target configuration being the newer launch template version.

This would imply that in case the new launch template version (2) fails on healthcheck for a while, it would rollback to the previous version (1).

Actual Behavior

Running apply would update the launch template version for the ASG, AND THEN triggers instance refresh with target configuration being the latest launch template version.

This imposes that should the newer launch template version (2) fail on healthcheck for a while, it would "roll back" to the same newer launch template version (2), making the ASG stuck in a perpetual rollback replacing the unhealthy launch template over and over.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

data "aws_ami" "ami" {
  most_recent = true
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
  owners = ["099720109477"] # Canonical
}

resource "aws_launch_template" "sample" {
  name = "sample-lt"

  image_id                             = data.aws_ami.ami.id
  instance_initiated_shutdown_behavior = "terminate"
  instance_type                        = "t3.small"
# user_data = base64encode("#!/bin/bash\nsudo ifconfig eth0 down") # This is to purposefully fail the EC2 health check, to test out the Instance Refresh Rollback
}

resource "aws_autoscaling_group" "sample" {
  name  = "sample-asg"

  launch_template {
    id      = aws_launch_template.sample.id
    version = aws_launch_template.sample.latest_version
  }
  min_size                  = 1
  max_size                  = 1
  desired_capacity          = 1
  health_check_type         = "EC2" # Can be ELB as well
  health_check_grace_period = 20
  wait_for_capacity_timeout = "5m"

  instance_refresh {
    strategy = "Rolling"
    # Preferences can be anything, as long as auto_rollback = true
    preferences {
      auto_rollback = true
      min_healthy_percentage = 75
      skip_matching          = false
    }
  }
}

Steps to Reproduce

  1. Run terraform apply, it should complete successfully, creating the ASG and LT
  2. Ensure in the AWS console that the ASG is created and its instance is launched
  3. Change the launch template, either the instance type to something else that would purposefully fail the EC2 health check
  4. Run terraform apply again, it should complete successfully again
  5. Ensure in the AWS console that: a. The ASG's "Details" tab has changed from Launch Template Version 1 -> 2 b. The ASG's "Instance Refresh" tab shows it's undergoing the procedure

If the new template is unhealthy, you should also see that the instance refresh would fail after an hour, but the rollback would launch the Newer Launch Template Version (2) forever.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

No

github-actions[bot] commented 10 months ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue

arianvp commented 3 weeks ago

Yeh this is a bug in the provider.

When triggering an instance refresh, Terraform should not modify the Auto Scaling Group before calling StartInstanceRefresh. Instead it should pass

DesiredConfiguration:
  LaunchTemplate:
      LaunchTemplateName: <name>
      Version: <version>

to the StartInstanceRefresh call.

Then the instance refresh will update the ASG to point to the new Launch Template Version IF and only IF the instance refresh succeeded. If it fails; it will use the existing Launch Template Version set on the ASG for the rollback.

If you update the launch template version manually beforehand; this confuses any of the rollback logic.

In short: The terraform provider should never manually update the launch template parameters on the ASG. Instead it should pass DesiredConfiguration to StartInstanceRefresh

arianvp commented 3 weeks ago

The same has to happen for changing mixed_instances_policy.

In that case DesiredConfiguration.MixedInstancesPolicy should be set with the fields that need updating.