hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Destroying Security Groups Takes Forever with Attached SG #265

Closed: hashibot closed this issue 1 year ago

hashibot commented 7 years ago

This issue was originally opened by @kojiromike as hashicorp/terraform#8772. It was migrated here as part of the provider split. The original body of the issue is below.

Hi there,

Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.

Terraform Version

Terraform v0.7.3

Affected Resource(s)

  • aws_security_group
  • aws_instance

Expected Behavior

Terraform should always update aws_instance security group attachments before trying to destroy an existing security group.

Actual Behavior

Terraform tried to destroy security groups that were attached to an instance. The apply ran for 5 minutes before timing out. Then, Terraform attempted to continue applying by creating new security groups, which failed because the old security groups still existed.

Steps to Reproduce

  1. Create a security group and attach it to an instance. Plan and apply.
  2. Change something so that the SG gets recreated. (Taint that security group, perhaps?)
  3. Plan and apply.
camerondavison commented 7 years ago

I think this is probably an ordering thing. Maybe if the SG is tainted and needs to be destroyed, then Terraform needs to taint anything that is using it. AWS will not allow you to destroy a security group that has instances associated with it.

geoffreywiseman commented 7 years ago

Is there any way it can detect this condition? I had this issue with a security group that I had imported but whose instance attachment I hadn't yet put into Terraform -- so Terraform hadn't been told the security group was attached. I wanted to rename the group, and Terraform basically started tearing down the group and got stuck. Rather than getting stuck, I would prefer Terraform to simply fail when it plans to tear down a group that is attached to an instance, or at least fail when it tries and AWS prevents the deletion.

GroovyCarrot commented 6 years ago

This was easily resolved for me by just commenting out the security group ID on the instance (which is also Terraform managed) to encourage Terraform to detach it first, and then adding it back in later.
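
For reference, a minimal sketch of that two-step swap, using the VPC's default security group as a temporary stand-in so the instance is never left without an SG (resource names here are illustrative):

data "aws_security_group" "default" {
  name   = "default"
  vpc_id = var.vpc_id
}

resource "aws_instance" "app" {
  # ... other config ...

  # Step 1: temporarily point the instance at the default SG so Terraform
  # detaches the old one on the next apply:
  vpc_security_group_ids = [data.aws_security_group.default.id]

  # Step 2 (after the old SG is destroyed/recreated): restore the real
  # reference and apply again:
  # vpc_security_group_ids = [aws_security_group.app.id]
}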

I feel like there's a graceful way this could be managed, but it probably requires a resource understanding how it interacts with another beyond Terraform's interpolation-based ordering, at least where these quirks occur. For example, the aws_security_group resource would know that when it is being renamed/destroyed, it needs to check any aws_instance resources for a matching entry in vpc_security_group_ids in order to detach itself first.

It at least certainly needs to fail in situations where the attachment is not Terraform managed, rather than hanging indefinitely.

Jonnymcc commented 6 years ago

Would be nice if the security_group resource could detect what it is attached to and then detach from those resources, based on some flag on the security_group resource -- kind of like a resource-specific lifecycle action.

twall commented 6 years ago

It is really unexpected to not have any dependency management for security groups, especially since something as simple as a rename requires the resource to be re-created.

boeboe commented 6 years ago

Any update on this issue?

lazzarello commented 6 years ago

I was able to work around this on a smaller deployment (2 instances) by manually detaching the security groups. It seems like there might be something in the docs about using aws_network_interface_sg_attachment in different contexts.
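
For anyone looking for that resource, here's a minimal sketch attaching an SG to an instance's primary ENI (resource names are illustrative):

resource "aws_network_interface_sg_attachment" "example" {
  security_group_id    = aws_security_group.example.id
  network_interface_id = aws_instance.example.primary_network_interface_id
}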

DanielMarquard commented 5 years ago

Same problem here. I'm seeing the issue with spot instance requests, specifically.

Botono commented 5 years ago

Seeing this issue today with security groups attached to Lambda functions.

chillerby commented 5 years ago

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
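
In context, that looks something like this (a minimal sketch; names are illustrative):

resource "aws_security_group" "example" {
  vpc_id = var.vpc_id
  # No explicit name here, so the replacement group's generated name
  # cannot collide with the old one during create-before-destroy.

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_instance" "example" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.example.id]
}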

jeohist commented 5 years ago

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

This works until you are at the security group limit per network interface (e.g., if you have a limit of 5 security groups per network interface and 5 are already attached, it will try to attach a 6th security group and fail).

ryudice commented 5 years ago

Experiencing this issue also... 2 years after being reported...

Dzhuneyt commented 5 years ago

I don't understand how this wasn't addressed yet, given the wide usage of Security Groups on AWS.

babbottscott commented 5 years ago

Would love to see SG assignment be a separate resource. At the least, it seems reasonable to have SG destruction detach before destroying. I suppose, though, that the SG to be destroyed may have been assigned to an instance not in this run's state.

eduardohki commented 5 years ago

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

@chillerby In your case, were the rules nested in aws_security_group, or defined in separate aws_security_group_rule resources?

I used the former, and the lifecycle entry didn't make a difference...

gabrielerzinger commented 4 years ago

Any news on this?

Edit: could make it work with @chillerby's suggestion; had to add the lifecycle block to the whole chain using the SG, thanks!

ghost commented 4 years ago

Bump. This is quite a frustrating issue.

stuartabramshumphries commented 4 years ago

Still an issue; there's no way to get Terraform to recognise a security group's name is being changed without it hanging.

stuartabramshumphries commented 4 years ago

Needed to create a new security group and attach that to instances; then, once Terraform had finished and I'd checked the old SG was no longer referenced, I could remove the old SG from the config and run again fine. Adding lifecycle to the chain didn't seem to work, though.

ahathaway commented 4 years ago

Such a terrible issue to not have any traction.

ado120 commented 4 years ago

Same issue for me as well. This is a huge one, any fixes?

abeluck commented 4 years ago

Related (closed and locked) issue: https://github.com/hashicorp/terraform/issues/8617

This is quite unfortunate.

gaurish commented 4 years ago

I am still seeing this issue on

Terraform v0.12.24

korobass commented 4 years ago

I can confirm the issue occurs for me as well. When I run terraform destroy, security groups are not detached from the EKS cluster network interfaces, even when other SGs are attached to them.

Terraform v0.12.20, provider.aws 2.7.0

MicahCarrick commented 4 years ago

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

This also will not work if you explicitly assign a name to the security group, as you cannot have duplicate names.

You can append random digits from the random_id resource to the name of the security group, which will then work with the create_before_destroy workaround.
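
A sketch of that combination (the keepers trigger below is illustrative; tie it to whatever forces your group to be replaced):

resource "random_id" "sg_suffix" {
  byte_length = 4

  # Regenerate the suffix whenever the group must be replaced, so the new
  # name never collides with the old one (illustrative trigger):
  keepers = {
    vpc_id = var.vpc_id
  }
}

resource "aws_security_group" "example" {
  name   = "my-sg-${random_id.sg_suffix.hex}"
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

The provider's name_prefix argument is another way to get unique, non-colliding names without the extra resource.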

Roxyrob commented 4 years ago

Hi @MicahCarrick

I applied the lifecycle solution (seen in #2445). It worked for me even with an explicit security group name defined (without the random part).

I don't know how, but Terraform seems to handle the name conflict.

ilulillirillion commented 4 years ago

This is also affecting me with Terraform 0.13.1 and AWS provider 3.4. The lifecycle workaround proposed above worked in my case, but I also feel this is an unfortunate issue to not be getting any traction.

arcreative commented 4 years ago

lifecycle { create_before_destroy = true } does indeed break for named SGs, but I've confirmed that you can make the aforementioned change with your names commented out, then, once the apply is complete, uncomment your names and reapply.

It's a bit annoying to do it in 2 steps, but it seems unavoidable due to AWS's name collision error... unless the provider detected and handled the name collision using multiple steps.

JCarrascoF commented 3 years ago

Still happening... I hope this could be solved (without the lifecycle option) by making create-before-destroy the default SG strategy.

Thanks to everyone who proposed the lifecycle workaround. It works.

magic69 commented 3 years ago

Really? What is wrong with HashiCorp lately? It started with the weird upgrade process that breaks everything, and now this 7-year-old bug?

kingindanord commented 3 years ago

Still happening...

awilkins commented 3 years ago

This works until you are at the security group limit per network interface (e.g., if you have a limit of 5 security groups per network interface and 5 are already attached, it will try to attach a 6th security group and fail).

See also #13593 - it's also impossible to remove the last security group from an ENI, and the "default" group is not a good stopgap for this transition because it's too permissive (by default).

GeekEast commented 3 years ago

Still causing frustration.

regmicmahesh commented 3 years ago

Still facing the same problem.

rirze commented 2 years ago

Looking further into this issue is super depressing -- from what I can tell, this issue could be resolved by https://github.com/hashicorp/terraform/issues/16065, but that has been open since 2017.

As for my case, I can't recreate an SG because an existing SG (defined in my templates) references it. For some reason, Terraform keeps trying to delete it naively without traversing the dependency graph and deleting dependent objects in order. One would assume that an infrastructure-as-code tool would understand how to do this basic operation by default.

pspot2 commented 2 years ago

terraform is 🗑️

What would be a better alternative?

pspot2 commented 2 years ago

Hi all,

here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.

As with many other Terraform shortcomings, the lifeboat in this case is local provisioners (again). In a nutshell:

1) Add a destroy-time local-exec provisioner to your SG resource. This provisioner runs an AWS CLI command that detaches the SG from your downstream resource (VPC endpoint, EC2 instance, etc.) and replaces it with the default SG before the destroy logic is executed. The catch here is that this command needs to know the ID/name of your downstream resource. Since a destroy-time provisioner does not allow referencing external resources, you'll need to add a couple of tags to your SG resource that identify your downstream resource and can be read by the AWS CLI. As this destroy-time provisioner runs both for replace and for pure destroy operations, it is important to ignore its errors by setting on_failure = continue.

2) Create a new null resource and add a create-time local-exec provisioner to it. This provisioner runs an AWS CLI command that removes the default SG from your downstream resource -- a cleanup action after the destroy-time provisioner described above. To execute this action only when the IDs of the attached SGs have changed, add a triggers block as shown in the example below. As the action of this null resource is not always required (specifically, when creating the resources from scratch), it is important to ignore its errors by setting on_failure = continue. Note: the reason for creating a separate null resource instead of embedding this provisioner in the downstream resource is that the provisioner needs to run after the downstream resource has been created.

Here is a full code example. My downstream resource in this example is a VPC endpoint. Your mileage may vary.

data "aws_security_group" "default" {
    name   = "default"
    vpc_id = var.vpc_id
}

resource "aws_security_group" "private_link_sg" {
    name        = "my_beautiful_sg"
    vpc_id      = var.vpc_id

    # The downstream resource must have at least one SG attached, therefore we
    # attach the default SG of the VPC temporarily and remove it later on
    provisioner "local-exec" {
        when    = destroy
        command = <<DOC
            ENDPOINT_ID=`aws ec2 describe-vpc-endpoints --filters "Name=tag:Name,Values=${self.tags.workaround1}" --query "VpcEndpoints[0].VpcEndpointId" --output text` &&
            aws ec2 modify-vpc-endpoint --vpc-endpoint-id $${ENDPOINT_ID} --add-security-group-ids ${self.tags.workaround2} --remove-security-group-ids ${self.id}
        DOC
        on_failure = continue
    }

    tags = {
        # Destroy-stage provisioners do not allow referencing anything except
        # internal attributes (self, count and each.key). The required values
        # must therefore be added to the tags upfront, which can then be
        # referenced using 'self'.
        workaround1 = var.tags_vpc_endpoint.Name
        workaround2 = data.aws_security_group.default.id
    }
}

resource "aws_vpc_endpoint" "private_link" {
    # here goes the downstream resource...
    ...
    security_group_ids  = [aws_security_group.private_link_sg.id]
    ...
}

resource "null_resource" "remove_default_sg" {
    provisioner "local-exec" {
        command = <<DOC
            aws ec2 modify-vpc-endpoint --vpc-endpoint-id ${aws_vpc_endpoint.private_link.id} --remove-security-group-ids ${data.aws_security_group.default.id}
        DOC
        on_failure = continue
    }

    triggers = {
        rerun_upon_change_of = join(",", aws_vpc_endpoint.private_link.security_group_ids)
    }
}

Etiene commented 2 years ago

@pspot2

here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.

Thanks for suggesting using local-exec, this was the only thing that fixed it for me! 🎉 It reduced my terraform destroy time from over 30 minutes to less than 5 minutes!

I was dealing with lingering ENIs created automatically by a lambda function, which were attached to the SG. Some slight variations for fixing this in my case included:

  • The null resource and swapping the SG for the default SG weren't needed, because the Lambda doesn't require always having one
  • The destroy provisioner needed to be in the Lambda resource, not in the SG
  • In the destroy provisioner command, I used aws lambda update-function-configuration to clear the VPC config from the function. Doing this before deleting the function will immediately mark the ENIs for automatic deletion.
  • I used aws ec2 describe-network-interfaces to check the ENI status and sleep while they still exist or their status is in-use. I didn't need to force-delete the ENIs, because AWS deletes them automatically as soon as they become available. This waits before the rest of terraform destroy can go on, and then the function and the SG get destroyed just fine. This step is necessary because if the function is destroyed before the ENIs are done being cleaned up, they still take as long to be destroyed as before.

srakshe commented 2 years ago

@pspot2

here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.

Thanks for suggesting using local-exec, this was the only thing that fixed it for me! 🎉 It reduced my terraform destroy time from over 30 minutes to less than 5 minutes!

I was dealing with lingering ENIs created automatically by a lambda function, which were attached to the SG. Some slight variations for fixing this in my case included:

  • The null resource and swapping the SG for the default SG weren't needed, because the Lambda doesn't require always having one
  • The destroy provisioner needed to be in the Lambda resource, not in the SG
  • In the destroy provisioner command, I used aws lambda update-function-configuration to clear the VPC config from the function. Doing this before deleting the function will immediately mark the ENIs for automatic deletion.
  • I used aws ec2 describe-network-interfaces to check the ENI status and sleep while they still exist or their status is in-use. I didn't need to force-delete the ENIs, because AWS deletes them automatically as soon as they become available. This waits before the rest of terraform destroy can go on, and then the function and the SG get destroyed just fine. This step is necessary because if the function is destroyed before the ENIs are done being cleaned up, they still take as long to be destroyed as before.

Could you post the code sample for the benefit of others?

antoniogonca95 commented 1 year ago

If you were wondering what the code @Etiene described could look like, here's an example I implemented to fix the exact same issue. I added this to my aws_lambda_function resource:

  # This block prevents the Lambda's destruction from getting stuck on lingering ENIs.
  # It clears the VPC config and waits until the ENIs have been automatically deleted
  # before deleting the function + security group.
  provisioner "local-exec" {
    when    = destroy
    command = <<DOC
        aws lambda update-function-configuration --function-name ${self.tags.function_name} --vpc-config '{"SubnetIds": [], "SecurityGroupIds": []}' &&
        while result=$(aws ec2 describe-network-interfaces --output text --filters '[{"Name": "group-id", "Values": ["${self.tags.security_group_id}"] }, {"Name": "status", "Values": ["in-use", "available"]}]'); test "$result" != ""; do
          sleep 10;
        done
    DOC
  }

  tags = {
    # Destroy-stage provisioners do not allow referencing anything except
    # internal attributes (self, count and each.key). The required values
    # must therefore be added to the tags upfront, which can then be
    # referenced using 'self'.
    security_group_id = aws_security_group.this.id
    function_name = local.lambda_name
  }

Hope this helps

YakDriver commented 1 year ago

TL;DR

The challenges with security group dependencies cannot be fixed, but there are things that can help.

Intro

Thank you very much for your comments and patience on this issue over the years! 🎉

After 6 years, we are going to bring this to a resolution. Unfortunately, resolution does not mean "solution." I'll get into that below. While we cannot fully solve all the challenges, there have been improvements over time and we can make more.

"Unfixable"

Making security groups work seamlessly with other resources in all the ways practitioners expect cannot be entirely fixed as Terraform and AWS work today.

The challenges with security groups are shared by other resources with dependency relationships. However, security groups may be more tricky than most. Here are some reasons why security groups pose challenges that cannot be entirely fixed. Hopefully these help explain why this issue has been open for over 6 years.

Inherent limits

The challenges are inherent to the way security groups work. Consider this configuration:

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"
}

resource "aws_instance" "test" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.test.id]
}

In Terraform, the aws_instance.test resource refers to the security group, aws_security_group.test. In other words, to Terraform, the resource depends on the security group. This works well because, for example, the security group is created first, before the resource, so they can be correctly associated.

However, if you want to recreate the security group, AWS prevents deletion because the security group has a dependent object: the instance!

Within Terraform, the aws_security_group.test has no knowledge of aws_instance.test. Managing this would require 2-way (cyclic) dependencies. Terraform is based on a Directed Acyclic Graph (DAG), which has many advantages, but doesn't allow for modeling this type of 2-way dependency.

Sheer numbers

While it might be possible to fudge things in the code for a single complication, the number of resources that reference security groups prevents one-off fudges. Currently, there are 101 resources with an argument that refers to security groups. Adding any kind of specialized logic to aws_security_group to manage 2-way dependencies for this many resources would be difficult, error prone, and unmaintainable.

With 2-way dependency information, what is the fix?

Even assuming we were going to employ any and all means to make this work, what would that even look like?

Take, for example, the situation where you have a security group associated with an EC2 instance. To be able to delete the security group would require modifying the EC2 instance to substitute another security group, but which one? We wouldn't want the security group resource modifying the EC2 instance because it would introduce drift and/or wouldn't necessarily be under management of Terraform. This violates the design principles of Terraform.

Certainly, when something as simple as a rename causes the security group to be forcibly replaced, it would be convenient to somehow handle the association problem. At the moment, AWS has not given the world an easy way to do this.

Partial fixes and improvements

replace_triggered_by

This new element of the lifecycle meta-argument block can solve this problem without provisioners. However, the big downside is that it will recreate the resource associated with the security group. For cattle infrastructure, this may be acceptable.

Adding to the example from earlier, we have this:

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"
}

resource "aws_instance" "test" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.test.id]

  lifecycle {
    replace_triggered_by = [
      aws_security_group.test.name
    ]
  }
}

If you want to try this out, here's how you can:

  1. terraform apply
  2. Rename the security group to mysg2 and apply again
  3. It will work, but it will replace the EC2 instance as well.

Shorter timeout

If you are seeing long delays before security groups report that they have dependent objects, such as an EC2 instance, we recommend using a shorter timeout. Note: the default delete timeout is currently 15m. We are making changes to ensure the timeout you set is respected more closely. For example, to fail faster in a dependency situation, consider lowering the delete timeout to 2-5 minutes.

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"

  timeouts {
    delete = "2m"
  }
}

Documentation

The PR that closes this issue adds significant documentation about why the problem occurs and adds examples, including ones using replace_triggered_by and provisioners.

github-actions[bot] commented 1 year ago

This functionality has been released in v4.60.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.