I think this is probably an ordering thing? Maybe if the SG is tainted and needs to be destroyed, then it needs to taint anything that is using it. AWS will not allow you to destroy a security group that has instances associated with it.

Is there any way it can detect this condition? I had this issue with a security group that I had imported, whose instance attachment I hadn't put into Terraform yet; so Terraform hadn't been told the security group was attached, but I wanted to rename the group, and it basically started tearing down the group and got stuck. Rather than getting stuck, I would prefer Terraform to simply fail when it plans to tear down a group that is attached to an instance, or, at least, fail when it tries and AWS prevents the deletion.
This was easily resolved for me by just commenting out the security group ID on the instance (which is also Terraform-managed) to encourage it to detach first, then adding it back in later.
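For illustration, here is a minimal sketch of that temporary edit (the resource names are hypothetical):

```hcl
resource "aws_instance" "app" {
  # ... other config ...

  # Step 1: comment out the attachment and run `terraform apply` so the SG is
  # detached first and can then be destroyed/recreated without AWS rejecting it.
  # vpc_security_group_ids = [aws_security_group.app.id]

  # Step 2: uncomment the line above and apply again to re-attach the new SG.
}
```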
I feel like there's a graceful way this could be managed, but it probably requires a resource understanding how it interacts with another beyond Terraform's ordering based on interpolation, at least where these quirks occur. For example, the `aws_security_group` resource would know that, when being renamed/destroyed, it needs to check any `aws_instance` resources for a matching entry in `vpc_security_group_ids` in order to detach that first.
It at least certainly needs to fail for situations where the attachment is not terraform managed, rather than hanging indefinitely.
It would be nice if the security group resource could detect what it is attached to and then detach from those resources, based on some flag on the `aws_security_group` resource. Kind of like a resource-specific lifecycle action.
It is really unexpected to not have any dependency management for security groups, especially since something as simple as a rename requires the resource to be re-created.
Any update on this issue?
I was able to work around this on a smaller deployment (2 instances) by manually detaching the security groups. It seems like there might be something in the docs about using `aws_network_interface_sg_attachment` in different contexts.
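For reference, `aws_network_interface_sg_attachment` manages a single SG-to-ENI association as its own resource; a minimal sketch, assuming existing `aws_instance.example` and `aws_security_group.example` resources:

```hcl
# Attaches one security group to the instance's primary network interface,
# so the association is tracked independently of the instance resource.
resource "aws_network_interface_sg_attachment" "example" {
  security_group_id    = aws_security_group.example.id
  network_interface_id = aws_instance.example.primary_network_interface_id
}
```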
Same problem here. I'm seeing the issue with spot instance requests, specifically.
Seeing this issue today with security groups attached to Lambda functions.
I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

```hcl
lifecycle {
  create_before_destroy = true
}
```

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
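In context, the suggestion looks roughly like this (a sketch with placeholder names, assuming a `var.vpc_id`; note the caveats about explicit `name`s and the per-ENI SG limit discussed below):

```hcl
resource "aws_security_group" "example" {
  # No explicit name: the provider generates a unique one, avoiding
  # name collisions while the old and new groups briefly coexist.
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_instance" "example" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.example.id]
}
```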
> I came across this issue recently and thought I would offer my solution. By adding this lifecycle block `lifecycle { create_before_destroy = true }` within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
This works until you are at the security group limit per network interface (i.e. if you have a limit of 5 security groups per network interface and 5 are already attached, it will try to attach a 6th security group and fail).
Experiencing this issue also... two years after it was reported...
I don't understand how this wasn't addressed yet, given the wide usage of Security Groups on AWS.
Would love to see SG assignment be a separate resource. At the least, it seems reasonable to have SG destruction detach before destroy. I suppose, though, that the SG to be destroyed may have been assigned to an instance not in this run's state.
> I came across this issue recently and thought I would offer my solution. By adding this lifecycle block `lifecycle { create_before_destroy = true }` within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
@chillerby In your case, were the rules nested into the `aws_security_group`, or defined in separate `aws_security_group_rule` resources? I used the former, and your lifecycle entry didn't make a difference...
Any news on this?
Edit: I could make it work with @chillerby's suggestion; I had to add the lifecycle block to the whole chain using the SG. Thanks!
Bump. This is quite a frustrating issue.
Still an issue; there is no way to get Terraform to recognise a security group's name is being changed without it hanging.

I needed to create a new security group and attach it to the instances; then, once Terraform had finished and I had checked the old SG was no longer referenced, I could remove the old SG from the Terraform config and run again fine. Adding lifecycle to the chain didn't seem to work, though...
Such a terrible issue to not have any traction.
Same issue for me as well. This is a huge one, any fixes?
Related (closed and locked) issue: https://github.com/hashicorp/terraform/issues/8617
This is quite unfortunate.
I am still seeing this issue on Terraform v0.12.24.
I can confirm the issue occurs for me as well. When I run terraform destroy, security groups are not detached from EKS cluster network interfaces, even if there are other sgs attached to them.
Terraform version: Terraform v0.12.20, provider.aws 2.7.0
> I came across this issue recently and thought I would offer my solution. By adding this lifecycle block `lifecycle { create_before_destroy = true }` within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
This will also not work if you explicitly assign a `name` to the security group, as you cannot have duplicate names.

You can append random digits from the `random_id` resource to the `name` of the security group, which will then work with the `create_before_destroy` workaround.
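A minimal sketch of that combination (resource names and `var.vpc_id` are placeholders; the `keepers` map is one way to control when the suffix is regenerated):

```hcl
resource "random_id" "sg_suffix" {
  byte_length = 4

  # Regenerate the suffix only when these inputs change.
  keepers = {
    vpc_id = var.vpc_id
  }
}

resource "aws_security_group" "example" {
  # Random suffix keeps the name unique while old and new groups coexist.
  name   = "my-sg-${random_id.sg_suffix.hex}"
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```

The provider's built-in `name_prefix` argument achieves a similar effect by generating a unique suffix itself.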
Hi @MicahCarrick,

I applied the lifecycle solution (seen in #2445). It worked for me even with an explicit security group name defined (without the random part). I don't know how, but Terraform seems to handle the name conflict.
This is also affecting me with Terraform 0.13.1 and the AWS provider 3.4. The lifecycle workaround proposed above worked in my case just now, but I also feel this is an unfortunate issue to not be getting any traction.
`lifecycle { create_before_destroy = true }` does indeed break for named SGs, but I've confirmed that you can make the aforementioned change and just comment out your names; then, when the apply is complete, uncomment your names and reapply.

It's a bit annoying to do it in 2 steps, but it seems unavoidable due to AWS' name-collision error... unless the provider detected and handled the name collision using multiple steps.
Still happening... I hope this could be solved (without using the lifecycle option) by making create-before-destroy the default SG strategy.
Thanks to everyone who proposed the lifecycle workaround. It works.
Really? What is wrong with HashiCorp lately? It started with the weird upgrade process that breaks everything, and now this 7-year-old bug?
Still happening...
> This works until you are at the security group limit per network interface (i.e. if you have a limit of 5 security groups per network interface and 5 are already attached, it will try to attach a 6th security group and fail).
See also #13593 - it's also impossible to remove the last security group from an ENI, and the "default" group is not a good stopgap for this transition because it's too permissive (by default).
Still causing frustration.
Still facing the same problem.
Looking further into this issue is super depressing: from what I can tell, this issue could be resolved with https://github.com/hashicorp/terraform/issues/16065, but that has been open since 2017.

As for my case, I can't recreate an SG because an existing SG (defined in my templates) references it. For some reason, Terraform keeps trying to delete it naively without traversing the dependency graph and deleting dependent objects in order. One would assume that an infrastructure-as-code tool would understand how to do this basic operation by default.
terraform is 🗑️
What would be a better alternative?
Hi all,
here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.
As with many other Terraform shortcomings, the lifeboat in this case is local provisioners (again). In a nutshell:
1) Add a destroy-time `local-exec` provisioner to your SG resource. This provisioner will run an AWS CLI command to detach the SG from your downstream resource (VPCe, EC2 instance, etc.) and replace it with the default SG before the destroy logic is executed. The catch here is that this command needs to know the ID/name of your downstream resource. As a destroy-time provisioner does not allow referencing external resources, you'll need to add a couple of tags to your SG resource that identify your downstream resource and can be read by the AWS CLI. As this destroy-time provisioner runs for `replace` as well as pure `destroy` operations, it is important to ignore its errors by setting `on_failure = continue`.

2) Create a new null resource and add a create-time `local-exec` provisioner to it. This provisioner will run an AWS CLI command to remove the default SG from your downstream resource. This is a cleanup action after the destroy-time provisioner described above. In order to execute this action only if the IDs of the attached SGs have changed, add a `triggers` block as shown in the example below. As the action of this null resource is not always required (specifically, when creating the resources from scratch), it is important to ignore its errors by setting `on_failure = continue`. Note: the reason for creating a separate null resource instead of embedding this provisioner into the downstream resource is that we need the provisioner to run after the downstream resource has been created.
Here is a full code example. My downstream resource in this example is a VPC endpoint. Your mileage may vary.
data "aws_security_group" "default" {
name = "default"
vpc_id = var.vpc_id
}
resource "aws_security_group" "private_link_sg" {
name = "my_beautiful_sg"
vpc_id = var.vpc_id
# The downstream resource must have at least one SG attached, therefore we
# attach the default SG of the VPC temporarily and remove it later on
provisioner "local-exec" {
when = destroy
command = <<DOC
ENDPOINT_ID=`aws ec2 describe-vpc-endpoints --filters "Name=tag:Name,Values=${self.tags.workaround1}" --query "VpcEndpoints[0].VpcEndpointId" --output text` &&
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $${ENDPOINT_ID} --add-security-group-ids ${self.tags.workaround2} --remove-security-group-ids ${self.id}
DOC
on_failure = continue
}
tags = {
# Destroy-stage provisioners do not allow referencing anything except
# internal attributes (self, count and each.key). The required values
# must therefore be added to the tags upfront, which can then be
# referenced using 'self'.
workaround1 = var.tags_vpc_endpoint.Name
workaround2 = data.aws_security_group.default.id
}
}
resource "aws_vpc_endpoint" "private_link" {
# here goes the downstream resource...
...
security_group_ids = [aws_security_group.private_link_sg.id]
...
}
resource "null_resource" "remove_default_sg" {
provisioner "local-exec" {
command = <<DOC
aws ec2 modify-vpc-endpoint --vpc-endpoint-id ${aws_vpc_endpoint.private_link.id} --remove-security-group-ids ${data.aws_security_group.default.id}
DOC
on_failure = continue
}
triggers = {
rerun_upon_change_of = join(",", aws_vpc_endpoint.private_link.security_group_ids)
}
}
@pspot2

> here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.
Thanks for suggesting using local-exec, this was the only thing that fixed it for me! 🎉 It reduced my terraform destroy time from over 30 minutes to less than 5 minutes!
I was dealing with lingering ENIs created automatically by a Lambda function, which were attached to the SG. Some slight variations for fixing this in my case included:

- The null resource and swapping the SG for the default SG weren't needed, because the Lambda doesn't require always having one.
- The destroy provisioner needed to be in the Lambda resource, not in the SG.
- In the destroy provisioner command, I used `aws lambda update-function-configuration` to clear the VPC config from the function. Doing this before deleting the function will immediately mark the ENIs for automatic deletion.
- I used `aws ec2 describe-network-interfaces` to check the ENIs' status and `sleep` while they still exist or the status is `in-use`. I didn't need to force-delete the ENIs because they are automatically deleted by AWS as soon as they become available. This makes the rest of `terraform destroy` wait, after which the function and the SG get destroyed just fine. This step is necessary because if the function is destroyed before the ENIs are done being cleaned up, they take just as long to be destroyed as before.
Could you post the code sample, for the benefit of others?
If you were wondering what the code @Etiene described could look like, here's an example I implemented to fix the exact same issue. I added this to my `aws_lambda_function` resource:

```hcl
# This block prevents the lambda's destruction from getting stuck on lingering ENIs.
# It clears the VPC config, and waits until the ENIs have been automatically deleted
# before deleting the function + security group.
provisioner "local-exec" {
  when    = destroy
  command = <<-DOC
    aws lambda update-function-configuration --function-name ${self.tags.function_name} --vpc-config '{"SubnetIds": [], "SecurityGroupIds": []}' &&
    while result=$(aws ec2 describe-network-interfaces --output text --filters '[{"Name": "group-id", "Values": ["${self.tags.security_group_id}"] }, {"Name": "status", "Values": ["in-use", "available"]}]'); test "$result" != ""; do
      sleep 10;
    done
  DOC
}

tags = {
  # Destroy-stage provisioners do not allow referencing anything except
  # internal attributes (self, count and each.key). The required values
  # must therefore be added to the tags upfront, which can then be
  # referenced using 'self'.
  security_group_id = aws_security_group.this.id
  function_name     = local.lambda_name
}
```
Hope this helps
The challenges with security group dependencies cannot be fully fixed, but there are things that can help.
Thank you very much for your comments and patience on this issue over the years! 🎉
After 6 years, we are going to bring this to a resolution. Unfortunately, resolution does not mean "solution." I'll get into that below. While we cannot fully solve all the challenges, there have been improvements over time and we can make more.
Making security groups work seamlessly with other resources in all the ways practitioners expect cannot be entirely fixed as Terraform and AWS work today.
The challenges with security groups are shared by other resources with dependency relationships. However, security groups may be more tricky than most. Here are some reasons why security groups pose challenges that cannot be entirely fixed. Hopefully these help explain why this issue has been open for over 6 years.
The challenges are inherent to the way security groups work. Consider this configuration:
resource "aws_security_group" "test" {
vpc_id = aws_vpc.test.id
name = "mysg"
}
resource "aws_instance" "test" {
# ... other config ...
vpc_security_group_ids = [aws_security_group.test.id]
}
In Terraform, the `aws_instance.test` resource refers to the security group, `aws_security_group.test`. In other words, to Terraform, the instance depends on the security group. This works well because, for example, the security group is created first, before the instance, so they can be correctly associated.

However, if you want to recreate the security group, AWS prevents deletion because the security group has a dependency: the instance!

Within Terraform, `aws_security_group.test` has no knowledge of `aws_instance.test`. Managing this would require 2-way (cyclic) dependencies. Terraform is based on a Directed Acyclic Graph (DAG), which has many advantages but doesn't allow modeling this type of 2-way dependency.
While it might be possible to fudge things in the code for a single complication, the number of resources that reference security groups prevents one-off fudges. Currently, there are 101 resources with an argument that refers to security groups. Adding any kind of specialized logic to `aws_security_group` to manage 2-way dependencies for this many resources would be difficult, error prone, and unmaintainable.
Even assuming we were going to employ any and all means to make this work, what would that even look like?
Take, for example, the situation where you have a security group associated with an EC2 instance. To be able to delete the security group would require modifying the EC2 instance to substitute another security group, but which one? We wouldn't want the security group resource modifying the EC2 instance because it would introduce drift and/or wouldn't necessarily be under management of Terraform. This violates the design principles of Terraform.
Certainly, when something as simple as a rename causes the security group to be forcibly replaced, it would be convenient to somehow handle the association problem. At the moment, AWS has not given the world an easy way to do this.
replace_triggered_by

The new `replace_triggered_by` element of the `lifecycle` meta-argument block can solve this problem without provisioners. However, the big downside is that it will recreate the resource associated with the security group. For cattle infrastructures, this may be acceptable.
Adding to the example from earlier, we have this:
resource "aws_security_group" "test" {
vpc_id = aws_vpc.test.id
name = "mysg"
}
resource "aws_instance" "test" {
# ... other config ...
vpc_security_group_ids = [aws_security_group.test.id]
lifecycle {
replace_triggered_by = [
aws_security_group.test.name
]
}
}
If you want to try this out, here's how you can:

1. Run `terraform apply`.
2. Change the security group name to `mysg2` and apply again.

If you are seeing long delays before security groups report that they have dependent objects, such as an EC2 instance, we recommend using a shorter timeout. Note, the default `delete` timeout is currently `15m`. We are making changes to ensure the timeout you set is better respected. For example, to fail faster in a dependency situation, consider lowering the `delete` timeout to 2-5 minutes.
resource "aws_security_group" "test" {
vpc_id = aws_vpc.test.id
name = "mysg"
timeouts {
delete = "2m"
}
}
The PR that closes this issue adds significant documentation about why the problem occurs and adds examples, including with `replace_triggered_by` and provisioners.
This functionality has been released in v4.60.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
This issue was originally opened by @kojiromike as hashicorp/terraform#8772. It was migrated here as part of the provider split. The original body of the issue is below.
Hi there,
Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.
Terraform Version
Terraform v0.7.3
Affected Resource(s)
- aws_security_group
- aws_instance
Expected Behavior
Terraform should always update aws_instance security group attachments before trying to destroy an existing security group.
Actual Behavior
Terraform tried to destroy security groups that were attached to an instance. The `apply` ran for 5 minutes before timing out. Then, Terraform attempted to continue applying by creating new security groups, which failed because the old security groups still existed.

Steps to Reproduce
1. Apply a configuration in which an `aws_instance` references an `aws_security_group`.
2. Change the security group's `name` (forcing it to be recreated) and run `terraform apply` again.