hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
[Bug]: TF often tries to create duplicate security group rules #29797

Open speller opened 1 year ago

speller commented 1 year ago

Terraform Core Version

1.3.7

AWS Provider Version

4.57.0

Affected Resource(s)

In my infrastructure, Terraform often tries to create duplicate `aws_security_group_rule` resources on the same security group. When I delete the rules manually, Terraform recreates them and the apply succeeds. On the next run it may or may not fail again. This also happens on deployments that were previously created from scratch with the same configuration.

Expected Behavior

Successful apply

Actual Behavior

│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 8080, to port: 8080, ALLOW" already exists
│   status code: 400, request id: d447b3b1-7324-4e26-a494-d1c84ed0d0f0
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-magi,
│   on service-magi/magi-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-magi":
│   11: resource "aws_security_group_rule" "vpc-http-magi" {
│ 
╵
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 8082, to port: 8082, ALLOW" already exists
│   status code: 400, request id: 0c54c265-403f-40c3-9c5b-4b293b8e4be1
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-sim,
│   on service-magi/sim-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-sim":
│   11: resource "aws_security_group_rule" "vpc-http-sim" {
│ 
╵
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 4000, to port: 4000, ALLOW" already exists
│   status code: 400, request id: 0eec0b6e-be4c-4673-8021-83e73dd86dde
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-sim-ui,
│   on service-magi/sim-ui-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-sim-ui":
│   11: resource "aws_security_group_rule" "vpc-http-sim-ui" {

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

Let me know how I can share my state and failing plan with you securely.

The problematic resources are here:

resource "aws_security_group_rule" "lb-http-sim" {
  description = "Allow SIM HTTP server access from load balancer"
  from_port = var.sim.port
  to_port = var.sim.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-sim" {
  description = "Allow SIM HTTP access from VPC"
  from_port = var.sim.port
  to_port = var.sim.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-http-sim-ui" {
  description = "Allow SIM UI HTTP server access from load balancer"
  from_port = var.sim_ui.port
  to_port = var.sim_ui.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-sim-ui" {
  description = "Allow SIM UI HTTP access from VPC"
  from_port = var.sim_ui.port
  to_port = var.sim_ui.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-http-magi" {
  description = "Allow Magi HTTP server access from load balancer"
  from_port = var.magi.port
  to_port = var.magi.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-magi" {
  description = "Allow Magi HTTP access from VPC"
  from_port = var.magi.port
  to_port = var.magi.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-from-magi" {
  description = "Allow access to LB from Magi"
  from_port = 0
  to_port = 0
  protocol = "-1"
  security_group_id = var.lb.security_group_id
  source_security_group_id = module.magi.security_group_id
  type = "ingress"
}

Only the three rules that refer to the VPC CIDR block are affected. The rules next to them that refer to another security group are always fine. There are no duplicate definitions in the configuration (otherwise it would always fail).

In the plan, Terraform cannot read the VPC data source, so it cannot determine the VPC CIDR block and wants to update it. The value eventually resolves to the same CIDR block, and the apply fails:

  # module.service-magi.data.aws_vpc.vpc will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_vpc" "vpc" {
      + arn                                  = (known after apply)
      + cidr_block                           = (known after apply)
      + cidr_block_associations              = (known after apply)
      + default                              = (known after apply)
      + dhcp_options_id                      = (known after apply)
      + enable_dns_hostnames                 = (known after apply)
      + enable_dns_support                   = (known after apply)
      + enable_network_address_usage_metrics = (known after apply)
      + id                                   = "vpc-xxxx"
      + instance_tenancy                     = (known after apply)
      + ipv6_association_id                  = (known after apply)
      + ipv6_cidr_block                      = (known after apply)
      + main_route_table_id                  = (known after apply)
      + owner_id                             = (known after apply)
      + state                                = (known after apply)
      + tags                                 = (known after apply)
      + timeouts {
          + read = (known after apply)
        }
    }
  # module.service-magi.aws_security_group_rule.vpc-http-magi must be replaced
+/- resource "aws_security_group_rule" "vpc-http-magi" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-4286748621" -> (known after apply)
      ~ security_group_rule_id   = "sgr-043d1f8924cdac4a5" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }
  # module.service-magi.aws_security_group_rule.vpc-http-sim must be replaced
+/- resource "aws_security_group_rule" "vpc-http-sim" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-4144434432" -> (known after apply)
      ~ security_group_rule_id   = "sgr-053c035d3bdc5e5a6" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }
  # module.service-magi.aws_security_group_rule.vpc-http-sim-ui must be replaced
+/- resource "aws_security_group_rule" "vpc-http-sim-ui" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-1254889801" -> (known after apply)
      ~ security_group_rule_id   = "sgr-0e04afca151ba885b" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }

The plan reads a lot of other data, so why can't it read this data source and determine its values before apply? And why is it going to recreate rules that already exist?

Other modules in my configuration are built the same way, and they all work fine.

Steps to Reproduce

-

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

#29393

Would you like to implement a fix?

None

justinretzolk commented 1 year ago

Hey @speller 👋 I left a comment over on #29393 that I believe will also apply here, in that I believe this comes down to the way data sources behave when they're dependent on another resource or module. If the answer I left on the other issue doesn't help, let me know and I'd be happy to help look around some more.
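
A minimal sketch of one thing to try for the deferred data source read, assuming the module is called with `depends_on` or otherwise depends on pending changes: look up the VPC once in the root module and pass the CIDR block into the module as a plain input. The variable name `vpc_cidr_block` and the module source path are hypothetical and are not taken from the reporter's configuration.

```hcl
# --- Root module ---
# This data source does not depend on any pending change,
# so Terraform can read it during plan.
data "aws_vpc" "vpc" {
  id = var.aws.vpc_id
}

module "service-magi" {
  source = "./service-magi" # hypothetical path

  # Hypothetical input that replaces the data source lookup inside the module.
  vpc_cidr_block = data.aws_vpc.vpc.cidr_block
}

# --- Inside the module ---
variable "vpc_cidr_block" {
  type        = string
  description = "CIDR block of the VPC, resolved by the caller"
}

resource "aws_security_group_rule" "vpc-http-magi" {
  description       = "Allow Magi HTTP access from VPC"
  type              = "ingress"
  protocol          = "tcp"
  from_port         = var.magi.port
  to_port           = var.magi.port
  security_group_id = module.magi.security_group_id
  cidr_blocks       = [var.vpc_cidr_block]
}
```

Whether this removes the forced replacement depends on why the data source is deferred in the first place, so treat it as something to try rather than a confirmed fix.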

speller commented 1 year ago

@justinretzolk I understand what you're saying. But the data source is configured by the VPC ID, and this ID does not change between runs. It comes from an external variable and is then propagated unchanged to many modules. So the VPC ID of the data source should be known at the planning stage.

data "aws_vpc" "vpc" {
  id = var.aws.vpc_id
}

speller commented 1 year ago

@justinretzolk One more thing. I have other security groups in different modules that are defined in the same way as the problematic ones. They also use the VPC CIDR block. They're also marked for recreation, but there are no issues during the apply stage. Only these three rules in this specific security group often fail, across different deployments. Only these three.

speller commented 1 year ago

@justinretzolk Here are the failing state and plan. After I manually delete the rules, the apply succeeds, and it can then be applied successfully many more times. This state is from the second run after creating the deployment from scratch. Duplicate SGR.zip.gpg.zip (I added the zip extension to the gpg file so that GitHub would accept the upload.)

speller commented 1 year ago

Here are more details. Dir 1 contains the failing plan. Dir 2 is from after I manually deleted the duplicate rules and the apply succeeded. But, again, this doesn't prevent the issue from occurring next time. I also added the configuration. Duplicate SGR2.zip.gpg.zip

speller commented 1 year ago

@justinretzolk Here I have two failures in a row (the configuration is the same, only the variables differ). 1st run: hit the issue. 2nd run: fixed manually. 3rd run: hit the same issue again. 4th run: fixed again. Duplicate SGR3.zip.gpg.zip

good92 commented 1 year ago

Why not name these security groups?

bryan-bar commented 8 months ago

I am facing the same issue on repeated applies with Terraform v1.5.5. In my case, it didn't start appearing until I used the `http` data source to dynamically set the controller's IP.

My temporary fix was to expand each rule so that it defines only one CIDR block. This allows Terraform to succeed, but the plan now returns two different results:

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.


This is also two steps back and one forward, since I first removed duplicate CIDR blocks and now need to expand the rules back out, one per CIDR block. It also slows Terraform down a bit and clutters the plan; however, I prefer this to manually removing resources with `awscli` or tainting resources with `terraform`.
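
For illustration, the one-CIDR-per-rule workaround could look roughly like the following. This is a sketch under assumptions, not the exact configuration used here: the variable names `allowed_cidrs`, `security_group_id`, and `port` are hypothetical.

```hcl
# Hypothetical inputs; a set also removes duplicate CIDR blocks.
variable "allowed_cidrs" {
  type = set(string)
}

variable "security_group_id" {
  type = string
}

variable "port" {
  type = number
}

# One aws_security_group_rule per CIDR block instead of one rule
# carrying the whole list in cidr_blocks.
resource "aws_security_group_rule" "per_cidr" {
  for_each = var.allowed_cidrs

  type              = "ingress"
  protocol          = "tcp"
  from_port         = var.port
  to_port           = var.port
  security_group_id = var.security_group_id
  cidr_blocks       = [each.value]
}
```

Note that `for_each` needs its keys to be known at plan time, so a CIDR that only becomes known during apply cannot be used as a key this way.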

** Update: With a larger configuration (~500+ rules), this failed as well, with a duplicate error for some rules while others succeeded.

---

I see that the documentation for the `aws_security_group_rule` resource has a note listing two new resources.

... Both of these resource were added before AWS assigned a security group rule unique ID, and they do not work well in all scenarios using the description and tags attributes, which rely on the unique ID. The aws_vpc_security_group_egress_rule and aws_vpc_security_group_ingress_rule resources have been added to address these limitations and should be used for all new security group rules...



The issue with the two new resources is that they do not accept a list of CIDRs, which I would prefer. They also appear to create duplicates, so they do not solve this issue.
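
A possible way to keep a list-shaped input with the newer resource is to fan it out with `for_each`. The sketch below only addresses the list ergonomics and is not claimed to avoid the duplicate error; the variable names `allowed_cidrs`, `security_group_id`, and `port` are hypothetical.

```hcl
# Hypothetical inputs.
variable "allowed_cidrs" {
  type = list(string)
}

variable "security_group_id" {
  type = string
}

variable "port" {
  type = number
}

# aws_vpc_security_group_ingress_rule takes a single cidr_ipv4 per rule,
# so the list is converted to a set and expanded into one rule per CIDR.
resource "aws_vpc_security_group_ingress_rule" "per_cidr" {
  for_each = toset(var.allowed_cidrs)

  security_group_id = var.security_group_id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = var.port
  to_port           = var.port
}
```

`toset` drops repeated entries, so a list that contains the same CIDR twice still yields a single rule for it.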