hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Stack set instance remains when an OU is removed in organizational_unit_ids #25253

Status: Closed (e88z4 closed this issue 1 month ago)

e88z4 commented 2 years ago

Terraform CLI and Terraform AWS Provider Version

Terraform v1.2.2
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v4.17.1

Affected Resource(s)

aws_cloudformation_stack_set_instance

Terraform Configuration Files

resource "aws_cloudformation_stack_set_instance" "stackset_instance" {
  for_each = var.stack_instance_target_region

  stack_set_name = aws_cloudformation_stack_set.stackset.name
  region         = each.value

  deployment_targets {
    organizational_unit_ids = var.stack_instance_target_ou
  }
}

variable "stack_instance_target_ou" {
  description = "target ou for the stack set deployment"
  type        = set(string)
}

Expected Behavior

When removing an OU from a set of OUs using organizational_unit_ids inside the deployment_targets block, the stack set instance of the removed OU should be deleted.

Actual Behavior

When removing an OU from a set of OUs using organizational_unit_ids inside the deployment_targets block, the stack set instance of the removed OU remains.

  1. var.stack_instance_target_ou holds a set of OU ID strings, for example ["my-ou-id-1","my-ou-id-2"].
  2. Run terraform apply. The stack set and the stack set instances for the accounts under the targeted OUs ["my-ou-id-1","my-ou-id-2"] are created in AWS.
  3. Change var.stack_instance_target_ou to target only ["my-ou-id-1"].
  4. Run terraform apply again. The stack set instances for ["my-ou-id-2"] remain.
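
For anyone reproducing this, the variable values behind the steps above might look like the following sketch. The region value is an assumption, since the report does not say which regions were targeted, and stack_instance_target_region is referenced in the configuration but never declared in the report.

```hcl
# terraform.tfvars for the first apply (step 2)
stack_instance_target_ou = ["my-ou-id-1", "my-ou-id-2"]

# Assumed value: the report does not list the regions it used.
stack_instance_target_region = ["us-east-1"]

# For the second apply (step 4), replace the OU line with:
# stack_instance_target_ou = ["my-ou-id-1"]
```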

e88z4 commented 2 years ago

I found that when an OU is removed, Terraform recognizes the removal, and the state file drops the deleted OU as well.

  ~ deployment_targets {
      ~ organizational_unit_ids = [
          - "my-ou-id-2",
            # (1 unchanged element hidden)
        ]
    }

Then I queried the stack set with the AWS CLI and found the following:

aws cloudformation describe-stack-set --stack-set-name my-stack-set

"OrganizationalUnitIds": [
    "my-ou-id-1",
    "my-ou-id-2"
]

There is a disconnect between the Terraform state and what is actually in the stack set itself.

e88z4 commented 2 years ago

I ran terraform apply with debug logging. It seems Terraform makes a stack update AWS API call when an OU is removed from the list.

To delete the instances, Terraform needs to call the stackset deletion API instead. https://docs.aws.amazon.com/AWSCloudFormation/latest/APIReference/API_DeleteStackSet.html

geof2001 commented 1 year ago

Having this issue as well. I noticed that the state for the stackset_instance resource shows the AWS account ID of the first stack that was created by the resource. It's not the ID of the stack set's own account, just that of the first stack instance. If I remove an OU, or even try to destroy the whole stack set, it only deletes the first stack set instance, the one in the listed account_id "123456789012".

resource "aws_cloudformation_stack_set_instance" "stack_set_instance_external_ou" {
    account_id             = "123456789012"
    call_as                = "SELF"
    id                     = "Managed-IAM-Roles,123456789012,us-east-2"
    organizational_unit_id = "ou-bva7-xrajzix1"
    region                 = "us-east-2"
    retain_stack           = false
    stack_id               = "arn:aws:cloudformation:us-east-2:123456789012:stack/StackSet-Managed-IAM-Roles-f7c380e9-47ac-4806-b2db-eea4ca1a0d95/ef4a4d00-612d-11ed-bc18-06931f26a384"
    stack_set_name         = "Managed-IAM-Roles"

    deployment_targets {
        organizational_unit_ids = [
            "ou-bva0-abcdefg1",
            "ou-bva2-hijklmno3",
        ]
    }
}

I'm working around this by creating separate instances for each OU. If there are multiple accounts in that particular OU we still end up with the same issue when removing stacks.
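
A minimal sketch of what that per-OU workaround could look like, assuming the stack set resource from the original report (aws_cloudformation_stack_set.stackset) and two hypothetical variables, target_ou_ids and target_regions. It creates one instance resource per OU and region pair, so removing an OU from the set removes a whole resource rather than updating a list in place. As noted above, it does not help when a single OU holds several accounts.

```hcl
# Hypothetical variable names; not taken from the thread.
variable "target_ou_ids" {
  type = set(string)
}

variable "target_regions" {
  type = set(string)
}

locals {
  # One map entry per (OU, region) pair, keyed as "ou-id/region".
  ou_region_pairs = {
    for pair in setproduct(var.target_ou_ids, var.target_regions) :
    "${pair[0]}/${pair[1]}" => { ou_id = pair[0], region = pair[1] }
  }
}

resource "aws_cloudformation_stack_set_instance" "per_ou" {
  for_each = local.ou_region_pairs

  # Refers to the stack set from the original report.
  stack_set_name = aws_cloudformation_stack_set.stackset.name
  region         = each.value.region

  deployment_targets {
    # Exactly one OU per instance resource.
    organizational_unit_ids = [each.value.ou_id]
  }
}
```

The region argument is written as in the provider versions discussed in this thread.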

ben-earl-tfs commented 1 year ago

Hitting the same issue. I'm starting to develop a module to manage our stack sets. When I assign multiple OUs to each "aws_cloudformation_stack_set_instance", a destroy only removes the first stack in the first account in the first OU. I see that account number in the state. After a destroy runs, I have to log in manually and remove the stack set instances that were not destroyed before I can destroy the stack set itself.

main.tf

resource "aws_cloudformation_stack_set" "infra" {
  for_each = var.stack_sets

  name             = each.key
  description      = each.value.description
  template_body    = file(each.value.cft_file)
  permission_model = "SERVICE_MANAGED"

  capabilities = ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"]

  auto_deployment {
    enabled                          = true
    retain_stacks_on_account_removal = false
  }

  timeouts {
    update = "2h"
  }
}

resource "aws_cloudformation_stack_set_instance" "infra" {
  for_each = {
    for name, config in var.stack_sets : name => config.ou_ids
  }

  stack_set_name = each.key

  deployment_targets {
    organizational_unit_ids = each.value
  }

  operation_preferences {
    failure_tolerance_percentage = 50
    max_concurrent_percentage    = 50
  }

  timeouts {
    create = "2h"
    update = "2h"
    delete = "2h"
  }

  depends_on = [
    aws_cloudformation_stack_set.infra
  ]
}

variables.tf

variable "stack_sets" {
  type = map(object({
    cft_file    = string
    description = string
    ou_ids      = list(string)
  }))
  description = "A map with StackSet names as keys and an object containing the json cft, description, and OUs to push the stack to."
  default     = {}
}

terraform.auto.tfvars

stack_sets = {
  "all-infra-default" = {
    cft_file    = "cfts/infra-roles.template.json"
    description = "Default infrastructure stacks to go in all accounts"
    ou_ids      = ["ou-xyz-abcdefgh", "ou-zyx-hgfedcba", "ou-yxz-lmnopqrs"]
  }
  "group-infra-default" = {
    cft_file    = "cfts/group-infra-roles.template.json"
    description = "Group OU specific stacks"
    ou_ids      = ["ou-zzy-hijklmno", "ou-zyy-a1b2c3d4"]
  }
}

This will create two stack sets and two stack set instances. The stack does apply to the entire OU, including accounts in all sub-OUs. The stack set instance has the account ID of the first account it encounters in the OU tree. When a destroy is run, only the stack instance that was deployed to that account is removed. The stack instances in all other accounts remain, so the destroy can't remove the stack set resources.

Adding or removing OUs from the list per stack_set results in an error, an OU that has no accounts results in an error, and destroy fails because stack set instances remain after the destroy.

If I run: terraform destroy -target=aws_cloudformation_stack_set_instance.infra[\"all-infra-default\"]

This does destroy a resource, but only the one in the first account it found in the OU tree. Subsequently running: terraform destroy -target=aws_cloudformation_stack_set.infra[\"all-infra-default\"]

I'd expect it to destroy only the targeted stack set. Instead it wants to destroy both stack set resources for some reason (even though it does warn that the run was targeted), and it does.

It is really hard to predict what will happen after adding or removing OUs, and what will actually get pushed. I haven't run through all scenarios yet, but there is definitely something amiss with managing stack sets.

Currently it looks like the only way to manage the stacks is to apply them to specific accounts, but we're working with hundreds, so that isn't feasible. We'll have to develop something external that pulls accounts from OUs and feeds them to Terraform.
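
One possible way to pull accounts from OUs without leaving Terraform is sketched below. It assumes a provider version that ships the aws_organizations_organizational_unit_descendant_accounts data source and credentials that are allowed to read the organization. The variable name ou_ids is hypothetical. The sketch only gathers account IDs; how they are fed into stack deployments is left open.

```hcl
# Hypothetical variable name; not taken from the thread.
variable "ou_ids" {
  type = list(string)
}

# Assumption: this data source is available in the provider version in use.
data "aws_organizations_organizational_unit_descendant_accounts" "targets" {
  for_each  = toset(var.ou_ids)
  parent_id = each.value
}

locals {
  # Flatten the per-OU account lists into one de-duplicated list of IDs.
  target_account_ids = distinct(flatten([
    for ou in data.aws_organizations_organizational_unit_descendant_accounts.targets :
    [for acct in ou.accounts : acct.id]
  ]))
}

output "target_account_ids" {
  value = local.target_account_ids
}
```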

Preskton commented 1 year ago

I'm encountering similar behavior when attempting to delete the base stack set itself. The delete doesn't happen, I'm guessing because the individual stacks distributed to our accounts aren't getting cleaned up as expected, which results in Terraform yielding:

StackSetNotEmptyException: StackSet is not empty status code: 409

ben-earl-tfs commented 1 year ago

I gave this some thought, and this behavior makes sense. It is like an ASG: Terraform can create it and its launch config, and then stuff happens, but AWS ultimately manages the state of the machines in there based on the ASG/launch config params. This is similar: Terraform can create the stack set and apply the parameters around where the stacks are to be deployed, but AWS ultimately manages the state of the stacks inside it.

I'm not sure there is any solution that can be implemented in Terraform for this, since the stack set is itself a stack state tracking concept. Our roadmap is to dismantle stack sets and deploy individual stacks to accounts during build, shifting the state responsibility to Terraform. All the same functionality, just with state managed in Terraform like we want!
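
As a rough illustration of that direction, a single stack deployed straight into one member account could look like the sketch below. The role ARN, the provider alias, and the stack name are hypothetical. The template path and capabilities are borrowed from the configuration earlier in this thread. With this shape, each stack is an ordinary Terraform resource with its own state entry.

```hcl
# Hypothetical cross-account role; replace with whatever role your
# organization uses for deployments into member accounts.
provider "aws" {
  alias  = "member"
  region = "us-east-2"

  assume_role {
    role_arn = "arn:aws:iam::123456789012:role/ExampleDeployRole"
  }
}

# One plain CloudFormation stack in the member account, tracked
# directly in Terraform state instead of inside a stack set.
resource "aws_cloudformation_stack" "infra_roles" {
  provider = aws.member

  name          = "infra-roles" # hypothetical stack name
  template_body = file("cfts/infra-roles.template.json")
  capabilities  = ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"]
}
```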

github-actions[bot] commented 1 month ago

This functionality has been released in v5.58.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
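
To pick up that release, a version constraint along these lines should be enough. This is a minimal sketch; only the v5.58.0 floor comes from the comment above.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.58.0"
    }
  }
}
```

After changing the constraint, terraform init -upgrade fetches a matching provider version.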
