hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: ssoadmin resources do not retry provisioning after a provisioning error #38247

Open mbbush opened 2 months ago

mbbush commented 2 months ago

Terraform Core Version

1.5.5

AWS Provider Version

5.50.0

Affected Resource(s)

Expected Behavior

Running terraform apply again after the cause of the error has been corrected should resolve the previous errors and lead to a usable state.

Actual Behavior

When applying an aws_ssoadmin_customer_managed_policy_attachment that (incorrectly) attaches a policy that doesn't exist in account 1 to a permission set assigned to account 1, terraform apply returns an error but also records the resource in state as if it had been created successfully, so a subsequent terraform apply does nothing.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

data "aws_ssoadmin_instances" "example" {}

resource "aws_ssoadmin_permission_set" "example" {
  name         = "Example"
  instance_arn = tolist(data.aws_ssoadmin_instances.example.arns)[0]
}

data "aws_identitystore_group" "example" {
  identity_store_id = tolist(data.aws_ssoadmin_instances.example.identity_store_ids)[0]

  alternate_identifier {
    unique_attribute {
      attribute_path  = "DisplayName"
      attribute_value = "ExampleGroup"
    }
  }
}

resource "aws_ssoadmin_account_assignment" "example" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.example.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.example.arn

  principal_id   = data.aws_identitystore_group.example.group_id
  principal_type = "GROUP"

  target_id   = "123456789012"
  target_type = "AWS_ACCOUNT"
}

resource "aws_ssoadmin_customer_managed_policy_attachment" "example" {
  instance_arn       = aws_ssoadmin_permission_set.example.instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.example.arn
  customer_managed_policy_reference {
    name = "does-not-exist-in-account-123456789012"
  }
}

Steps to Reproduce

  1. Apply the manifest.
  2. Get an error because the policy does-not-exist-in-account-123456789012 does not exist in the specified account (for example, because it was accidentally created in the wrong account).
  3. Create the policy in the specified account.
  4. Apply the manifest again. There is no diff, so no changes happen.
  5. The policy cannot be used by users in the specified group.

Debug Output

No response

Panic Output

No response

Important Factoids

Digging into the code, I'm pretty sure I know what's going on.

When creating an aws_ssoadmin_customer_managed_policy_attachment resource, the provider performs the following actions:

  1. (Successfully) update the customer managed policy attachment, telling AWS that the nonexistent policy should be attached to the permission set.
  2. (Successfully) set the id of the aws_ssoadmin_customer_managed_policy_attachment in the Terraform state. This is correct, because the resource was created, but it is also the cause of the bug, because the apply is performing two distinct actions (attaching the policy and provisioning the permission set).
  3. (Successfully) invoke the asynchronous AWS SDK call ssoadmin.ProvisionPermissionSet with the "All Provisioned Accounts" target.
  4. Poll ssoadmin.DescribePermissionSetProvisioningStatus, which eventually reports a failure because the permission set cannot be provisioned to all requested accounts.
  5. Return the error from provisioning the permission set.
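
The sequence above can be illustrated with a minimal Go sketch. This is not the provider's actual code: the ssoAdminAPI interface, the fakeClient, the resourceState type, and the ID format are all hypothetical stand-ins, chosen only to show how an id can end up in state even though the overall create returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceState is a hypothetical stand-in for the Terraform resource data.
type resourceState struct {
	ID string
}

// ssoAdminAPI is a hypothetical, trimmed-down view of the calls involved.
// ProvisionPermissionSet here stands for "start provisioning and poll until done".
type ssoAdminAPI interface {
	AttachPolicy(permissionSetARN, policyName string) error
	ProvisionPermissionSet(permissionSetARN string) error
}

// fakeClient accepts the attachment but fails provisioning, as in the report.
type fakeClient struct{}

func (fakeClient) AttachPolicy(_, _ string) error { return nil }

func (fakeClient) ProvisionPermissionSet(_ string) error {
	return errors.New("provisioning failed: policy not found in target account")
}

// createAttachment mirrors the order of operations described above:
// attach, record the id, then provision.
func createAttachment(c ssoAdminAPI, st *resourceState, permissionSetARN, policyName string) error {
	if err := c.AttachPolicy(permissionSetARN, policyName); err != nil {
		return err
	}
	// The id is recorded before provisioning is attempted (hypothetical format).
	st.ID = policyName + "," + permissionSetARN
	// A failure here is returned, but the id above stays in state.
	return c.ProvisionPermissionSet(permissionSetARN)
}

func main() {
	st := &resourceState{}
	err := createAttachment(fakeClient{}, st, "arn:aws:sso:::permissionSet/example", "does-not-exist")
	fmt.Printf("error: %v\n", err)
	fmt.Printf("id recorded in state: %q\n", st.ID)
}
```

In this sketch the caller sees an error while the state already holds an id, which matches the behavior reported here: the next plan finds the resource present and proposes no changes.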

After I resolved the issue (by creating the policy in the right account), a subsequent terraform plan shows no diff, so the permission set is never re-provisioned. As a result, the policy cannot be used in the target account, and the IAM Identity Center console warns that the latest version of the permission set is not provisioned.

The way I worked around this issue was by slightly altering the session length of all my ssoadmin permission sets, to force a mass update and reprovision.

Based on the implementation I can see, the same issue would affect most of the ssoadmin resources that make a call to provision permission sets during their create, update, or delete methods.

The trouble is that none of these resources track the provisioning status of a permission set in each account, so nothing knows that provisioning failed and must be retried.

The only fix I can think of is to have the aws_ssoadmin_permission_set resource call ListAccountsForProvisionedPermissionSet with ProvisioningStatus=LATEST_PERMISSION_SET_NOT_PROVISIONED during its Read method. If the result contains any account ids, the provider would somehow surface that as a diff, which would then be corrected by Update, which already makes the call to provision the permission set.

I'm not quite sure how the provider could show this diff in a way that makes sense, since no user-defined parameter of the aws_ssoadmin_permission_set has changed. The best thing I can think of is a state parameter named account_ids_not_provisioned or something similar, which would always be [] in the desired state. Its observed value would come from the ListAccountsForProvisionedPermissionSet API call, so a non-empty result would show up as a diff.
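
A rough Go sketch of that idea follows. It is an illustration under stated assumptions, not a working provider change: provisioningLister is a hypothetical wrapper around ListAccountsForProvisionedPermissionSet (filtered to LATEST_PERMISSION_SET_NOT_PROVISIONED), permissionSetState models only the proposed account_ids_not_provisioned attribute, and the diff is reduced to a simple "is the list non-empty" check rather than anything expressed through the plugin SDK schema.

```go
package main

import "fmt"

// provisioningLister is a hypothetical wrapper around the
// ListAccountsForProvisionedPermissionSet API call, assumed to be filtered to
// ProvisioningStatus=LATEST_PERMISSION_SET_NOT_PROVISIONED.
type provisioningLister interface {
	AccountsNotProvisioned(instanceARN, permissionSetARN string) ([]string, error)
}

// permissionSetState models only the proposed (hypothetical) attribute.
type permissionSetState struct {
	AccountIDsNotProvisioned []string
}

// readPermissionSet refreshes the observed list of stale accounts.
func readPermissionSet(l provisioningLister, st *permissionSetState, instanceARN, permissionSetARN string) error {
	ids, err := l.AccountsNotProvisioned(instanceARN, permissionSetARN)
	if err != nil {
		return err
	}
	if ids == nil {
		ids = []string{} // normalize so "nothing stale" is always an empty list
	}
	st.AccountIDsNotProvisioned = ids
	return nil
}

// needsReprovision reports whether the observed state differs from the
// desired state, which is always an empty list.
func needsReprovision(st *permissionSetState) bool {
	return len(st.AccountIDsNotProvisioned) > 0
}

// fakeLister returns a fixed list so the sketch runs without AWS access.
type fakeLister struct{ stale []string }

func (f fakeLister) AccountsNotProvisioned(_, _ string) ([]string, error) {
	return f.stale, nil
}

func main() {
	st := &permissionSetState{}
	lister := fakeLister{stale: []string{"123456789012"}}
	if err := readPermissionSet(lister, st, "instance-arn", "permission-set-arn"); err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println("needs reprovision:", needsReprovision(st))
}
```

The open question from the paragraph above remains: the sketch only detects the stale accounts, and how to present that as a plan diff through the schema is left unsolved.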

I'm struggling to think of a way to express such a parameter with the Terraform Plugin SDK schema that would work and not be confusing.

References

No response

Would you like to implement a fix?

Maybe
