hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

AWS SSO - Throttling Exception: Rate exceeded #25552

Closed frankpengau closed 6 months ago

frankpengau commented 2 years ago

Hey Terraform AWS Provider Community,

Hope you're all well.

I'm having an issue with the AWS SSO Admin service in the Terraform AWS provider.

To give a brief background, we are using SSOAdmin to spin up 40+ Permission Sets, corresponding to our 40+ IAM Roles.

Each permission set has an inline policy and a managed policy attached. It then gets an account assignment for a specific environment (AWS account), targeting a specific AWS SSO group.

We've structured the setup as modules:

An example of one of our Permission Set terraform files:

data "aws_ssoadmin_instances" "GroupA-admin" {}

resource "aws_ssoadmin_permission_set" "GroupA-admin" {
  name             = "GroupA-admin"
  description      = "GroupA"
  # Reference the data source by the name it is declared with above.
  instance_arn     = tolist(data.aws_ssoadmin_instances.GroupA-admin.arns)[0]
  session_duration = "PT8H"
  tags = {
    Name = "GroupA-admin"
  }
}

An example of one of our Policy terraform files:

data "aws_ssoadmin_instances" "GroupA-admin" {}

data "aws_ssoadmin_permission_set" "GroupA-admin" {
  instance_arn = tolist(data.aws_ssoadmin_instances.GroupA-admin.arns)[0]
  name         = "GroupA-admin"
}

data "aws_iam_policy_document" "GroupA-admin" {
  statement {
    actions = [
      "sts:AssumeRole"
    ]
    resources = [
      "arn:aws:iam::*:role/GroupA-admin"
    ]
  }
}

resource "aws_ssoadmin_permission_set_inline_policy" "GroupA-admin" {
  inline_policy      = data.aws_iam_policy_document.GroupA-admin.json
  instance_arn       = data.aws_ssoadmin_permission_set.GroupA-admin.instance_arn
  permission_set_arn = data.aws_ssoadmin_permission_set.GroupA-admin.arn
}

resource "aws_ssoadmin_managed_policy_attachment" "GroupA-admin-managed-policy-attachment" {
  instance_arn       = data.aws_ssoadmin_permission_set.GroupA-admin.instance_arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
  permission_set_arn = data.aws_ssoadmin_permission_set.GroupA-admin.arn
}

An example of one of our Account Assignment terraform files:

data "aws_ssoadmin_instances" "GroupA-admin" {}

output "identity_store_id_GroupA-admin" {
  value = tolist(data.aws_ssoadmin_instances.GroupA-admin.identity_store_ids)[0]
}

data "aws_ssoadmin_permission_set" "GroupA-admin" {
  instance_arn = tolist(data.aws_ssoadmin_instances.GroupA-admin.arns)[0]
  name         = "GroupA-admin"
}

resource "aws_ssoadmin_account_assignment" "GroupA-admin-assign-dev-apse2" {
  instance_arn       = data.aws_ssoadmin_permission_set.GroupA-admin.instance_arn
  permission_set_arn = data.aws_ssoadmin_permission_set.GroupA-admin.arn
  principal_id       = var.groups["GroupA"].groupid
  principal_type     = "GROUP"

  target_id   = var.dev-apse2
  target_type = "AWS_ACCOUNT"
}
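As a rough sketch only, the three files above could be collapsed so that the instance is looked up once and the permission set ARN is taken from the resource itself rather than re-read through a data source in each file. This assumes the resources live in the same module (otherwise the ARN would have to be passed in as a module output); the names `this` and `instance_arn` and the variable shapes are illustrative assumptions, and whether this reduces throttling in practice has not been verified here.

```hcl
# Assumed variable shapes; the original post does not show these declarations.
variable "groups" {
  type = map(object({ groupid = string }))
}

variable "dev-apse2" {
  type = string
}

# Single instance lookup shared by every resource below.
data "aws_ssoadmin_instances" "this" {}

locals {
  # Hypothetical local name, not part of the original configuration.
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]
}

resource "aws_ssoadmin_permission_set" "GroupA-admin" {
  name             = "GroupA-admin"
  description      = "GroupA"
  instance_arn     = local.instance_arn
  session_duration = "PT8H"
}

resource "aws_ssoadmin_managed_policy_attachment" "GroupA-admin" {
  instance_arn       = local.instance_arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
  # Direct reference to the resource instead of a data source lookup by name.
  permission_set_arn = aws_ssoadmin_permission_set.GroupA-admin.arn
}

resource "aws_ssoadmin_account_assignment" "GroupA-admin-assign-dev-apse2" {
  instance_arn       = local.instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.GroupA-admin.arn
  principal_id       = var.groups["GroupA"].groupid
  principal_type     = "GROUP"

  target_id   = var.dev-apse2
  target_type = "AWS_ACCOUNT"
}
```

The direct references also give Terraform an explicit dependency order between the permission set, its policy attachment, and the account assignment.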

Terraform gets stuck on "Still Creating..." for a large number of AWS SSO calls.

We have tried setting TF_CLI_ARGS_apply="-parallelism=1" and have diagnosed the issue with TF_LOG_PROVIDER=DEBUG.

What we are running into is the Terraform AWS provider making too many calls to the AWS SSO APIs (via aws-sdk-go), which causes a ThrottlingException: Rate exceeded error. It appears to call DescribePermissionSet too often in order to confirm that the permission set has been created, because CreatePermissionSet calls are eventually consistent.

According to the AWS SSO quotas documentation on throttle limits:

AWS SSO APIs have a collective throttle limit maximum of 20 transactions per second (TPS). The CreateAccountAssignment has a maximum rate of 10 outstanding async calls. These quotas cannot be changed.

Source: https://docs.aws.amazon.com/singlesignon/latest/userguide/limits.html#ssothrottlelimits

So we suspect that we are hitting the 20 TPS limit and therefore encountering the ThrottlingException. Is there a way to reduce the number of permission set validation calls per second so that we don't go over the limit?
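One knob that exists today is the provider-level `max_retries` argument. The sketch below does not lower the request rate by itself; it only caps how many times a throttled call is retried, so a throttled run fails sooner instead of sitting on "Still Creating..." for hours. The value 5 is an arbitrary illustration, and the region is taken from the logs in this issue.

```hcl
provider "aws" {
  region = "ap-southeast-2"

  # max_retries defaults to 25, which matches the "attempt 4/25" seen in the
  # debug logs. 5 is an arbitrary example value, not a recommendation.
  max_retries = 5
}
```

A lower value trades resilience for faster feedback, so it is probably more useful for diagnosing the problem than as a permanent setting.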

A sample of the DEBUG logs:

-----------------------------------------------------: timestamp=2022-06-23T16:02:20.252Z
2022-06-23T16:02:20.265Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Retrying Request SSO Admin/DescribePermissionSet, attempt 4: timestamp=2022-06-23T16:02:20.265Z
2022-06-23T16:02:20.265Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Request SSO Admin/DescribePermissionSet Details:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1
Host: sso.ap-southeast-2.amazonaws.com
User-Agent: APN/1.0 HashiCorp/1.0 Terraform/1.0.7 (+https://www.terraform.io/) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.34 (go1.17.6; linux; amd64)
Content-Length: 157
Authorization: AWS4-HMAC-SHA256 Credential=ASIAXXXXXXXXXXXXXXXX/20220623/ap-southeast-2/sso/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-amz-security-token;x-amz-target, Signature=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Content-Type: application/x-amz-json-1.1
X-Amz-Date: 20220623T160220Z
X-Amz-Security-Token: xxxxx
X-Amz-Target: SWBExternalService.DescribePermissionSet
Accept-Encoding: gzip
{"InstanceArn":"arn:aws:sso:::instance/ssoins-xxxxxxxxxxxxxxxx","PermissionSetArn":"arn:aws:sso:::permissionSet/ssoins-xxxxxxxxxxxxxxxx/ps-1234567890123456"}
-----------------------------------------------------: timestamp=2022-06-23T16:02:20.265Z
2022-06-23T16:02:20.293Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Response SSO Admin/DescribePermissionSet Details:
---[ RESPONSE ]--------------------------------------
HTTP/2.0 400 Bad Request
Content-Length: 58
Content-Type: application/x-amz-json-1.1
Date: Thu, 23 Jun 2022 16:02:20 GMT
X-Amzn-Requestid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
-----------------------------------------------------: timestamp=2022-06-23T16:02:20.293Z
2022-06-23T16:02:20.293Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] {"__type":"ThrottlingException","message":"Rate exceeded"}: timestamp=2022-06-23T16:02:20.293Z
2022-06-23T16:02:20.293Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Validate Response SSO Admin/DescribePermissionSet failed, attempt 4/25, error ThrottlingException: Rate exceeded: timestamp=2022-06-23T16:02:20.293Z
2022-06-23T16:02:20.295Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Retrying Request SSO Admin/DescribePermissionSet, attempt 1: timestamp=2022-06-23T16:02:20.295Z
2022-06-23T16:02:20.296Z [DEBUG] provider.terraform-provider-aws_v4.19.0_x5: [aws-sdk-go] DEBUG: Request SSO Admin/DescribePermissionSet Details:

frankpengau commented 2 years ago

One thing I would like to add is that this runs in a pipeline, which timed out after 8 hours.

For example, even when limited to a single Managed Policy Attachment, the apply can take over an hour and still be stuck on "Still Creating...". I have triple-checked that this is not a duplicate managed policy attachment issue like the one referenced in #21543.

frankpengau commented 2 years ago

The ThrottlingException can also be seen in CloudTrail, as follows:

"eventTime": "2022-06-23T16:02:20Z",
    "eventSource": "sso.amazonaws.com",
    "eventName": "DescribePermissionSet",
    "awsRegion": "ap-southeast-2",
    "sourceIPAddress": "12.34.56.78",
    "userAgent": "APN/1.0 HashiCorp/1.0 Terraform/1.0.7 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.34 (go1.17.6; linux; amd64)",
    "errorCode": "ThrottlingException",
    "errorMessage": "Rate exceeded",
    "requestParameters": null,
    "responseElements": null,
    "requestID": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "eventID": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "XXXXX-XXX-XXXXXX-XXX-XXXXXX",
        "clientProvidedHostHeader": "sso.ap-southeast-2.amazonaws.com"
    }

lanzrein commented 1 year ago

Hello,

I am encountering a similar issue with Identity Center.

Using the following block:

data "aws_identitystore_user" "this" {
  identity_store_id = var.identity_store_id

  filter {
    attribute_path  = "UserName"
    attribute_value = var.username
  }
}

We get the following type of error in the Terraform log:

│ Error: reading AWS SSO Identity Store User Data Source (d-xxxxxxxxxx): operation error identitystore: DescribeUser, https response error StatusCode: 400, RequestID: xxxxxxxxx, deserialization failed, failed to decode response body, invalid character '<' looking for beginning of value
│
│   with data.aws_identitystore_user.this,
│   on data.tf line xx, in data "aws_identitystore_user" "this":
│   17: data "aws_identitystore_user" "this" {

After investigating in AWS CloudTrail, I noticed that the calls are indeed being rate limited:

{
  "eventVersion": "1.08",

  "eventTime": "2022-10-25T14:47:41Z",
  "eventSource": "sso.amazonaws.com",
  "eventName": "ListAccountAssignments",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "xx.xx.xx.xx",
  "userAgent": "APN/1.0 HashiCorp/1.0 Terraform/1.2.9 (+https://www.terraform.io) terraform-provider-aws/dev (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.44.117 (go1.18.4; linux; amd64)",
  "errorCode": "ThrottlingException",
  "errorMessage": "Rate exceeded",
  "requestParameters": null,
  "responseElements": null,
  "requestID": "xxxxxx",
  "eventID": "xxxxxx",
  "readOnly": true,
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "recipientAccountId": "xxxxxxx",
  "eventCategory": "Management",
  "tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "sso.eu-west-1.amazonaws.com"
  }
}

Did you find a workaround while waiting for your MR to be accepted?

devinbfergy commented 7 months ago

I am running into this issue pretty heavily. Has anyone figured out a solution? We have about 30 permission sets assigned to about 50 groups, and we can't even run a plan without hitting the rate limit.

│ Error: reading SSO Managed Policy Attachment (arn:aws:iam::aws:policy/ReadOnlyAccess,arn:aws:sso:::permissionSet/ssoins-STUFF/ps-STUFF,arn:aws:sso:::instance/ssoins-STUFF): operation error SSO Admin: ListManagedPoliciesInPermissionSet, failed to get rate limit token, retry quota exceeded, 2 available, 5 requested

kazukousen commented 7 months ago

I too have run into this issue. In my case, the cause seems to be an over-configuration of parallelism.

$ echo $TF_CLI_ARGS_plan
--parallelism=100000
$ echo $TF_CLI_ARGS_apply
--parallelism=100000

I rolled it back to the default of 10, and the problem no longer occurs.

github-actions[bot] commented 6 months ago

This functionality has been released in v5.42.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
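To pick up that release, a version constraint along these lines should be enough. This is a minimal sketch: only the `5.42.0` version number comes from the comment above, and the rest is the standard `required_providers` layout.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # First release reported above to contain the change.
      version = ">= 5.42.0"
    }
  }
}
```

After changing the constraint, running `terraform init -upgrade` lets Terraform select a newer provider version than the one recorded in the dependency lock file.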
