pulumi / pulumi-signalfx

A SignalFX Pulumi resource package, providing multi-language access to SignalFX
Apache License 2.0

Error creating Integration on First Deploy #254

Open jamie1911 opened 1 year ago

jamie1911 commented 1 year ago

What happened?

We always seem to get the following error during the first pulumi up. When I run pulumi up again after the failure, it completes fine.

My guess is that when the objects are created, Splunk does some additional work in the background to set up its side of the AWS role, and this needs extra time. Would it make sense to add a retry?

@ Updating....
    pulumi:pulumi:Stack splunk-99999999999  warning: use_get_metric_data_method is deprecated: This field will be removed
 +  pulumi:pulumi:Stack splunk-99999999999 creating (0s) warning: use_get_metric_data_method is deprecated: This field will be removed
@ Updating....
 +  signalfx:aws:ExternalIntegration aws-NAME_observability_external_integration creating (0s) 
@ Updating.....
 +  signalfx:aws:ExternalIntegration aws-NAME_observability_external_integration created (2s) 
@ Updating.......
 +  aws:iam:Role splunk-observability-role creating (0s) 
@ Updating....
 +  aws:iam:Role splunk-observability-role created (0.54s) 
@ Updating....
    signalfx:aws:Integration aws-NAME_observability_integration  warning: urn:pulumi:99999999999::splunk::signalfx:aws/integration:Integration::aws-NAME_observability_integration verification warning: "use_get_metric_data_method": [DEPRECATED] This field will be removed
 +  signalfx:aws:Integration aws-NAME_observability_integration creating (0s) warning: urn:pulumi:99999999999::splunk::signalfx:aws/integration:Integration::aws-NAME_observability_integration verification warning: "use_get_metric_data_method": [DEPRECATED] This field will be removed
 +  signalfx:aws:Integration aws-NAME_observability_integration creating (0s) error: 1 error occurred:
 +  signalfx:aws:Integration aws-NAME_observability_integration **creating failed** error: 1 error occurred:
 +  pulumi:pulumi:Stack splunk-99999999999 creating (9s) error: update failed
@ Updating....
 +  pulumi:pulumi:Stack splunk-99999999999 **creating failed (8s)** 1 error; 1 warning
Diagnostics:
  pulumi:pulumi:Stack (splunk-99999999999):
    warning: use_get_metric_data_method is deprecated: This field will be removed
    error: update failed
  signalfx:aws:Integration (aws-NAME_observability_integration):
    warning: urn:pulumi:99999999999::splunk::signalfx:aws/integration:Integration::aws-NAME_observability_integration verification warning: "use_get_metric_data_method": [DEPRECATED] This field will be removed
    error: 1 error occurred:
        * creating urn:pulumi:99999999999::splunk::signalfx:aws/integration:Integration::aws-NAME_observability_integration: Unexpected status code: 400: {
      "code" : 400,
      "errorType" : "validation",
      "failedRegions" : [ "us-east-1" ],
      "message" : "Error validating AWS / Cloudwatch credentials\nValidation failed for following region(s):\nus-east-1\n[ec2] software.amazon.awssdk.services.sts.model.StsException: User: arn:aws:sts::562691491210:assumed-role/eks-us1-cloud-metric-syncer/aws-sdk-java-1690199570814 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::99999999999:role/splunk/splunk-observability (Service: Sts, Status Code: 403, Request ID: eac90aa6-0013-4b9a-9000-cc7be53ca1ea)\n[monitoring] software.amazon.awssdk.services.sts.model.StsException: User: arn:aws:sts::562691491210:assumed-role/eks-us1-cloud-metric-syncer/aws-sdk-java-1690199570814 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::99999999999:role/splunk/splunk-observability (Service: Sts, Status Code: 403, Request ID: 1c994551-421c-4672-a58e-21924cc1f6aa)",
      "successRegions" : [ ]
    }
    Please verify you are using an admin token when working with integrations
Outputs:
    splunk_observability_role_arn: "arn:aws:iam::99999999999:role/splunk/splunk-observability"
Resources:
    + 3 created
Duration: 11s

Expected Behavior

The pulumi_signalfx.aws.ExternalIntegration and pulumi_signalfx.aws.Integration resources should both be created successfully and in a timely manner.

Steps to reproduce

Code to reproduce, minus some of the parameter setup for pulumi_signalfx.aws.Integration:

import json

import pulumi
import pulumi_aws as aws
import pulumi_signalfx

config = pulumi.Config()
account_prefix = config.require("account-prefix")

observability_external_integration = pulumi_signalfx.aws.ExternalIntegration(
    f"{account_prefix}_observability_external_integration"
)

observability_role = aws.iam.Role(
    "splunk-observability-role",
    name="splunk-observability",
    path="/splunk/",
    assume_role_policy=pulumi.Output.all(
        observability_external_integration.signalfx_aws_account, observability_external_integration.external_id
    ).apply(
        lambda args: json.dumps(
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Sid": "SplunkAssumeRole",
                        "Effect": "Allow",
                        "Principal": {"AWS": args[0]},
                        "Action": "sts:AssumeRole",
                        "Condition": {"StringEquals": {"sts:ExternalId": args[1]}},
                    }
                ],
            }
        )
    ),
    inline_policies=[
        aws.iam.RoleInlinePolicyArgs(
            name="publishpolicy",
            policy=json.dumps(
                {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Action": [
                                "apigateway:GET",
                                "autoscaling:DescribeAutoScalingGroups",
                                "cloudcontrol:ListResources",
                                "cloudcontrol:GetResource",
                                "cloudfront:GetDistributionConfig",
                                "cloudfront:ListDistributions",
                                "cloudfront:ListTagsForResource",
                                "cloudwatch:DescribeAlarms",
                                "cloudwatch:GetMetricData",
                                "cloudwatch:GetMetricStatistics",
                                "cloudwatch:ListMetrics",
                                "directconnect:DescribeConnections",
                                "dynamodb:DescribeTable",
                                "dynamodb:ListTables",
                                "dynamodb:ListTagsOfResource",
                                "ec2:DescribeInstances",
                                "ec2:DescribeInstanceStatus",
                                "ec2:DescribeNatGateways",
                                "ec2:DescribeRegions",
                                "ec2:DescribeReservedInstances",
                                "ec2:DescribeReservedInstancesModifications",
                                "ec2:DescribeTags",
                                "ec2:DescribeVolumes",
                                "ecs:DescribeClusters",
                                "ecs:DescribeServices",
                                "ecs:DescribeTasks",
                                "ecs:ListClusters",
                                "ecs:ListServices",
                                "ecs:ListTagsForResource",
                                "ecs:ListTaskDefinitions",
                                "ecs:ListTasks",
                                "eks:DescribeCluster",
                                "eks:ListClusters",
                                "elasticache:DescribeCacheClusters",
                                "elasticloadbalancing:DescribeLoadBalancerAttributes",
                                "elasticloadbalancing:DescribeLoadBalancers",
                                "elasticloadbalancing:DescribeTags",
                                "elasticloadbalancing:DescribeTargetGroups",
                                "elasticmapreduce:DescribeCluster",
                                "elasticmapreduce:ListClusters",
                                "es:DescribeElasticsearchDomain",
                                "es:ListDomainNames",
                                "kinesis:DescribeStream",
                                "kinesis:ListShards",
                                "kinesis:ListStreams",
                                "kinesis:ListTagsForStream",
                                "kinesisanalytics:ListApplications",
                                "kinesisanalytics:DescribeApplication",
                                "lambda:GetAlias",
                                "lambda:ListFunctions",
                                "lambda:ListTags",
                                "logs:DeleteSubscriptionFilter",
                                "logs:DescribeLogGroups",
                                "logs:DescribeSubscriptionFilters",
                                "logs:PutSubscriptionFilter",
                                "organizations:DescribeOrganization",
                                "rds:DescribeDBInstances",
                                "rds:DescribeDBClusters",
                                "rds:ListTagsForResource",
                                "redshift:DescribeClusters",
                                "redshift:DescribeLoggingStatus",
                                "s3:GetBucketLocation",
                                "s3:GetBucketLogging",
                                "s3:GetBucketNotification",
                                "s3:GetBucketTagging",
                                "s3:ListAllMyBuckets",
                                "s3:ListBucket",
                                "s3:PutBucketNotification",
                                "sqs:GetQueueAttributes",
                                "sqs:ListQueues",
                                "sqs:ListQueueTags",
                                "states:ListActivities",
                                "states:ListStateMachines",
                                "tag:GetResources",
                                "workspaces:DescribeWorkspaces",
                            ],
                            "Resource": "*",
                        }
                    ],
                }
            ),
        ),
    ],
    opts=pulumi.ResourceOptions(depends_on=[observability_external_integration])
)

pulumi_signalfx.aws.Integration(
        f"{account_prefix}_observability_integration",
        enabled=True,
        use_get_metric_data_method=True,
        named_token=token_name,
        integration_id=observability_external_integration.id,
        external_id=observability_external_integration.external_id,
        role_arn=observability_role.arn,
        regions=regions,
        poll_rate=300,
        enable_check_large_volume=True,
        import_cloud_watch=True,
        enable_aws_usage=False,
        namespace_sync_rules=namespace_sync_rules,
        sync_custom_namespaces_only=False,
        custom_namespace_sync_rules=custom_namespace_sync_rules,
        opts=pulumi.ResourceOptions(depends_on=[observability_role, observability_external_integration]),
)

Output of pulumi about

pulumi-3.76.0 pulumi-aws-5.42.0 pulumi-docker-3.6.1 pulumi-gitlab-6.1.1 pulumi-signalfx-5.10.0

Additional context

No response

rquitales commented 1 year ago

@jamie1911 Thanks for reporting this issue, and sorry you're facing this. I'm still trying to reproduce it on my side and will update once I do. To clarify, does this failure to create an Integration occur frequently? From the logs you provided, it does appear to be a timing-related issue, so a retry might be a potential solution.
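
The retry idea mentioned above could look roughly like the following. This is a minimal, generic sketch of retrying with exponential backoff, not the provider's actual code: `retry_with_backoff` and `TransientError` are hypothetical names, and the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying,
    such as the 400 validation error in the log above."""


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,          # assumed limit, not taken from the provider
    base_delay: float = 2.0,    # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Only errors marked as transient are retried, so a genuinely wrong token or role ARN would still fail on the first attempt instead of being hidden behind the backoff.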

jamie1911 commented 1 year ago

Hello @rquitales, the issue happens only on the first deployment, when the stack creates the first pulumi_signalfx.aws.ExternalIntegration and the first aws.iam.Role needed by the first pulumi_signalfx.aws.Integration. We create AWS accounts fairly regularly for different projects or developers. Whenever we create one, someone adds the new AWS account to Splunk Observability through a new Pulumi stack in the project that uses the code referenced in the issue.

It ALWAYS fails the first time we run pulumi up, normally with the error shown above. However, when we rerun pulumi up, it succeeds.

My guess is that right after the IAM role is created in our account, Splunk cannot assume it yet. The Integration has a role_arn, which tells Splunk which role to assume. I think there is some delay on the Splunk side while the role is set up, but Pulumi or the provider checks whether it is complete too soon.
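
If the cause really is a short delay before the new role can be assumed, one possible user-side workaround is to hold back the role ARN for a while before it reaches the Integration. The sketch below is an untested assumption rather than a confirmed fix: `after_delay` is a hypothetical helper, and the 30 second wait is a guess, not a documented value.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def after_delay(
    value: T,
    seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Block for the given number of seconds, then return value unchanged."""
    sleep(seconds)
    return value


# Hypothetical usage in the program from the issue (30 seconds is a guess):
#
#   role_arn=observability_role.arn.apply(lambda arn: after_delay(arn, 30)),
```

Because the Integration receives the ARN only after the callback returns, its create call would start later than the role creation by at least that delay. One likely downside is that the callback also runs on later updates, once the ARN is already known, so every subsequent pulumi up would pay the same wait.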