iam.Role is not always available upon creation and seems to be eventually consistent

graeson commented 2 years ago


Issue details

I am trying to deploy a container image from a private AWS ECR repository to AWS App Runner using Pulumi. The Pulumi code only creates two resources: an IAM role and an App Runner service. On first execution of pulumi up the IAM role is created successfully, but App Runner throws an error stating it can't assume the role.

error creating App Runner Service (<name>): 
InvalidRequestException: Error in assuming access role <arn:aws:iam>

On the second execution of pulumi up, the service assumes the role, downloads the image from ECR, and deploys to App Runner successfully. To diagnose the issue, I looked through the Pulumi output generated with pulumi up --logtostderr -v=9 2> out.txt and through the CloudTrail logs, but was not able to find any additional information about the root cause. As a sanity check, I tried recreating the same resources using CloudFormation, and that works without issue. Finally, I tried using opts to explicitly declare a depends_on relationship between the service and the role, but that didn't make a difference.

Steps to reproduce

Create a Pulumi Python project with two resources: an IAM role and an App Runner service

import json
import pulumi
import pulumi_aws as aws

role = aws.iam.Role(
    "aws-iam-role",
    assume_role_policy = json.dumps(
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "build.apprunner.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
            }
          ]
        }
    ),
    managed_policy_arns = [
        "arn:aws:iam::aws:policy/service-role/AWSAppRunnerServicePolicyForECRAccess",
    ],
)

app = aws.apprunner.Service("app",
    service_name = "hello",
    source_configuration = aws.apprunner.ServiceSourceConfigurationArgs(
        authentication_configuration = aws.apprunner.ServiceSourceConfigurationAuthenticationConfigurationArgs(
            access_role_arn = role.arn,
        ),
        image_repository = aws.apprunner.ServiceSourceConfigurationImageRepositoryArgs(
            image_configuration = aws.apprunner.ServiceSourceConfigurationImageRepositoryImageConfigurationArgs(
                port = 5000,
            ),
            image_identifier = image,  # `image` must be set to a valid ECR image URI
            image_repository_type = "ECR",
        ),
    ),
)

Set image_identifier to a valid ECR image URI
Run pulumi up to see error
Run pulumi up again to deploy successfully

Expected: App Runner to assume the IAM role, download image from ECR and deploy to App Runner on the first execution of pulumi up.

Actual: App Runner was unable to assume the IAM role on the first pulumi up and failed with "InvalidRequestException: Error in assuming access role". On the second execution of pulumi up I get the expected behavior.

leezen commented 2 years ago

Given that the second run works, I suspect an eventual consistency issue, which could be due to upstream. As a potential workaround, you could try something along the lines of access_role_arn = role.arn.apply(lambda arn: time.sleep(10) or arn) (with import time at the top of the program).
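
A minimal sketch of that workaround as a small helper, assuming a fixed delay is acceptable. The name after_delay and its sleep parameter are hypothetical and not part of Pulumi; only role.arn.apply(...) and time.sleep come from the suggestion above. The Pulumi call is shown as a comment because it only makes sense inside the program from the issue.

```python
import time


def after_delay(value, seconds=10.0, sleep=time.sleep):
    """Wait `seconds`, then return `value` unchanged.

    Hypothetical helper: `sleep` is injectable only so the delay can be
    replaced or recorded in tests.
    """
    sleep(seconds)
    return value


# Intended use inside the Pulumi program from the issue (assumes `role` is
# the aws.iam.Role defined there); the callback runs once the ARN is known:
#
#   access_role_arn = role.arn.apply(lambda arn: after_delay(arn, 10)),
```

This behaves the same as time.sleep(10) or arn, which relies on time.sleep returning None; the helper only makes the intent explicit. It delays whatever consumes the ARN and does not verify that the role can actually be assumed.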

graeson commented 2 years ago

I tried using time.sleep(10) as per your suggestion and it worked on the first pass. Just out of curiosity, I experimented with progressively shorter sleep times: it works consistently with time.sleep(4), fails intermittently with 3 seconds, and fails consistently with 2 seconds. Thanks for your help @leezen!

leezen commented 2 years ago

@graeson Thanks for confirming. It's unfortunate that this points to an issue with the underlying upstream code and requires a workaround, but at least it sounds like you're unblocked for now. I'm going to change the title of this issue to reflect that new understanding and keep the issue open for tracking.

simonknittel commented 2 years ago

For anyone using Terraform who stumbles upon this: the same thing happens with Terraform's AWS provider. Creating the access role and then immediately creating the service results in the same error. I've successfully used the time provider's sleep resource to introduce an artificial delay between those two resources: https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep

kdryetyln commented 2 years ago

When the role and the App Runner service are created in the same tf file, I think the error occurs because Terraform tries to create the App Runner service before the role is fully available. I also solved it with a sleep, as follows. The duration can be changed; I set it to 60 seconds.

resource "aws_iam_role" "myrole" {
  name = "myrole"
  assume_role_policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "build.apprunner.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "myrolepolicy" {
  role       = aws_iam_role.myrole.id
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSAppRunnerServicePolicyForECRAccess"
}

resource "time_sleep" "waitrolecreate" {
  depends_on      = [aws_iam_role.myrole]
  create_duration = "60s"
}

resource "aws_apprunner_service" "my-app-runner" {
  depends_on   = [time_sleep.waitrolecreate]
  service_name = "my-app-runner"
  . . . .

pulumi / pulumi-aws