pulumi / pulumi-awsx

AWS infrastructure best practices in component form!
https://www.pulumi.com/docs/guides/crosswalk/aws/
Apache License 2.0

30s delay for awsx.ecs.Fargate startup due to "error ECS was unable to assume the role" #927

Open lukehoban opened 1 year ago

lukehoban commented 1 year ago

In the last few awsx.ecs.FargateServices I've created, I've seen this in the ECS Service event log:

2022-10-13 16:34:12 -0700 service service-c8e196b has started 1 tasks: task 02e5835a0e294fb3864272e1e8e8e8ed.

2022-10-13 16:33:43 -0700 service service-c8e196b failed to launch a task with (error ECS was unable to assume the role 'arn:aws:iam::111111111111:role/service-task-6cba4ed' that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role.).

I don't recall ever seeing this with the classic AWSX provider. Is it possible that we are not making the Service depend on a policy being attached to the service role, so that the first attempt to launch a task fails? ECS then appears to wait an additional 30s before retrying, which materially increases the time-to-ready of the end-to-end deployment (4m21s vs. presumably 3m51s without this).

flostadler commented 1 week ago

The trust relationship is set when creating the role, so this shouldn't be caused by a missing dependency.

I think this is caused by eventual consistency. IAM has a single global control plane in us-east-1 for the commercial partition, and changes usually propagate to the other regions within ~2 seconds. So there's a chance that the role hadn't yet propagated to the target region by the time ECS tried to launch the task. We could fix this by adding a small create delay so that ECS doesn't end up in the 30s retry timeout.
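
To make the timing argument concrete, here is a rough model of the launch sequence in TypeScript. It is only a sketch under stated assumptions: the `LaunchTiming` interface, the `secondsUntilTaskStart` function, and the fixed-interval retry behavior are hypothetical illustrations, not ECS or pulumi-awsx internals. The 30s interval comes from the event log above, the ~2s propagation figure from this comment, and the 5s create delay is an arbitrary example value.

```typescript
// Simplified timing model of the launch sequence discussed in this issue.
// All names and the fixed-interval retry are illustrative assumptions,
// not actual ECS or pulumi-awsx behavior.

interface LaunchTiming {
  // Seconds until the newly created IAM role is usable in the target region.
  propagationSeconds: number;
  // Hypothetical delay inserted between creating the role and creating the service.
  createDelaySeconds: number;
  // Seconds ECS waits after a failed launch before retrying (about 30s in the event log).
  retryIntervalSeconds: number;
}

// Returns the seconds from role creation until the first successful task launch.
function secondsUntilTaskStart(t: LaunchTiming): number {
  if (t.retryIntervalSeconds <= 0) {
    throw new Error("retryIntervalSeconds must be positive");
  }
  // The first launch attempt happens as soon as the create delay has elapsed.
  let attemptAt = t.createDelaySeconds;
  // Any attempt made before the role has propagated fails and is retried later.
  while (attemptAt < t.propagationSeconds) {
    attemptAt += t.retryIntervalSeconds;
  }
  return attemptAt;
}

const withoutDelay = secondsUntilTaskStart({
  propagationSeconds: 2,
  createDelaySeconds: 0,
  retryIntervalSeconds: 30,
});

const withDelay = secondsUntilTaskStart({
  propagationSeconds: 2,
  createDelaySeconds: 5,
  retryIntervalSeconds: 30,
});

console.log(withoutDelay, withDelay); // prints "30 5"
```

Under these assumptions, a create delay slightly longer than the propagation time costs a few seconds up front but avoids the full retry interval. If propagation occasionally takes longer than the delay, the model falls back to the retry path, so the delay reduces the likelihood of the 30s penalty rather than eliminating it.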