pulumi / pulumi-aws

An Amazon Web Services (AWS) Pulumi resource package, providing multi-language access to AWS
Apache License 2.0
460 stars 155 forks

Hang on getCallerIdentity when running pulumi up on stack with s3 bucket #2371

Closed cowlabs-xyz closed 1 month ago

cowlabs-xyz commented 1 year ago

What happened?

When deploying a known working stack without any changes through pulumi up, the process hangs indefinitely in preview on:

I0216 13:22:00.181430 88382 log.go:71] eventSink::Debug(Registering resource: t=aws:s3/bucketObject:BucketObject, name=static-unit/dist/assets/libs/bootstrap-icons/icons-bug-fill.svg, custom=true, remote=false)
I0216 13:22:00.343043 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.342997 88452 schema.go:864] Terraform output arn = {arn:aws:iam::REDACTED:REDACTED})
I0216 13:22:00.343084 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343009 88452 schema.go:864] Terraform output userId = {REDACTED})
I0216 13:22:00.343101 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343011 88452 schema.go:864] Terraform output id = {REDACTED})
I0216 13:22:00.343113 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343013 88452 schema.go:864] Terraform output accountId = {REDACTED})
I0216 13:22:00.343121 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343018 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: accountId={REDACTED})
I0216 13:22:00.343132 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343021 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: arn={arn:aws:iam::REDACTED:REDACTED})
I0216 13:22:00.343139 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343023 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: id={REDACTED})
I0216 13:22:00.343145 88382 log.go:71] eventSink::Infoerr(I0216 13:22:00.343024 88452 rpc.go:74] Marshaling property for RPC[tf.Provider[aws].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: userId={REDACTED})
I0216 13:22:00.343162 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: accountId={REDACTED}
I0216 13:22:00.343173 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: arn={arn:aws:iam::REDACTED:REDACTED}
I0216 13:22:00.343179 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: id={REDACTED}
I0216 13:22:00.343182 88382 log.go:71] Unmarshaling property for RPC[Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity).returns]: userId={REDACTED}
I0216 13:22:00.343187 88382 log.go:71] Provider[aws, 0x140004f2eb0].Invoke(aws:index/getCallerIdentity:getCallerIdentity) success (#ret=4,#failures=0) success
I0216 13:22:00.343194 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: accountId={REDACTED}
I0216 13:22:00.343198 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: arn={arn:aws:iam::REDACTED:REDACTED}
I0216 13:22:00.343202 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: id={REDACTED}
I0216 13:22:00.343205 88382 log.go:71] Marshaling property for RPC[ResourceMonitor.Invoke(aws:index/getCallerIdentity:getCallerIdentity)]: userId={REDACTED}
---- HANGS HERE ----

Expected Behavior

Either the preview completes or an error message is shown.

Steps to reproduce

  1. Run pulumi up on a known working stack
  2. Experience the hang

Note: the region is eu-north-1.

Output of pulumi about

CLI
Version     3.55.0
Go Version  go1.19.5
Go Compiler gc

Plugins

NAME                     VERSION
aws                      5.29.1
aws                      5.10.0
command                  0.5.2
docker                   3.6.1
eks                      0.42.7
kubernetes               3.22.2
kubernetes               3.20.2
kubernetes-cert-manager  0.0.3
nodejs                   unknown

Host
OS      darwin
Version 13.1
Arch    arm64

This project is written in nodejs: executable='/opt/homebrew/opt/node@16/bin/node' version='v16.19.0'

Backend
Name           pulumi.com
URL            https://app.pulumi.com/kimdanielarthur-alpinex
User           kimdanielarthur-alpinex
Organizations  kimdanielarthur-alpinex

Dependencies:

NAME                             VERSION
@pulumi/command                  0.5.2
@pulumi/kubernetesx              0.1.6
patch-package                    6.5.0
simple-sha256                    1.1.0
cdk8s-cli                        2.1.63
multimap                         1.1.0
@pulumi/awsx                     0.40.1
@pulumi/kubernetes-cert-manager  0.0.3
@pulumi/pulumi                   3.48.0
@types/uuid                      8.3.4
requestretry                     7.1.0
@types/node                      16.18.4
axios                            0.27.2
@pulumi/aws                      5.29.1
@pulumi/eks                      0.42.7
@pulumi/kubernetes               3.22.2
@types/multimap                  1.1.2

Pulumi locates its logs in /var/folders/l0/66wv34vs4hq4lpd4b6yk60k40000gn/T/ by default

Additional context

It seems to be related to an S3 bucket. When I remove it from the configuration, the preview is able to proceed to completion.

The steps I have taken to try to get past this:

  1. Update pulumi to latest
  2. Update aws cli to latest
  3. Create new aws access token and aws configure
  4. pulumi config set aws:skipRequestingAccountId true
  5. pulumi config set aws:skipMetadataApiCheck true
  6. pulumi config set aws:skipCredentialsValidation true
  7. pulumi refresh <- runs to completion
  8. aws sts get-caller-identity <- returns as expected OK
  9. rm ~/.aws/credentials
  10. Checked that I have no conflicting env vars for AWS tokens
  11. pulumi logout and login
  12. Able to create s3 bucket directly using aws s3api through cli
  13. deleted and reinstalled pulumi aws plugin
  14. created a completely new AWS user
  15. export and re-import stack
  16. Remove s3 bucket from configuration <- preview completes fully
  17. If I make a new stack, Pulumi is able to create a new s3 bucket in it
  18. If I make a new s3 bucket in the same stack that is failing, it is not able to create it <- it still hangs indefinitely
  19. validating that sts.eu-north-1.amazonaws.com resolves correctly in dns
  20. changed dns servers to see if timing or timeout issue

Are there any tips for further debug or actions to get past this stuck deployment?

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

squaremo commented 1 year ago

Thanks for the detailed log and reproduction notes, very helpful!

This is reminiscent of other problems we've seen with the NodeJS SDK (and pulumi-aws is mentioned there, too): https://github.com/pulumi/pulumi/issues/12168 is the current one.

Are you able to provide a (minimal) program which shows the problem? Is it as simple as "create an S3 bucket, then call getCallerIdentity" -- if so, I can try to reproduce it here.
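For reference, a hypothetical minimal program of that shape (the resource names, the dist directory, and the file-walking helper are invented for illustration, not taken from the reporter's stack) might look like:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as fs from "fs";
import * as path from "path";

const bucket = new aws.s3.Bucket("static-site");

// One BucketObject per file under ./dist -- the failing stack had ~2500 of these.
function walk(dir: string): string[] {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory()
      ? walk(path.join(dir, entry.name))
      : [path.join(dir, entry.name)]
  );
}

for (const file of walk("dist")) {
  new aws.s3.BucketObject(file, {
    bucket: bucket.id,
    source: new pulumi.asset.FileAsset(file),
  });
}

// The invoke that appears last in the log before the hang.
export const accountId = aws.getCallerIdentity().then((ci) => ci.accountId);
```

This is a declarative Pulumi program sketch; it only runs inside a Pulumi project with AWS credentials configured.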

cowlabs-xyz commented 1 year ago

Thanks for the reply!

There is a similarity to my issue here. I had over 2500 s3 BucketObjects.

In terms of standalone reproducibility: in the end I decided not to manage my S3 bucket with Pulumi at all, as it seems inefficient to carry all that overhead just to sync some static files to an S3 bucket.

So the only way I could make my stack deployable again was to remove the s3 bucket deployment from the stack.

Sorry that I cannot help with any further debugging, but there seems to be something lurking around that behaviour. Maybe some promises that never resolve and fail to trigger callback when there are many s3 BucketObjects?

Kim
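Kim's hypothesis of promises that never settle can be illustrated outside Pulumi. In this sketch (the helper names and the stuck index 1234 are invented), one unsettled promise among 2500 stalls Promise.all forever, and a timeout wrapper surfaces which one it was:

```typescript
// Hypothetical simulation: 2500 "uploads", one of which never settles.
function fakeUpload(i: number): Promise<string> {
  if (i === 1234) {
    return new Promise<string>(() => {}); // never resolves, never rejects
  }
  return Promise.resolve(`object-${i}`);
}

// Wrap a promise so a hang surfaces as a rejection instead of blocking forever.
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  const timeout = new Promise<T>((_, reject) =>
    setTimeout(() => reject(new Error(`timed out: ${label}`)), ms)
  );
  return Promise.race([p, timeout]);
}

async function findHang(): Promise<string> {
  const uploads = Array.from({ length: 2500 }, (_, i) =>
    withTimeout(fakeUpload(i), 500, `object-${i}`)
  );
  try {
    await Promise.all(uploads);
    return "all settled";
  } catch (err) {
    return (err as Error).message; // names the upload that hung
  }
}
```

Without the withTimeout wrapper, awaiting Promise.all here would simply never return, which matches the silent hang described above.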

squaremo commented 1 year ago

Sorry that I cannot help with any further debugging, but there seems to be something lurking around that behaviour. Maybe some promises that never resolve and fail to trigger callback when there are many s3 BucketObjects?

It's all more clues :-)

Since you're not using Pulumi for the S3 bucket and its objects, does that mean you're not blocked on this issue? (Knowing that will help us prioritise)

cowlabs-xyz commented 1 year ago

Yes, you are right; I unblocked myself by removing the S3 resources from this stack :)

Jeff-Tian commented 11 months ago

I ran into the same hang, caused by setting some null values. After removing the null value settings, it works now.
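That workaround can be approximated generically (the helper name is mine, not part of the Pulumi API): strip null/undefined entries from an args object before passing it to a resource constructor, so no explicit null ever reaches the provider.

```typescript
// Drop null/undefined entries from a resource-args object.
function stripNulls<T extends Record<string, unknown>>(args: T): Partial<T> {
  return Object.fromEntries(
    Object.entries(args).filter(
      ([, value]) => value !== null && value !== undefined
    )
  ) as Partial<T>;
}

// Example: only `bucket` and `acl` survive.
const cleaned = stripNulls({
  bucket: "my-bucket",
  acl: "private",
  tags: null,
  website: undefined,
});
```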