pulumi / pulumi-gcp

A Google Cloud Platform (GCP) Pulumi resource package, providing multi-language access to GCP
Apache License 2.0

TestAccBucket is flaky #1267

Closed rquitales closed 1 year ago

rquitales commented 1 year ago

What happened?

TestAccBucket fails most of the time and passes only occasionally. The test creates a Cloud Storage bucket and 4 Cloud Functions that listen to the bucket. We get the following error when running the test:

Error waiting for Creating CloudFunctions Function: Error code 13, message: Failed to configure trigger PubSub projects/md25dca276651e46a-tp/topics/cloud-functions-b3ms5le5w3igisffbqbsdwu3by

This occurs on one or more of the defined callback functions. When running the test simultaneously locally, I got 3 passing runs out of 20 total runs, a 15% success rate.

We should disable this test to unblock CI for now while investigating the root cause.

Example

Example failed run: https://github.com/pulumi/pulumi-gcp/actions/runs/6493385756/job/17663277513

Example passing run: https://github.com/pulumi/pulumi-gcp/actions/runs/6486428609/job/17614962663

Output of pulumi about

NA

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

t0yv0 commented 1 year ago

Thanks for logging this, @rquitales!

I wonder whether there is some contention at play because the resource names are insufficiently random, or whether our provider is missing an opportunity to properly retry with backoff.
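To make the retry/backoff idea concrete, here is a minimal sketch of exponential backoff with jitter. It is not the provider's actual retry logic: `TransientError`, `retry_with_backoff`, and all parameter defaults are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable cloud API error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with capped backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Exponential delay capped at max_delay, with full jitter so
            # parallel callers do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting; whether a given error should count as transient is an assumption the caller has to make.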

rquitales commented 1 year ago

I suspect it's the second option, based on a quick look through the code base earlier, but I would need to dig deeper to figure out what is happening. The code 13 error from GCP is also quite broad and says little about what is really going wrong.

mikhailshilkov commented 1 year ago

@rquitales We have a policy to mark any disabled tests as a p1, and this matters especially now that we are working to launch GCP v7. Could you please investigate it further?

rquitales commented 1 year ago

It appears that Cloud Storage buckets are eventually consistent, so attempting to create a Cloud Function with the bucket as a trigger almost immediately after creating the bucket can result in a failed function deployment.

A workaround that I've found to work consistently is to re-run pulumi up whenever the Cloud Function fails to deploy. The second pulumi up invocation resulted in a successful function deployment in the 20 test runs attempted.
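The re-run workaround could be scripted roughly as follows. This is only a sketch, not what the test harness does: `run_with_rerun`, `max_runs`, and the injectable `runner` are hypothetical names introduced here for illustration.

```python
import subprocess


def run_with_rerun(command, max_runs=2, runner=subprocess.run):
    """Run command, re-running it while it exits non-zero.

    Returns the number of runs it took to succeed; raises RuntimeError
    if the command still fails after max_runs runs.
    """
    for run in range(1, max_runs + 1):
        result = runner(command)
        if result.returncode == 0:
            return run
    raise RuntimeError(f"{command!r} still failing after {max_runs} runs")
```

Calling `run_with_rerun(["pulumi", "up", "--yes"])` would run the deployment and repeat it once if the first attempt fails, mirroring the manual second pulumi up described above. It does not wait between runs, which assumes the time taken by the first attempt is enough for the bucket to become visible.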