gygitlab opened this issue 2 years ago · Open
Hi @grantyoung. We should already retry when APIs return this error. Would you mind sharing your debug log?
We run into this in our nightly tests a lot. It's definitely not service-specific, I'm gonna reclassify as provider-wide.
Discussion from triage: a possible way to fix this issue is to implement a retry in the provider.
We already have a retry in the provider for exactly this case: https://github.com/hashicorp/terraform-provider-google/blob/3cfedbb9f6e8b020ed3ff94179ce631ee92aefc2/google/transport/error_retry_predicates.go#L120

Based on test logs it looks like the retry gets called repeatedly throughout a test - presumably until some limit is hit (hopefully not a timeout). I'll look into whether it's possible to add backoff and jitter if those aren't already present, or increase the number of retries / timeout.
I noticed in TestAccComputeInstance_resourcePolicyUpdate (in this execution) we're hitting a context deadline really early: 2m15s instead of the 20m or so we'd expect. Maybe we're attaching a short deadline to the retry transport?
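To make the backoff/jitter and deadline discussion above concrete, here is a minimal, self-contained sketch of a retry loop with exponential backoff and jitter, and of how an outer context deadline caps the whole retry budget no matter how many attempts remain. This is not the provider's actual implementation (the real predicate lives in error_retry_predicates.go, linked above); the helper names and the string-matching predicate are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"strings"
	"time"
)

// isSubnetworkUnreadyError reports whether err looks like the transient
// "resource 'subnetworks/...' is not ready" 400 returned while another
// operation (e.g. GKE adding secondary ranges) is still modifying the
// subnetwork. Illustrative only; the provider's real predicate differs.
func isSubnetworkUnreadyError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "is not ready")
}

// retryWithBackoff keeps calling op while it returns a retryable error,
// sleeping between attempts with exponential backoff plus jitter, until
// op succeeds, returns a non-retryable error, or ctx expires.
func retryWithBackoff(ctx context.Context, op func() error) error {
	backoff := 1 * time.Second
	const maxBackoff = 30 * time.Second
	for {
		err := op()
		if err == nil || !isSubnetworkUnreadyError(err) {
			return err
		}
		// Add up to 50% jitter so resources created in parallel don't
		// all retry in lockstep against the same subnetwork.
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff/2)))
		select {
		case <-ctx.Done():
			// The surrounding deadline caps the whole retry budget,
			// regardless of how many retries would otherwise remain.
			return fmt.Errorf("giving up (%v); last error: %w", ctx.Err(), err)
		case <-time.After(sleep):
		}
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	// A ~2m deadline (as observed in the failing test) cuts retries off
	// long before a 20m per-resource create timeout would.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	attempts := 0
	err := retryWithBackoff(ctx, func() error {
		attempts++
		// Simulate the transient 400 clearing after a few attempts.
		if attempts < 4 {
			return errors.New("The resource 'projects/p/regions/us-central1/subnetworks/default' is not ready")
		}
		return nil
	})
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

If a ~2m deadline really is being attached to the retry transport somewhere, it would produce exactly this kind of early give-up even though the resource's own create timeout is closer to 20m.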
Tentatively assigning test-failure-10
Noticed recently we were getting some "'subnetworks/default' is not ready" errors in our Terraform runs on new environments:
We haven't seen this before, but I think it's happening because we have both VMs and a GKE cluster being created here at the same time.
When the GKE cluster starts to be created, it adds its additional ranges to the target subnet, which in turn "locks" the subnet for some time, preventing other resources from being created against it.
Rerunning the apply works fine, so it feels like this could be handled more gracefully, either by doing some retries or by holding the cluster build until other resources are done. We could work around this on our end, but at the same time it doesn't seem unreasonable to deploy both a cluster and other resources to the same subnet, so I thought it was worth raising.
Terraform Version
1.1.4
Affected Resource(s)
Terraform Configuration Files
Expected Behavior
The provider should gracefully handle any timing clashes caused by the cluster when other resources share the same subnet.
Actual Behavior
The provider creates the cluster at the same time as the VMs. As a result, the VMs get a 400 error from the API while the cluster edits the subnet to add more ranges.
Steps to Reproduce
Important Factoids
References
#10585 - Similar issue
b/300616739