Open jgrobbel opened 3 months ago
Hi @jgrobbel !
Provide us with the Terraform code (resource(s), environment variables, locals, etc. you are using) to reproduce this issue so we can see what is happening
This really feels like an API issue where they're returning an incorrect error code. 429 is "Too Many Requests", which is certainly not the situation being reported here.
@ggtisc let's forward this to the service team and see what they have to say
Thank you for looking into it. Here are the GCP docs on error codes from quota exhaustion: https://cloud.google.com/docs/quotas/troubleshoot#exceeding_quota_values
Looks like a mix of 429 and gRPC error codes, which is quite confusing. The API should probably be returning 403 QUOTA_EXCEEDED in this situation, although I note 403 RATE_LIMIT_EXCEEDED is also an option. So this GCP Terraform provider probably needs to look at both code and status to decide if the request is retryable.
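The suggestion above could be sketched as a retry predicate that consults both fields. This is a hypothetical simplification for illustration — the real provider inspects a `googleapi.Error`, and the `apiError` type and `isRetryable` function here are invented names:

```go
package main

import "fmt"

// apiError mirrors the two fields the provider would look at on a
// Google API error (hypothetical simplification).
type apiError struct {
	Code   int    // HTTP status code, e.g. 429
	Status string // gRPC-style status, e.g. "RESOURCE_EXHAUSTED"
}

// isRetryable checks both the HTTP code and the status string: a 429
// carrying a quota status is a hard allocation limit and should not be
// retried, while a plain 429 rate limit is worth retrying.
func isRetryable(err apiError) bool {
	if err.Code == 429 {
		switch err.Status {
		case "RESOURCE_EXHAUSTED", "QUOTA_EXCEEDED":
			return false // allocation quota: retrying will never succeed
		default:
			return true // transient rate limiting
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryable(apiError{Code: 429, Status: "RESOURCE_EXHAUSTED"})) // false
	fmt.Println(isRetryable(apiError{Code: 429, Status: "Too Many Requests"}))  // true
}
```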
We've been down this path before and it only makes things more brittle. It's possible, but if there are other options that return a more correct code, those should be explored first.
Thank you for the details. I forwarded the internal version of this issue to the BigQuery Analytics Hub team to get more details.
@wj-chen Any updates to share? This is quite a problem for us to handle without extra logic, given this fails silently.
@jgrobbel There is a new update from today (August 5) -
In this case, we should change the error code to FAILED_PRECONDITION, as the problem here is similar to the example in the gRPC docs: 'if an "rmdir" fails because the directory is non-empty, FAILED_PRECONDITION should be returned since the client should not retry unless the files are deleted from the directory.' In our case, the client should not retry unless some Data Exchange or Listing is deleted to make room for the new one, or the limit is raised. I will change the error code from RESOURCE_EXHAUSTED to FAILED_PRECONDITION.
I checked that the Data Exchange resource in Terraform uses the default retry predicates, which treat FAILED_PRECONDITION as non-retriable, so the error code mapping update should solve this issue: https://github.com/hashicorp/terraform-provider-google/blob/64b124a947da3d1e867840fbcb965a0eb3233310/google/transport/error_retry_predicates.go#L25.
API-side changes may take some time to be released and I will update here once there is progress.
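To see why the mapping change fixes the retry loop, here is a minimal sketch. It assumes the upstream fix surfaces FAILED_PRECONDITION as HTTP 400 (per Google's usual gRPC-to-HTTP mapping), and `isCommonRetryableCode` is a simplified, invented stand-in for the provider's default code-based retry predicate:

```go
package main

import "fmt"

// isCommonRetryableCode is a simplified stand-in for the provider's
// default retry predicate, which retries only transient HTTP codes.
func isCommonRetryableCode(code int) bool {
	switch code {
	case 429, 500, 502, 503:
		return true
	}
	return false
}

func main() {
	// Today: quota exhaustion surfaces as 429, so it is retried indefinitely.
	fmt.Println(isCommonRetryableCode(429)) // true
	// After the API-side fix: FAILED_PRECONDITION maps to HTTP 400,
	// which the default predicate treats as non-retriable.
	fmt.Println(isCommonRetryableCode(400)) // false
}
```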
Thank you for the update. I look forward to hearing when the API changes are released.
Terraform Version & Provider Version(s)
User-Agent: Terraform/1.4.5 (+https://www.terraform.io) Terraform-Plugin-SDK/2.33.0 terraform-provider-google/5.23.0
Affected Resource(s)
BigQuery Analytics Hub DataExchange (but likely others)
tf_provider_addr=registry.terraform.io/hashicorp/google tf_resource_type=google_bigquery_analytics_hub_data_exchange tf_rpc=ApplyResourceChange
Terraform Configuration
This hits the GCP quota on the number of Data Exchanges per project per region, which is 500 by default.
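A minimal, hypothetical configuration that would hit that limit (the project, names, and count are illustrative, not the reporter's actual stack):

```hcl
# Hypothetical repro: exceed the default per-project, per-region quota of 500.
resource "google_bigquery_analytics_hub_data_exchange" "repro" {
  count            = 501
  project          = "my-project" # placeholder
  location         = "US"
  data_exchange_id = "repro_exchange_${count.index}"
  display_name     = "Repro Exchange ${count.index}"
}
```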
Debug Output
Expected Behavior
If we are out of quota for some resource, Terraform should create the other resources in the stack, update the state file, and then error out gracefully. Instead, Terraform remained in a retry loop and never updated the state file.
This then led to further conflict errors when it attempted to re-run.
N.B. This time it correctly does not retry.
Actual Behavior
If you look at the debug snippet, the return code is indeed 429, but the status is not "Too Many Requests". That would be an API rate-limit error worth retrying after a few seconds.
In this case the resource is out of quota (RESOURCE_EXHAUSTED) and should not be retried.
Steps to reproduce
I think the bug is visually pretty clear: the error code is not mapped properly. If you wanted to repro it, I would lower the quota to something like 10 and then create 11 DataExchange resources.
Important Factoids
No response
References
No response
b/347077253