hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

API response code 429 "RESOURCE_EXHAUSTED" incorrectly classified as retryable #18409

Open jgrobbel opened 3 months ago

jgrobbel commented 3 months ago


Terraform Version & Provider Version(s)

User-Agent: Terraform/1.4.5 (+https://www.terraform.io) Terraform-Plugin-SDK/2.33.0 terraform-provider-google/5.23.0

Affected Resource(s)

BigQuery Analytics Hub Data Exchange (but likely others)

tf_provider_addr=registry.terraform.io/hashicorp/google tf_resource_type=google_bigquery_analytics_hub_data_exchange tf_rpc=ApplyResourceChange

Terraform Configuration

This hits the GCP quota on the number of Data Exchanges per project per region, which is 500 by default.

Debug Output

2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: 2024/06/12 14:49:15 [DEBUG] Dismissed an error as retryable. Retryable error code 429 - googleapi: got HTTP response code 429 with body: HTTP/2.0 429 Too Many Requests
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Cache-Control: private
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Content-Type: application/json; charset=UTF-8
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Date: Wed, 12 Jun 2024 13:49:15 GMT
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Server: scaffolding on HTTPServer2
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Vary: Origin
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Vary: X-Origin
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: Vary: Referer
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: X-Frame-Options: SAMEORIGIN
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: X-Xss-Protection: 0
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:
2024-06-12T14:49:15.401+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: {
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:   "error": {
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:     "code": 429,
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:     "message": "Number of Data Exchanges exceeds the per-project limit (500).",
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:     "status": "RESOURCE_EXHAUSTED"
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5:   }
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: }
2024-06-12T14:49:15.402+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: 2024/06/12 14:49:15 [DEBUG] Retry Transport: Waiting 27.5s before trying request again

Expected Behavior

If we run out of quota for some resource, Terraform should still create the other resources in the stack, update the state file, and then error out gracefully. Instead, Terraform remained in a retry loop and never updated the state file.

This then led to further conflict errors when the apply was re-run:

       "error": {
         "code": 409,
         "message": "Already Exists: Dataset <redacted>_us_central1",
         "errors": [
           {
             "message": "Already Exists: Dataset <redacted>_us_central1",
             "domain": "global",
             "reason": "duplicate"
           }
         ],
         "status": "ALREADY_EXISTS"
       }
     }
     2024-06-12T14:48:30.856+0100 [DEBUG] provider.terraform-provider-google_v5.23.0_x5: 2024/06/12 14:48:30 [DEBUG] Retry Transport: Stopping retries, last request failed with non-retryable error: googleapi: got HTTP response code 409 with body: HTTP/2.0 409 Conflict

N.B. This time it correctly does not retry.

Actual Behavior

If you look at the debug snippet, the HTTP response code is indeed 429, but the status is "RESOURCE_EXHAUSTED", not a transient "Too Many Requests" condition. A genuine rate-limit error would indeed be worth retrying after a few seconds.

In this case the project is out of quota (RESOURCE_EXHAUSTED), and the request should not be retried.

Steps to reproduce

I think the bug is visually pretty clear: the error code is not mapped properly. To reproduce, lower the quota to something like 10 and then create 11 Data Exchange resources:

  "resource": {
    "google_bigquery_analytics_hub_data_exchange": {
      "testbigquerystack_dataExchange_1ADE7FAA": {
        "//": {
          "metadata": {
            "path": "test/dataExchange",
            "uniqueId": "test_dataExchange_1ADE7FAA"
          }
        },
        "data_exchange_id": "${var.data_exchange}",
        "display_name": "${var.data_exchange}",
        "lifecycle": {
          "ignore_changes": [
            "data_exchange_id",
            "display_name"
          ],
          "prevent_destroy": true
        },
        "location": "${var.multiRegion}",
        "project": "${var.gcp_project_id}"
      }
    },

Important Factoids

No response

References

No response

b/347077253

ggtisc commented 3 months ago

Hi @jgrobbel !

Please provide the Terraform code (resource(s), environment variables, locals, etc.) you are using so we can reproduce this issue and see what is happening.

slevenick commented 3 months ago

This really feels like an API issue where they're returning an incorrect error code. 429 means "Too Many Requests", which is certainly not what is happening here.

slevenick commented 3 months ago

@ggtisc let's forward this to the service team and see what they have to say

jgrobbel commented 3 months ago

@ggtisc lets forward this to the service team and see what they have to say

Thank you for looking into it. Here are the GCP docs on error codes from quota exhaustion: https://cloud.google.com/docs/quotas/troubleshoot#exceeding_quota_values

Looks like a mix of HTTP 429 and gRPC error codes, which is quite confusing. The API should probably be returning 403 QUOTA_EXCEEDED in this situation.

Although I note 403 RATE_LIMIT_EXCEEDED is also an option. So this GCP terraform provider probably needs to look at both code and status to decide if the request is retry-able.

slevenick commented 3 months ago

So this GCP terraform provider probably needs to look at both code and status to decide if the request is retry-able.

We've been down this path and it only makes things more brittle. It's possible, but if there are other options to return a more correct code those should be explored.
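For illustration, combining both signals could look like the following minimal sketch. This is a hypothetical predicate, not the provider's actual implementation; the `apiError` struct and `isRetryable` helper are illustrative names, and the transient-code list is an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the "error" object of a Google API JSON error body.
type apiError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
	Status  string `json:"status"`
}

// isRetryable is a hypothetical predicate that consults the gRPC-style
// status string before falling back to the HTTP code alone. With a
// code-only check, 429/RESOURCE_EXHAUSTED would be retried indefinitely.
func isRetryable(e apiError) bool {
	switch e.Status {
	case "RESOURCE_EXHAUSTED", "FAILED_PRECONDITION":
		// Quota and precondition failures cannot succeed on retry
		// without user action (freeing resources or raising the quota).
		return false
	}
	// Transient HTTP codes that are normally worth retrying.
	return e.Code == 429 || e.Code == 500 || e.Code == 502 || e.Code == 503
}

func main() {
	body := []byte(`{"error":{"code":429,"message":"Number of Data Exchanges exceeds the per-project limit (500).","status":"RESOURCE_EXHAUSTED"}}`)
	var wrapper struct {
		Error apiError `json:"error"`
	}
	if err := json.Unmarshal(body, &wrapper); err != nil {
		panic(err)
	}
	fmt.Println(isRetryable(wrapper.Error)) // prints "false": do not retry
}
```

The brittleness slevenick mentions is visible here: the predicate now depends on the API populating `status` consistently, which varies across services.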

wj-chen commented 3 months ago

Thank you for the details. I forwarded the internal version of this issue to the BigQuery Analytics Hub team to get more details.

jgrobbel commented 2 months ago

Thank you for the details. I forwarded the internal version of this issue to the BigQuery Analytics Hub team to get more details.

@wj-chen Any updates to share? This is quite a problem for us to handle without extra logic, given this fails silently.

wj-chen commented 1 month ago

@jgrobbel There is a new update from today (August 5) -

In this case, we should change the error code to FAILED_PRECONDITION as the problem here is similar to eg mentioned in grpc docs: 'if an "rmdir" fails because the directory is non-empty, FAILED_PRECONDITION should be returned since the client should not retry unless the files are deleted from the directory.' In our case, client should not retry unless some Data Exchange or Listing is deleted to make up space for new one or limit for these should be updated. I will change the error code from RESOURCE_EXHAUSTED to FAILED_PRECONDITION.

I checked that the Data Exchange resource in Terraform uses the default retry predicates which will treat FAILED_PRECONDITION as non-retriable so the error code mapping update should solve this issue: https://github.com/hashicorp/terraform-provider-google/blob/64b124a947da3d1e867840fbcb965a0eb3233310/google/transport/error_retry_predicates.go#L25.

API-side changes may take some time to be released and I will update here once there is progress.
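To see why the proposed API-side change resolves the retry loop, here is a simplified sketch of a code-only default retry predicate. This is an illustration, not the provider's actual `error_retry_predicates.go` logic, and it assumes the standard gRPC-to-HTTP mapping in which FAILED_PRECONDITION surfaces as HTTP 400:

```go
package main

import "fmt"

// defaultRetryable is a simplified stand-in for a code-only default
// retry predicate: retry on transient HTTP codes, nothing else.
func defaultRetryable(httpCode int) bool {
	return httpCode == 429 || httpCode == 500 || httpCode == 502 || httpCode == 503
}

func main() {
	// Today: the quota error arrives as HTTP 429, so it is retried forever.
	fmt.Println(defaultRetryable(429)) // prints "true"

	// After the API-side fix: FAILED_PRECONDITION maps to HTTP 400,
	// which a code-only predicate correctly treats as non-retryable.
	fmt.Println(defaultRetryable(400)) // prints "false"
}
```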

jgrobbel commented 1 month ago

@jgrobbel There is a new update from today (August 5) - [...] API-side changes may take some time to be released and I will update here once there is progress.

Thank you for the update. I look forward to hearing when the API changes are released.