Open fcrespofastly opened 5 months ago
I can see the context was canceled, so I believe that is why the Google Cloud Storage Go client is not retrying the requests.
My next step is to figure out how the context cancellation happens and how it impacted the entire Vault cluster.
Describe the bug When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance gets sealed. In a fairly large environment with lots of leases, there is a chance that this happens to all the Vault replicas within a short period of time, causing the whole cluster to seal. This happened to us in a production cluster when GCS returned just a few (5) 503s. A 503 is considered a retryable error, and the Go client used in this Vault version (v1.30.1) should retry by default unless the context is canceled, which I could not spot.
To Reproduce Hard to reproduce as it depends on a third party being unavailable.
Expected behavior The Go GCS client library retries the request, as both the operation and the status code are considered retryable.
Environment:
Vault Server Version (retrieve with vault status): 1.15.4
Vault CLI Version (retrieve with vault version): v.1.15.6 (irrelevant)
Vault server configuration file(s):
Additional context The Go client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the request should be retried unless the context is canceled, which I could not spot. Cancellation also seems unlikely, as the Restore function's error would be different, as per the code:
We only see 3 503s, 1 per replica.