hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Missing retry logic when using GCS backend might lead to cluster unavailable #26673

Open fcrespofastly opened 5 months ago

fcrespofastly commented 5 months ago

Describe the bug
When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance gets sealed. In a fairly large environment with lots of leases, there is a chance this happens to all of the Vault replicas in a short period of time, causing the whole cluster to seal. This has happened to us in a production cluster, with GCS returning just a few (5) 503s. A 503 is considered a retryable error, and the golang client used in this Vault version (v1.30.1) should retry it by default unless the context is canceled, which I could not spot.

To Reproduce
Hard to reproduce, as it depends on a third party (GCS) being unavailable.

Expected behavior
The golang GCS client library retries the request, since both the operation and the status code are considered retryable.
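
For reference, here is a minimal sketch (not Vault code; the bucket name, object key, and backoff values are made up) of what that expectation looks like with the cloud.google.com/go/storage client. Object reads are idempotent, so the default RetryIdempotent policy should already cover 503s; the Retryer call only spells the policy and backoff out:

    package main

    import (
        "context"
        "fmt"
        "io"
        "log"
        "time"

        "cloud.google.com/go/storage"
        "github.com/googleapis/gax-go/v2"
    )

    func main() {
        ctx := context.Background()

        client, err := storage.NewClient(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Hypothetical bucket name. Reads are idempotent, so 503s should be
        // retried with exponential backoff even without this call; Retryer
        // just makes the policy and backoff explicit.
        bucket := client.Bucket("my-vault-bucket").Retryer(
            storage.WithPolicy(storage.RetryIdempotent),
            storage.WithBackoff(gax.Backoff{
                Initial:    500 * time.Millisecond,
                Max:        30 * time.Second,
                Multiplier: 2,
            }),
        )

        // One object read, standing in for Vault reading a single lease entry
        // (hypothetical key).
        r, err := bucket.Object("logical/example-lease").NewReader(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer r.Close()

        data, err := io.ReadAll(r)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("read %d bytes\n", len(data))
    }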

Environment:

Vault server configuration file(s):

            api_addr     = "https://vault.{{ .Release.Namespace }}.svc:8200"
            cluster_addr = "https://$(POD_IP_ADDR):8201"

            log_level = "{{ .Values.logLevel }}"

            plugin_directory = "{{ .Values.pluginDirectory }}"

            ui = true

            seal "gcpckms" {
              project    = "{{ include "elvlib.cluster.gcpProjectId" $ }}"
              region     = "{{ include "elvlib.cluster.gcpRegion" $ }}"
              key_ring   = "{{ include "vault.config.keyRing" $ }}"
              crypto_key = "{{ .Values.cryptoKey }}"
            }

            storage "gcs" {
              bucket     = "{{ include "vault.config.bucket" $ }}"
              ha_enabled = "true"
            }

            telemetry "prometheus" {
              prometheus_retention_time = "60s"
              disable_hostname = true
              usage_gauge_period = "15s"
            }

            listener "tcp" {
              address       = "127.0.0.1:8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
            }

            listener "tcp" {
              address       = "$(POD_IP_ADDR):8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
              telemetry {
                unauthenticated_metrics_access = true
              }
            }

Additional context
The golang client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the client should be retrying unless the context is canceled, which I could not spot (and it does not seem to be the case, since the Restore function's error would be different, per the code below):

    defer func() {
        // Turn off restore mode. We can do this safely without the lock because
        // if restore mode finished successfully, restore mode was already
        // disabled with the lock. In an error state, this will allow the
        // Stop() function to shut everything down.
        atomic.StoreInt32(m.restoreMode, 0)

        switch {
        case retErr == nil:
        case strings.Contains(retErr.Error(), context.Canceled.Error()):
            // Don't run error func because we lost leadership
            m.logger.Warn("context canceled while restoring leases, stopping lease loading")
            retErr = nil
        case errwrap.Contains(retErr, ErrBarrierSealed.Error()):
            // Don't run error func because we're likely already shutting down
            m.logger.Warn("barrier sealed while restoring leases, stopping lease loading")
            retErr = nil
        default:
            m.logger.Error("error restoring leases", "error", retErr)
            if errorFunc != nil {
                errorFunc()
            }
        }
    }()
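
To make that interaction concrete, here is a small self-contained sketch (not Vault code and a simplification of the real storage client's retry loop, so the classification and error wrapping are assumptions) showing how a loop that treats 503 as retryable still gives up as soon as its context is canceled:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "time"

        "google.golang.org/api/googleapi"
    )

    // isRetryable approximates the client's classification: a 503 on an
    // idempotent operation is retryable.
    func isRetryable(err error) bool {
        var apiErr *googleapi.Error
        return errors.As(err, &apiErr) && apiErr.Code == http.StatusServiceUnavailable
    }

    // retryWithContext keeps calling fn with exponential backoff, but stops
    // immediately once ctx is canceled, even if the last error was retryable.
    func retryWithContext(ctx context.Context, fn func() error) error {
        backoff := 100 * time.Millisecond
        for {
            err := fn()
            if err == nil || !isRetryable(err) {
                return err
            }
            select {
            case <-ctx.Done():
                // No further attempts: the retryable 503 surfaces alongside
                // the cancellation, depending on how it is wrapped.
                return fmt.Errorf("%w: %v", ctx.Err(), err)
            case <-time.After(backoff):
                backoff *= 2
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        cancel() // simulate the restore's context being canceled mid-flight

        err := retryWithContext(ctx, func() error {
            return &googleapi.Error{Code: http.StatusServiceUnavailable, Message: "backend unavailable"}
        })
        fmt.Println(err) // the 503 is never retried; the cancellation wins
    }

Whether the surfaced error then mentions the cancellation or only the 503 depends entirely on how it is wrapped, which is what the string match in the deferred handler above relies on.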
Google Cloud Support confirmed we are not receiving a lot of 503s, and our own monitoring confirms the same. During the time window in which each of the three replicas went through the loop of:

- acquire lock
- restore leases
- fail to read a lease with a 503

we only see three 503s, one per replica.

fcrespofastly commented 5 months ago

Update

I can see the context was canceled, so I believe that is why the Google storage Go client is not retrying the requests.

My next step is to figure out the context cancellation process and how it impacted the entire Vault cluster.
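
Not from the original report, but one way to illustrate what that investigation could check: whether the error surfaced by a failed GCS read actually wraps context.Canceled (visible to errors.Is) or only carries the 503's text, since the deferred handler quoted above matches on the error string. The wrapping in the second case below is hypothetical:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "strings"

        "google.golang.org/api/googleapi"
    )

    // classify reports how a restore error would be interpreted, both by an
    // unwrap-aware errors.Is check and by the string match used in the
    // deferred handler quoted earlier in this issue.
    func classify(err error) {
        fmt.Printf("error:              %v\n", err)
        fmt.Printf("errors.Is canceled: %v\n", errors.Is(err, context.Canceled))
        fmt.Printf("string match:       %v\n\n",
            strings.Contains(err.Error(), context.Canceled.Error()))
    }

    func main() {
        // Case 1: the cancellation is wrapped and its text is preserved.
        classify(fmt.Errorf("failed to read lease: %w", context.Canceled))

        // Case 2 (hypothetical wrapping): only the retryable 503 surfaces,
        // even if a canceled context is what stopped the client retrying it.
        classify(fmt.Errorf("failed to read lease: %w", &googleapi.Error{
            Code:    http.StatusServiceUnavailable,
            Message: "backend unavailable",
        }))
    }

In the second case neither check recognizes a cancellation, which in the quoted handler would mean falling through to the default branch.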