hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Missing retry logic when using GCS backend might lead to cluster unavailable #26673

Open fcrespofastly opened 5 months ago

fcrespofastly commented 5 months ago

Describe the bug
When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance gets sealed. In a fairly large environment with lots of leases, there is a chance this happens to all of the Vault replicas in a short period of time, causing the whole cluster to seal. This has happened to us in a production cluster, with GCS returning just a few (5) 503s. A 503 is considered a retryable error, and the golang client used in this Vault version (v1.30.1) should retry it by default unless the context is canceled, which I could not spot.

To Reproduce
Hard to reproduce, as it depends on a third party (GCS) being unavailable.

Expected behavior
The golang GCS client library retries the request, since both the operation and the status code are considered retryable.
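
For reference, here is a minimal sketch (not Vault code; the bucket name, object key, and backoff values are made up) of what that expectation looks like with the cloud.google.com/go/storage client. Object reads are idempotent, so the default RetryIdempotent policy should already cover 503s; the Retryer call only spells the policy and backoff out:

    package main

    import (
        "context"
        "fmt"
        "io"
        "log"
        "time"

        "cloud.google.com/go/storage"
        "github.com/googleapis/gax-go/v2"
    )

    func main() {
        ctx := context.Background()

        client, err := storage.NewClient(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        // Hypothetical bucket name. Reads are idempotent, so 503s should be
        // retried with exponential backoff even without this call; Retryer
        // just makes the policy and backoff explicit.
        bucket := client.Bucket("my-vault-bucket").Retryer(
            storage.WithPolicy(storage.RetryIdempotent),
            storage.WithBackoff(gax.Backoff{
                Initial:    500 * time.Millisecond,
                Max:        30 * time.Second,
                Multiplier: 2,
            }),
        )

        // One object read, standing in for Vault reading a single lease entry
        // (hypothetical key).
        r, err := bucket.Object("logical/example-lease").NewReader(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer r.Close()

        data, err := io.ReadAll(r)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("read %d bytes\n", len(data))
    }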

Environment:

Vault server configuration file(s):

            api_addr     = "https://vault.{{ .Release.Namespace }}.svc:8200"
            cluster_addr = "https://$(POD_IP_ADDR):8201"

            log_level = "{{ .Values.logLevel }}"

            plugin_directory = "{{ .Values.pluginDirectory }}"

            ui = true

            seal "gcpckms" {
              project    = "{{ include "elvlib.cluster.gcpProjectId" $ }}"
              region     = "{{ include "elvlib.cluster.gcpRegion" $ }}"
              key_ring   = "{{ include "vault.config.keyRing" $ }}"
              crypto_key = "{{ .Values.cryptoKey }}"
            }

            storage "gcs" {
              bucket     = "{{ include "vault.config.bucket" $ }}"
              ha_enabled = "true"
            }

            telemetry "prometheus" {
              prometheus_retention_time = "60s"
              disable_hostname = true
              usage_gauge_period = "15s"
            }

            listener "tcp" {
              address       = "127.0.0.1:8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
            }

            listener "tcp" {
              address       = "$(POD_IP_ADDR):8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
              telemetry {
                unauthenticated_metrics_access = true
              }
            }

Additional context
The golang client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the client should be retrying unless the context is canceled, which I could not spot (and it does not seem to be the case, since the Restore function's error would be different, per the code below):

    defer func() {
        // Turn off restore mode. We can do this safely without the lock because
        // if restore mode finished successfully, restore mode was already
        // disabled with the lock. In an error state, this will allow the
        // Stop() function to shut everything down.
        atomic.StoreInt32(m.restoreMode, 0)

        switch {
        case retErr == nil:
        case strings.Contains(retErr.Error(), context.Canceled.Error()):
            // Don't run error func because we lost leadership
            m.logger.Warn("context canceled while restoring leases, stopping lease loading")
            retErr = nil
        case errwrap.Contains(retErr, ErrBarrierSealed.Error()):
            // Don't run error func because we're likely already shutting down
            m.logger.Warn("barrier sealed while restoring leases, stopping lease loading")
            retErr = nil
        default:
            m.logger.Error("error restoring leases", "error", retErr)
            if errorFunc != nil {
                errorFunc()
            }
        }
    }()
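
To make that interaction concrete, here is a small self-contained sketch (not Vault code and a simplification of the real storage client's retry loop, so the classification and error wrapping are assumptions) showing how a loop that treats 503 as retryable still gives up as soon as its context is canceled:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "time"

        "google.golang.org/api/googleapi"
    )

    // isRetryable approximates the client's classification: a 503 on an
    // idempotent operation is retryable.
    func isRetryable(err error) bool {
        var apiErr *googleapi.Error
        return errors.As(err, &apiErr) && apiErr.Code == http.StatusServiceUnavailable
    }

    // retryWithContext keeps calling fn with exponential backoff, but stops
    // immediately once ctx is canceled, even if the last error was retryable.
    func retryWithContext(ctx context.Context, fn func() error) error {
        backoff := 100 * time.Millisecond
        for {
            err := fn()
            if err == nil || !isRetryable(err) {
                return err
            }
            select {
            case <-ctx.Done():
                // No further attempts: the retryable 503 surfaces alongside
                // the cancellation, depending on how it is wrapped.
                return fmt.Errorf("%w: %v", ctx.Err(), err)
            case <-time.After(backoff):
                backoff *= 2
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        cancel() // simulate the restore's context being canceled mid-flight

        err := retryWithContext(ctx, func() error {
            return &googleapi.Error{Code: http.StatusServiceUnavailable, Message: "backend unavailable"}
        })
        fmt.Println(err) // the 503 is never retried; the cancellation wins
    }

Whether the surfaced error then mentions the cancellation or only the 503 depends entirely on how it is wrapped, which is what the string match in the deferred handler above relies on.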
Google Cloud Support confirmed we are not receiving a lot of 503s, and our own monitoring confirms the same. During the time window in which each of the three replicas went through the loop of:

- acquire lock
- restore leases
- fail to read a lease with a 503

we only see three 503s, one per replica.

fcrespofastly commented 5 months ago

Update

I can see the context was canceled, so I believe that is why the Google storage Go client is not retrying the requests.

My next step is to figure out the context cancellation process and how it impacted the entire Vault cluster.
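
Not from the original report, but one way to illustrate what that investigation could check: whether the error surfaced by a failed GCS read actually wraps context.Canceled (visible to errors.Is) or only carries the 503's text, since the deferred handler quoted above matches on the error string. The wrapping in the second case below is hypothetical:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "strings"

        "google.golang.org/api/googleapi"
    )

    // classify reports how a restore error would be interpreted, both by an
    // unwrap-aware errors.Is check and by the string match used in the
    // deferred handler quoted earlier in this issue.
    func classify(err error) {
        fmt.Printf("error:              %v\n", err)
        fmt.Printf("errors.Is canceled: %v\n", errors.Is(err, context.Canceled))
        fmt.Printf("string match:       %v\n\n",
            strings.Contains(err.Error(), context.Canceled.Error()))
    }

    func main() {
        // Case 1: the cancellation is wrapped and its text is preserved.
        classify(fmt.Errorf("failed to read lease: %w", context.Canceled))

        // Case 2 (hypothetical wrapping): only the retryable 503 surfaces,
        // even if a canceled context is what stopped the client retrying it.
        classify(fmt.Errorf("failed to read lease: %w", &googleapi.Error{
            Code:    http.StatusServiceUnavailable,
            Message: "backend unavailable",
        }))
    }

In the second case neither check recognizes a cancellation, which in the quoted handler would mean falling through to the default branch.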