hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Blank screen / partial load issue during Vault upgrades #27770

Open fancybear-dev opened 1 month ago

fancybear-dev commented 1 month ago

Describe the bug When upgrading Vault, there is intermittent downtime for users because the UI sometimes does not load. The browser shows no network errors, only console errors. Once the upgrade has completed, the issue is gone.

To Reproduce Steps to reproduce the behavior:

  1. Set up a Vault cluster of 3 or 5 nodes with Raft storage, using e.g. version 1.17.0
  2. Upgrade to 1.17.1
  3. Refresh (F5) the UI repeatedly until the issue occurs intermittently

The errors match this old issue: https://github.com/hashicorp/vault/issues/13357

Sometimes there are random exceptions and the console errors differ slightly. When these different errors occur, the page is sometimes not entirely blank but loads badly styled, e.g. with an incorrectly sized logo.

[screenshot: browser console errors]

There are a few critical notes here. We use a Google managed instance group (MIG) to spread the nodes across at most 3 zones. The upgrade cycle is therefore: 3 old -> 3 old + 2 new -> 1 old + 3 new -> 3 new. The load balancer health check is configured with /v1/sys/health?activecode=200&standbycode=200, so either the leader or a standby node (via redirection) serves the web requests.
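For context, a minimal Terraform sketch of that health check and backend wiring (resource and variable names here are illustrative, not our exact configuration; the MIG resource is assumed to exist elsewhere):

      # Illustrative sketch only. Both active (leader) and standby nodes return
      # HTTP 200 for this health check, so the backend treats every unsealed,
      # initialized node as healthy and routes requests to it.
      resource "google_compute_health_check" "vault" {
        name = "vault-health"

        https_health_check {
          port         = 8200
          request_path = "/v1/sys/health?activecode=200&standbycode=200"
        }
      }

      resource "google_compute_backend_service" "vault" {
        name          = "vault-backend"
        protocol      = "HTTPS"
        health_checks = [google_compute_health_check.vault.id]

        backend {
          # Regional MIG spanning at most 3 zones, defined elsewhere.
          group = google_compute_region_instance_group_manager.vault.instance_group
        }
      }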

Our best hypothesis so far is that it is somehow caused by how we route requests, but given the past issue it could also be Vault itself. Good to know: this was not the case before. We are unsure in which version this started, but we have used Vault in this configuration since version 1.15. We are, however, certain it has been happening for at least this year to date (since 1 Jan 2024).

This happens on Brave and Chrome, on the latest macOS, Ubuntu 22.04 LTS and Windows 11 canary releases.

Expected behavior No downtime during upgrades.

Environment: We are running 1.17.0 and upgrading to 1.17.1. The deployment is fully Docker based, using Google's Container-Optimized OS (COS) images.

Vault server configuration file(s):

      ui = true
      disable_mlock = true

      storage "raft" {
        path = "/vault/file"
        node_id = "NODE_ID"
        retry_join {
          auto_join               = "provider=gce tag_value=${resource_name_prefix}-vault zone_pattern=${zone_pattern}"
          auto_join_scheme        = "https"
          leader_tls_servername   = "${leader_tls_servername}"
          leader_ca_cert_file     = "/vault/certs/vault-ca.pem"
          leader_client_cert_file = "/vault/certs/vault-cert.pem"
          leader_client_key_file  = "/vault/certs/vault-key.pem"
        }
      }

      cluster_addr  = "https://LOCAL_IPV4:8201"
      api_addr      = "https://LOCAL_IPV4:8200"

      listener "tcp" {
        address            = "0.0.0.0:8200"
        tls_disable        = false
        tls_cert_file      = "/vault/certs/vault-cert.pem"
        tls_key_file       = "/vault/certs/vault-key.pem"
        tls_client_ca_file = "/vault/certs/vault-ca.pem"
      }

      seal "gcpckms" {
        project    = "${project}"
        region     = "${kms_region}"
        key_ring   = "${key_ring}"
        crypto_key = "${crypto_key}"
      }

      telemetry {
        stackdriver_project_id = "${project}"
        stackdriver_location = "${telemetry_region}"
        stackdriver_namespace = "${resource_name_prefix}-vault"
        disable_hostname = true
        enable_hostname_label = true
        filter_default = false
        prefix_filter = ["+vault.token", "-vault.expire", "+vault.expire.num_leases", "+vault.expire.renew-token", "+vault.audit", "+vault.core.leadership_lost", "+vault.core.leadership_setup_failed", "+vault.core.handle_login_request", "+vault.core.handle_request", "+vault.core.post_unseal"]
      }

      default_lease_ttl = "1h"
      max_lease_ttl     = "3h"


fancybear-dev commented 1 month ago

Did more testing; we have more and more reason to believe this is indeed GCP routing related, and not Vault.

fancybear-dev commented 1 month ago

We've found the root cause. The files the browser complains about regarding MIME type are in reality silent 404 errors. Our load balancer sends requests round robin to any healthy node that is unsealed and initialized. During an upgrade this means requests are sent to both version A and version B. When the UI of version A loads, it also expects the scripts of version A; otherwise it fails. We were under the (apparently mistaken) impression this would not be an issue given the request redirection of standby nodes. Routing only to the leader is also not an option, as the cloud-native load balancer we use detects the new leader too slowly. What we're currently testing is forcing routing to version A until all version A nodes are replaced, at which point everything automatically routes to version B. This, in theory, should result in our intended behaviour of no UI downtime during upgrades.

I must say I find this behaviour odd. It hasn't been the case before, and it can certainly be resolved in the Vault code. Are there any HashiCorp recommendations or best practices? Do you classify this as an issue / bug? It does make it harder, implementation wise, for engineers to ensure proper routing.

Updated: we settled on introducing cookie-based session affinity, which ensures requests within a session are always sent to the same node in the cluster. This still introduces some downtime, but in our testing it was as minimal as we could get compared to various alternatives. Literally a single F5 resolves it (as the browser is then routed to a new node, after the node it was using went offline due to the upgrade).
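For anyone running a similar setup, a minimal Terraform sketch of what that cookie-based session affinity looks like on the backend service (values are illustrative, not our exact configuration; the health check and MIG resources referenced here are assumed to exist elsewhere):

      # Illustrative sketch only: the same backend service as above, now with
      # generated-cookie session affinity so a browser session keeps hitting
      # one node and loads all UI assets from a single Vault version.
      resource "google_compute_backend_service" "vault" {
        name          = "vault-backend"
        protocol      = "HTTPS"
        health_checks = [google_compute_health_check.vault.id]

        session_affinity        = "GENERATED_COOKIE"
        affinity_cookie_ttl_sec = 3600  # illustrative TTL

        backend {
          group = google_compute_region_instance_group_manager.vault.instance_group
        }
      }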