hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Vault hang when started for the first time #10034

Closed omerlh closed 1 year ago

omerlh commented 3 years ago

Describe the bug: I installed Vault using the official chart, with the GCP KMS seal and the GCS storage backend.

Pod started as expected:

==> Vault server configuration:

      GCP KMS Crypto Key: <>
        GCP KMS Key Ring: <>
         GCP KMS Project: <>
          GCP KMS Region: global
             Api Address: http://<>:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
              Go Version: go1.14.7
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: trace
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: gcs (HA available)
                 Version: Vault v1.5.2
             Version Sha: 685fdfa60d607bca069c09d2d52b6958a7a2febd

But this is the only thing written to the log. Any operation against the API results in an error:

vault status
Error reading key status: context deadline exceeded

Expected behavior: Vault starts successfully.

Environment:

Vault server configuration file(s):

    disable_mlock = true
    ui = true
    log_level = "Trace"
    listener "tcp" {
      tls_disable = 1
      address = "[::]:8200"
      cluster_address = "[::]:8201"
      telemetry {
        unauthenticated_metrics_access = "true"
      }
    }
    storage "gcs" {
      bucket = "<>"
      ha_enabled = "true"
    }
    service_registration "kubernetes" {}
    # Example configuration for using auto-unseal, using Google Cloud KMS. The
    # GKMS keys must already exist, and the cluster must have a service account
    # that is authorized to access GCP KMS.
    seal "gcpckms" {
      project     = "<>"
      region      = "global"
      key_ring    = "<>"
      crypto_key  = "<>"
    }

    telemetry {
      prometheus_retention_time = "30s",
      disable_hostname = true
    }
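
A quick connectivity check may help when debugging a hang like this, since both the gcpckms seal and the gcs backend need the pod to reach Google APIs and obtain credentials. This is only a sketch: it assumes the pod is named vault-0, that credentials come from the GCE/GKE metadata server, and that the image's busybox wget/nslookup are available.

# Confirm the pod can obtain a GCP access token from the metadata server
# (needed by both the gcpckms seal and the gcs storage backend).
kubectl exec -it vault-0 -- wget -qO- --header 'Metadata-Flavor: Google' http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
# Confirm the pod can resolve (and so plausibly reach) the KMS and storage API endpoints.
kubectl exec -it vault-0 -- nslookup cloudkms.googleapis.com
kubectl exec -it vault-0 -- nslookup storage.googleapis.com
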
raskchanky commented 3 years ago

Hi @omerlh

The "context deadline exceeded" error makes me wonder if there's a connectivity issue with the server. What happens if you curl the seal status endpoint? e.g.

curl $VAULT_ADDR/v1/sys/seal-status

Does that succeed or hang? If it succeeds, what is the output?
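
If you don't have direct network access to the pod, a rough way to run the same check from inside it (a sketch, assuming the pod name vault-0 and using busybox wget, since the image may not ship curl):

# Query the seal status on the local listener, bypassing any Service in between.
kubectl exec -it vault-0 -- wget -qO- http://127.0.0.1:8200/v1/sys/seal-status

If even that call hangs, the server itself is stuck (often on its own outbound calls to storage or KMS); if it answers promptly, the problem is more likely between your client and the pod.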

omerlh commented 3 years ago

I'm running it inside the pod, and even the health check endpoint is failing...

raskchanky commented 3 years ago

I'm not an expert at Kubernetes, so forgive me if I get the terminology wrong. I'm just trying to see if the server is misconfigured somehow, and whether it's at all responsive to HTTP traffic of any form. I suspect this is a configuration issue and not a legitimate bug, but I can't tell from your config file alone. Are you running this on IPv6?

omerlh commented 3 years ago

Looks like some sort of timeout, because now I am seeing a lot of these errors:

2020-09-24T16:30:09.935Z [INFO]  core: stored unseal keys supported, attempting fetch
2020-09-24T16:30:09.961Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2020-09-24T16:30:11.640Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
2020-09-24T16:30:14.624Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery
2020-09-24T16:30:14.961Z [INFO]  core: stored unseal keys supported, attempting fetch
2020-09-24T16:30:14.990Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
2020-09-24T16:30:17.615Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery

Which is a lot easier to debug :)

pinglin commented 3 years ago

It might be an issue of tcp:8080 being blocked at the control plane, so the webhook isn't functional. I encountered exactly the same symptom as yours on a private GKE cluster. This thread gives a very good explanation and resolved my issue. See if it helps.
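
For reference, the usual fix on a private GKE cluster is a firewall rule that lets the control plane reach the node port the injector webhook listens on. This is only a sketch: the rule name, network, node tag, and master CIDR below are placeholders for your cluster, and the port should match whatever your injector actually uses.

# Allow the GKE control plane (master CIDR) to reach the webhook port on the nodes.
gcloud compute firewall-rules create allow-master-to-vault-webhook --network <cluster-network> --source-ranges <master-ipv4-cidr> --target-tags <node-tag> --allow tcp:8080
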

bsamuels453 commented 3 years ago

Thank you so much @pinglin, I was encountering the same issue and that fixed it.

Here are some additional indicators I saw:

queglay commented 3 years ago

I saw this problem too when running on EC2 in a private subnet with no outbound access (NAT disabled). I was not using Kubernetes, just AMIs on instances.

I was able to log in, but could not run vault status or other queries. Enabling NAT fixed the problem.

Is it possible to deploy Vault with no outbound access?
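
It can be, provided every outbound dependency is reachable privately. For example, if the hang is Vault's call to AWS KMS for auto-unseal, a VPC interface endpoint for KMS removes the need for NAT. This is only a sketch: the region and resource IDs are placeholders, and it assumes KMS is the only blocked dependency (storage backends such as S3 would need their own endpoints).

# Create a private (interface) endpoint so instances in the private subnet
# can reach the KMS API without NAT or internet access.
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 --vpc-endpoint-type Interface --service-name com.amazonaws.us-east-1.kms --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0 --private-dns-enabled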

adv4000 commented 2 years ago

I have the same issue. I deployed Vault via the Helm chart with AWS KMS auto-unseal. After deployment, I logged into one of the Vault pods and executed vault operator init, which gave me the message Error initializing: context deadline exceeded. The next execution of vault operator init gave the message Vault is already initialized, and Vault started working, but I never got the root/master token. I also tried executing vault operator init > init.txt first, but the file stayed empty.

adv4000 commented 2 years ago

Fixed the issue with this command: export VAULT_CLIENT_TIMEOUT=300s. Basically, in k8s the Vault initialization is very slow and the default timeout of 60s is not enough. After Vault is deployed by the Helm chart, execute the following:

export VAULT_CLIENT_TIMEOUT=300s
vault status            # Initialized: false, Sealed: true, RecoveryType: awskms
vault operator init     # will print tokens
vault status            # Initialized: true, Sealed: false, RecoveryType: shamir
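
To avoid exporting this in every shell, the longer timeout can also be baked into the server pods themselves. A sketch, assuming the official hashicorp/vault chart and its server.extraEnvironmentVars value:

# Set the longer client timeout as a pod environment variable, so vault CLI
# calls made from inside the pods pick it up automatically.
helm upgrade vault hashicorp/vault --set server.extraEnvironmentVars.VAULT_CLIENT_TIMEOUT=300s
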
Glastis commented 2 years ago

Thanks @adv4000, it worked in my environment once I edited my command with your fix:

kubectl exec vault-0 -- '/bin/sh' '-c' 'export VAULT_CLIENT_TIMEOUT=500s && vault operator init -key-shares=1 -key-threshold=1 -format=json' > cluster-keys.json

This stores the keys in a local file on the host filesystem.
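
As a follow-up, the interesting fields can then be pulled out of that JSON with jq. This assumes the usual field names emitted by vault operator init -format=json; with an auto-unseal, the shares show up as recovery keys rather than unseal keys.

# Root token for the initial login.
jq -r '.root_token' cluster-keys.json
# Key shares: unseal keys for a shamir seal, recovery keys for auto-unseal.
jq -r '.unseal_keys_b64[]?, .recovery_keys_b64[]?' cluster-keys.json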

aphorise commented 1 year ago

These issues are almost always platform / infrastructure related. Hey @omerlh, how did you progress, and is this issue still applicable to you?

Maybe we'd want to get the related fix with VAULT_CLIENT_TIMEOUT documented and then close this? @Glastis @adv4000, any ideas? A PR seems to be in order, but I'm not sure where within the Kubernetes sections it should go.

If there are no updates or documentation suggestions, then I vote that this be closed.

omerlh commented 1 year ago

I think I gave up on automating Vault's first deployment, but it was a pretty long time ago, so I might be wrong about that :)

aphorise commented 1 year ago

It may be worth adding a note to the K8s docs about slower systems and longer initialisation times, where an increased timeout may be required.

export VAULT_CLIENT_TIMEOUT=500s

Closing as there are no further follow-ups.