hashicorp / vault-helm

Helm chart to install Vault and other associated components.

Default Timeout Settings on the Helm Chart #901

Open td4b opened 1 year ago

td4b commented 1 year ago

Is your feature request related to a problem? Please describe. Cluster communication timeouts are too tight by default when bootstrapping the Vault cluster via Helm on Kubernetes.

e.g., running vault operator init

Describe the solution you'd like Upgrading EKS from 1.21 to 1.24 seems to add network latency within the cluster. https://support.hashicorp.com/hc/en-us/articles/8552873602451-Vault-on-Kubernetes-and-context-deadline-exceeded-errors

What is interesting is that the Helm chart doesn't set this by default (which it should) to account for the increased latency between versions.

Adding the following setting fixed the issue entirely, and when Vault gets unsealed, the keys are output to the CLI without timing out.

set {
  name  = "server.extraEnvironmentVars.VAULT_CLIENT_TIMEOUT"
  value = "300s"
}
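
The same override can also be passed as rendered chart values instead of an individual set block; a minimal sketch using Terraform's yamlencode, assuming the same chart key as above:

# Inside the helm_release "vault" resource (full example below);
# equivalent to the set block above.
values = [yamlencode({
  server = {
    extraEnvironmentVars = {
      VAULT_CLIENT_TIMEOUT = "300s"
    }
  }
})]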

I am thinking that increasing the timeout (the Vault CLI default is 60 seconds) may help account for network latency in Kubernetes/EKS.

Full chart settings that worked:

resource "helm_release" "vault" {
  name       = "vault"
  repository = "https://helm.releases.hashicorp.com"
  chart      = "vault"
  namespace  = "vault"

  set {
    name  = "server.ha.enabled"
    value = "true"
  }
  set {
    name  = "server.ha.raft.enabled"
    value = "true"
  }
  set {
    name  = "server.ha.raft.setNodeId"
    value = "true"
  }
  set {
    name  = "server.extraEnvironmentVars.VAULT_CLIENT_TIMEOUT"
    value = "300s"
  }
  set {
    name  = "server.ha.raft.config"
    value = <<-EOT
      ui = true

      listener "tcp" {
        tls_disable     = 1
        address         = "[::]:8200"
        cluster_address = "[::]:8201"
      }

      storage "raft" {
        path = "/vault/data"
      }

      service_registration "kubernetes" {}
    EOT
  }
}
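
After applying, one way to confirm the variable actually reached the server pods is to check the pod environment, e.g. kubectl exec -n vault vault-0 -- env | grep VAULT_CLIENT_TIMEOUT (assuming the default vault-0 pod name in the vault namespace).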

Thanks.

Describe alternatives you've considered n/a

Additional context It took me a while to find the root cause, which surfaced as the Go error message "context deadline exceeded"; that led me to look at ways to increase the timeout value.

Note: this is just a suggestion and a breadcrumb for others.