hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
30.98k stars 4.19k forks source link

Vault CLI operations sometimes hang for 5 seconds #4627

Closed kristian-lesko closed 2 years ago

kristian-lesko commented 6 years ago

Hello everyone,

I've encountered the following issue with using the Vault CLI tool:

Describe the bug When running any vault CLI operation (like read, write, secrets enable etc.) on a 3-node Vault cluster with Consul backend, sometimes the operation freezes for (more or less exactly) 5 seconds before completing.

To Reproduce Steps to reproduce the behavior:

  1. Run a vault write command many times repeatedly
  2. Observe a 5-second delay in some cases

Expected behavior Operations running without the delay.

Environment:

Vault server configuration file(s):

backend "consul" {
  address = "https://vault01:8700"
  cluster_addr = "https://vault01:8700"
  redirect_addr = "https://vault01:8700"
  token = "..."
  path = "vault"
  tls_cert_file = "..."
  tls_key_file = "..."
  tls_ca_file = "..."
}

listener "tcp" {
  address = "vault01:443"
  cluster_address = "vault01:444"
  tls_disable = 0
  tls_cert_file = "..."
  tls_key_file = "..."
}

default_lease_ttl = "1800s"
max_lease_ttl = "1800s"
disable_mlock = false

telemetry {
  statsd_address = "127.0.0.1:8125"
  disable_hostname = true
}

api_addr = "https://vault01:443"
cluster_addr = "https://vault01:444"
plugin_directory = "/etc/vault/plugins"

Additional context This happens randomly but often enough that the typical usage of Vault gets annoying for the user. When tested in a loop (without any delay), the issue happens about 5-6 times per 100 operations; however, for separate operations with a few seconds of delay, this can happen much more often (sometimes as often as every third operation).

We have a DNS alias pointing to the A-records of each of the 3 Vault clusters; the problem seems to happen regardless of whether we use this DNS alias or a direct URL of any single node.

The issue does not happen when using the API directly (e.g., by curl).

The interesting thing is that when the freeze happens, it takes (more or less exactly) 5 seconds; i.e., when running 1000 operations in a loop, a single operation either takes 0.1xx seconds, or 5.1xx seconds, with nothing in between.

The operation always finishes correctly after the 5-second delay.

jefferai commented 6 years ago

See https://www.vaultproject.io/docs/commands/index.html#vault_max_retries

We added this because people complained that it didn't retry, now we are getting complaints that it does. We may disable it again in the future by default, but that value can be used to control behavior.

kristian-lesko commented 6 years ago

Hello @jefferai,

thanks for the quick reply. However, I am afraid that this issue is not caused by the VAULT_MAX_RETRIES value, as with any value (0, 10 or anything else) these 5-second freezes still continue to occur like before.

jefferai commented 6 years ago

What happens if you set it to -1? We've actually changed the logic for the upcoming 0.10.2 (different library) but I think the previous library may not have worked the way we thought.

kristian-lesko commented 6 years ago

When I set VAULT_MAX_RETRIES to -1, I receive the following error upon read:

failed to read environment: strconv.ParseUint: parsing "-1": invalid syntax

mikebell90 commented 6 years ago

Fwiw I have also experienced this. We are testing vault and run a bootstrap she'll script that sets up the backends and a few roles and secrets. Probably 10 calls to vault. Service is otherwise almost I utilized but I still see a noticeable hang when running script sometimes from my Mac.

jefferai commented 6 years ago

In 0.10.2, when it's out, try using VAULT_MAX_RETRIES=0. It's a different retry library.

kristian-lesko commented 6 years ago

@jefferai thanks for your help; unfortunately, the same issue keeps happening in version 0.10.2; out of 1000 calls, about 990 complete within 0.1 seconds, but the remaining ~10 calls still take exactly 5 seconds more (i.e. about 5.0 - 5.1 seconds).

sgmiller commented 5 years ago

Any resolution on this? We're running 1.0.0 and appear to be hitting this issue using the Golang client (not the CLI tool).

jefferai commented 5 years ago

If you mean our Golang API libraries, it has the same retry behavior. You may want to try setting it off and simply handling whatever the error is. (You may also want to look in your server logs for the error.)

gudeyang1 commented 5 years ago

facing the sam problems with vault 1.2.2 using vault-helm in AliCloud, only happens with vault CLI, get quick response with REST API calls

[~]$ time vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.2.2
Cluster Name             vault-cluster-04418e87
Cluster ID               a40880a0-f5c0-bf85-387c-89131efd27f2
HA Enabled               true
HA Cluster               https://172.20.0.189:8201
HA Mode                  standby
Active Node Address      https://172.20.0.189:8200

real    0m5.399s
user    0m0.068s
sys     0m0.008s
vishalnayak commented 3 years ago

@sgmiller Do you still see this issue? :-)

aphorise commented 2 years ago

I'm closing this request as this issue does creep up occasionally and in almost all cases it's driven by environmental / VM factors not something specific to Vault binary.

For many of the exact same versions (including 1.9.x & 1.10.x) that were observed to be slow (only via CLI) there were OS / host configuration and when tested with an alternative hypervisor everything worked fine.