hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

postgresql HA backend - vault auto seals when it cannot open connection to database due to exhausted local ports #11936

Open write0nly opened 3 years ago

write0nly commented 3 years ago

This issue was caught in QA/stress testing and is not really expected in a production environment; however, it could also be forced by users who can log in to the Vault server host.

Because Vault (when using the postgresql backend) makes fast, ephemeral connections to PostgreSQL to create and delete leases and entries in the DB, too many connections sitting in TIME_WAIT or CLOSE_WAIT can cause Vault to cycle through the entire range of local ports available for outbound connections and eventually run out of ports, printing errors of the following type:

Jun 24 19:38:50 vault-2 vault[32594]: 2021-06-24T19:38:50.179+0100 [ERROR] core: failed to create token: error="failed to persist accessor index entry: dial tcp 10.0.0.1:5432: connect: cannot assign requested address"
Jun 24 19:38:50 vault-2 vault[32594]: 2021-06-24T19:38:50.841+0100 [ERROR] core: failed to create token: error="failed to persist accessor index entry: dial tcp 10.0.0.1:5432: connect: cannot assign requested address"

After some time of this, Vault errors out and seals itself ("cannot assign requested address" means the kernel has no free local port left for a new outbound connection). In the case where Vault tries to expire leases upon startup and has too many of them (say 200k to expire), it exhausts the ports and then seals itself. This makes Vault unusable, because it keeps re-sealing itself over and over again.
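
To make the failure mode concrete, here is a minimal standalone Go sketch (not Vault code) of the same mechanism: dialing and immediately closing TCP connections in a tight loop leaves the client-side sockets in TIME_WAIT, and once the ephemeral port range is used up, new dials fail with exactly this "cannot assign requested address" error. The target address is a placeholder taken from the log lines above; point it at any listening TCP service.

    // Standalone demo: churn short-lived TCP connections until the local
    // ephemeral port range is exhausted. Not Vault code; the target address
    // is a placeholder.
    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        const target = "10.0.0.1:5432" // placeholder; must be a listening TCP service

        for i := 0; ; i++ {
            conn, err := net.Dial("tcp", target)
            if err != nil {
                // On Linux this typically surfaces as EADDRNOTAVAIL:
                // "connect: cannot assign requested address", once every
                // available source port is tied up in TIME_WAIT.
                fmt.Printf("dial failed after %d connections: %v\n", i, err)
                return
            }
            // Closing immediately puts the client-side socket into TIME_WAIT
            // (~60s on Linux), which keeps that local port reserved.
            conn.Close()
        }
    }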

If the user is persistent and also unseals Vault in a loop, Vault will eventually reach a stable point once the number of leases drops below 10k, which can be seen with this query:

vault=> select path, count(path) from vault_kv_store group by path having count(path) > 500;
                path                | count
------------------------------------+-------
 /sys/expire/id/auth/approle/login/ |  1826
 /sys/token/accessor/               |  3300
 /sys/token/id/                     |  3968
(3 rows)

Steps to reproduce the behavior:

  1. Have a Vault cluster (tested on v1.7.2) using the postgresql storage backend. This was tested in a cluster but may also happen with a single Vault instance.

  2. Run vault write auth/approle/login role_id=... secret_id=... in a loop millions of times until there are 200k+ outstanding Vault tokens waiting to expire (a Go sketch of such a loop follows this list).

  3. Stop all Vaults in the cluster, for example due to an upgrade.

  4. Make sure you have a large number of outstanding leases in the DB:

    vault=> select path, count(path) from vault_kv_store group by path having count(path) > 500;
                    path                | count
    ------------------------------------+--------
     /sys/expire/id/auth/approle/login/ | 226473
     /sys/token/accessor/               | 227460
     /sys/token/id/                     | 227691
    (3 rows)
  5. Restart and unseal the Vaults.
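
As referenced in step 2, here is a minimal Go sketch of the login loop, assuming the official client library github.com/hashicorp/vault/api; the role_id/secret_id values and the iteration count are placeholders, and the original report simply ran the vault CLI in a shell loop.

    // Sketch of step 2: create a large backlog of AppRole tokens and leases.
    // Assumes the official Go client; role_id/secret_id are placeholders.
    package main

    import (
        "log"

        "github.com/hashicorp/vault/api"
    )

    func main() {
        cfg := api.DefaultConfig() // honours VAULT_ADDR, VAULT_CACERT, etc.
        client, err := api.NewClient(cfg)
        if err != nil {
            log.Fatalf("failed to create client: %v", err)
        }

        data := map[string]interface{}{
            "role_id":   "REPLACE_ME", // placeholder
            "secret_id": "REPLACE_ME", // placeholder
        }

        // Every successful login creates a token and lease that Vault must
        // expire later; 200k+ of these reproduces the backlog described above.
        for i := 0; i < 200000; i++ {
            if _, err := client.Logical().Write("auth/approle/login", data); err != nil {
                log.Printf("login %d failed: %v", i, err)
            }
        }
    }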

Expected behavior

  1. Vault gets unsealed and works normally, deleting expired leases in the background.

Observed behavior

  1. Vault frantically tries to remove expired leases and delete lease entries from the tables, rapidly cycling through and exhausting all source ports to the PostgreSQL server. When no ports are available anymore, Vault starts erroring and then seals itself.

The Vault server configuration in use (trimmed):
cluster_name            = "test"
log_level               = "trace"
pid_file                = "/run/vault_pgsql.pid"

ui                      = true
disable_mlock           = true
verbose_oidc_logging    = true
raw_storage_endpoint    = true

# must have full protocol in the string
cluster_addr      = "https://..."
api_addr          = "https://..."

tls_require_and_verify_client_cert = "false"

listener "tcp" {
  address = "10.0.0.10:9999"
  tls_disable  = "false"
  tls_disable_client_certs = "true"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

storage "postgresql" {
    connection_url = "postgres://vault_user:password@dbhost:5432/vault?sslmode=disable"
    ha_enabled     = true
}
write0nly commented 3 years ago

For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. The obvious downside is having many connections open, but maybe max_parallel can be lowered too.

The following settings worked flawlessly:

    max_idle_connections = 256
    max_parallel = 128

IMHO this could become:

  1. change the default so that max_idle_connections >= max_parallel
  2. document this clearly on the postgresql backend page
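
For context, a hedged Go sketch of the underlying mechanism: Vault's postgresql backend sits on Go's database/sql connection pool, and when the idle cap is lower than the number of in-flight connections, each connection above the cap is closed as soon as it is returned, so the next request dials a fresh TCP connection from a new source port. The exact mapping of max_parallel/max_idle_connections onto the pool knobs below is an assumption; the numbers mirror the settings quoted above, and the connection string is the sanitized one from the config.

    // Illustration only: how Go's database/sql pool limits interact.
    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // PostgreSQL driver, assumed for this sketch
    )

    func main() {
        db, err := sql.Open("postgres",
            "postgres://vault_user:password@dbhost:5432/vault?sslmode=disable")
        if err != nil {
            log.Fatalf("open: %v", err)
        }

        // If MaxIdleConns were lower than the effective concurrency, finished
        // connections above the idle cap would be discarded instead of reused,
        // and each new request would burn a fresh ephemeral source port.
        db.SetMaxOpenConns(128) // roughly what max_parallel bounds
        db.SetMaxIdleConns(256) // keep at least as many idle slots as open ones

        if err := db.Ping(); err != nil {
            log.Printf("ping: %v", err)
        }
    }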

ncabatoff commented 3 years ago

> For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. The obvious downside is having many connections open, but maybe max_parallel can be lowered too.
>
> The following settings worked flawlessly:
>
>     max_idle_connections = 256
>     max_parallel = 128
>
> IMHO this could become:
>
>   1. change the default so that max_idle_connections >= max_parallel
>   2. document this clearly on the postgresql backend page

Hi @write0nly,

This suggestion makes good sense to me; I'm all for it. I'm not sure when we'll get to it, though, so feel free to submit a PR if you get impatient.

heatherezell commented 3 years ago

Hi @write0nly - following up on Nick's comment, was this work that you'd be interested in taking up and filing a PR for? Please let us know how we can help. Thanks!

icy commented 2 years ago

> For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. The obvious downside is having many connections open, but maybe max_parallel can be lowered too.
>
> The following settings worked flawlessly:
>
>     max_idle_connections = 256
>     max_parallel = 128
>
> IMHO this could become:
>
>   1. change the default so that max_idle_connections >= max_parallel
>   2. document this clearly on the postgresql backend page

Thanks for this. We have a small setup with a mysql backend and we faced the same issue. In our case, the following configuration also works smoothly:

    max_idle_connections = 10
    max_parallel = 5