hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Vault leader fails to issue credentials for PostgreSQL AWS RDS #11574

Open mgeggie opened 3 years ago

mgeggie commented 3 years ago

Describe the bug When running a Vault server cluster with 3 nodes, our leader becomes unable to issue credentials for PostgreSQL AWS RDS database instances.

To Reproduce Steps to reproduce the behavior:

  1. Run vault server. Leave running for a period of 2-4 hours.
  2. Run vault read database/creds/<rds_database>
  3. Vault hangs attempting to issue credentials.
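One way to tell the hang in step 3 apart from an ordinary error is to put a client-side deadline on the request. The following is a minimal Python sketch, not taken from this report: it assumes Vault's standard HTTP path `/v1/database/creds/<role>` and the `X-Vault-Token` header, and the address, token, role name, and 30-second timeout are all placeholders. It also assumes the server certificate is trusted by the system CA store.

```python
import json
import socket
import urllib.error
import urllib.request


def build_creds_request(vault_addr, token, role):
    """Build a GET request for dynamic database credentials."""
    url = f"{vault_addr.rstrip('/')}/v1/database/creds/{role}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def read_creds(vault_addr, token, role, timeout=30.0):
    """Return ("ok", data), ("error", detail) or ("timeout", None)."""
    req = build_creds_request(vault_addr, token, role)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "ok", json.load(resp).get("data")
    except urllib.error.HTTPError as exc:
        # Vault answered, but with an error status (for example 403 or 500).
        return "error", f"HTTP {exc.code}"
    except (socket.timeout, TimeoutError):
        # The connection opened but no response arrived before the deadline.
        return "timeout", None
    except urllib.error.URLError as exc:
        # Connect-phase failures are wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout", None
        return "error", str(exc.reason)
```

A `"timeout"` result corresponds to the behavior described here (the request never completes), while `"error"` would point at a different failure, such as a denied token or an unreachable listener.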

Expected behavior Vault should issue database user credentials for the PostgreSQL AWS RDS database instance.

Environment:

Vault server configuration file(s):

storage "consul" {
    address = "169.254.1.1:8500"
    path    = "vault/"
}

listener "tcp" {
    address = "0.0.0.0:8200"
    tls_key_file = "/etc/vault.d/vault-prod-complete.key.pem"
    tls_cert_file = "/etc/vault.d/vault-prod-20201207-complete.cert.pem"
    tls_min_version = "tls12"
    telemetry {
        unauthenticated_metrics_access = true
    }
}

telemetry {
    usage_gauge_period        = "5m"
    prometheus_retention_time = "24h"
    disable_hostname          = true
}

seal "awskms" {
    kms_key_id = "<kms_key_id>"
}

ui = "true"

log_level = "trace"

# NOTE: DO NOT change api_addr to use hostnames!
#
# Consul 1.6.1 does not support recursive DNS lookups. If this is set to a non-IP value
# when a lookup for vault.service.consul reaches Consul's DNS server, Consul will then
# resolve a CNAME to a Vault server's hostname, which it will then not be able to resolve itself
# to an IP address and result in a DNS resolution failure, even though the service will appear
# healthy in Consul!
api_addr = "https://<local_ip_address>:8200"

pid_file = "/var/run/vault.pid"

Additional context While our Vault servers are unable to issue credentials for this AWS RDS database, Vault continues to be able to issue static secrets stored in the secretv1 mount, and other database credentials for self-hosted PostgreSQL databases.

A stack trace from the failed Vault leader can also be found in the following Gist: https://gist.github.com/mgeggie/acbcc87c7bcecd75bdf93682e226b229

Logs from the failed Vault leader can be found in the following Gist: https://gist.github.com/mgeggie/9dcb29124932b1a0bb674060d1d7f9f1

vishalnayak commented 3 years ago

Hey there, I don't see anything outstanding in the logs. Is this still an issue, or were you able to get past it? From what you are describing, it seems to me that Vault is functioning properly with everything except this database plugin. Were you able to verify that this isn't a connection issue between Vault and the AWS RDS database?

mgeggie commented 3 years ago

This is still affecting us daily on both our test and production Vault clusters.

We've been able to confirm on multiple occasions that there is good connectivity between our Vault servers and these databases using psql.
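A manual psql session confirms reachability at one point in time; the same check could be scripted so it runs from the Vault host at the moment credential issuance stalls. The sketch below is an illustration only and assumes a plain TCP connect to the PostgreSQL port is a sufficient signal: it does not authenticate or speak the PostgreSQL protocol, and the host and port are placeholders.

```python
import socket


def can_connect(host, port=5432, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

If this returns True while `vault read` still hangs, the network path is unlikely to be the cause, which matches what is reported here.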

@vishalnayak is there a different place I should file a bug for the Vault PostgreSQL database plugin?

heatherezell commented 3 years ago

@mgeggie Does this work when the Vault server is newly running? I'm curious about the 2-4 hours in your initial comment, especially as we look for potential repro steps.

mgeggie commented 3 years ago

Hi @hsimon-hashicorp thanks for following up. Our only successful mitigation for this problem was to restart all of our Vault servers. After that our Vault leader was able to issue credentials again.

Another mitigation strategy we attempted was to restart the database plugin from the Vault API. That had no effect on functionality; in fact, the API call never returned an HTTP status code, and the client eventually timed out.

We signed a support agreement with HashiCorp and upgraded from Vault OSS 1.7.0 to Vault Enterprise 1.7.0 shortly after filing this bug. Thankfully, we haven't seen a recurrence of the problem since. FYI, the support case we filed did not reach a resolution either.

heatherezell commented 3 years ago

Hi! Can you provide information on the support case? I'm trying to more closely partner with the Vault support manager because so many of the issues we see cross those boundaries. It'll help us get more focus on these issues. Thanks so much! Feel free to email me - the first part of my GitHub username at hashicorp - with more details.

heatherezell commented 3 years ago

Received the requested information, will follow up the week of 9/13.