hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

PKI: Progressive performance degradation upon CA/issuer rotation #29083

Open aescaler-raft opened 22 hours ago

aescaler-raft commented 22 hours ago

Describe the bug

A PKI secrets engine on an HA Vault cluster with raft storage experiences progressive and permanent performance degradation after CA/issuer generate (root) or import (root or intermediate) operations. Other PKI engines on the same cluster are not impacted by a degraded engine, but each one degrades in the same progressive and permanent way when it undergoes the same operations. Performance can be restored by disabling and then re-enabling the engine at the previous endpoint.
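For reference, the disable/re-enable workaround can be sketched as below. This is a minimal sketch, assuming an authenticated vault CLI and the default pki mount path; the function name remount_pki is hypothetical. Disabling a secrets engine deletes everything stored under it, so this only suits test environments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: disable and re-enable a PKI mount at the same path.
# WARNING: disabling a secrets engine deletes all of its data
# (issuers, keys, issued certificates). Assumes the vault CLI is authenticated.
remount_pki() {
  local mount="${1:-pki}"
  # Only re-enable if the disable step succeeded.
  vault secrets disable "${mount}" &&
    vault secrets enable -path="${mount}" pki
}

# Usage:
#   remount_pki pki
```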

Testing parameters:

This performance issue is not experienced for CA CSR generation. To me, this indicates that this issue is not tied to the private key, but to the issuer.

Observations:

Tracing:

On a Vault cluster that is experiencing PKI engine performance degradation, I called the trace endpoint /v1/sys/pprof/trace with the parameter seconds=60, after which I called the PKI engine's generate root endpoint. Analysis under go tool trace <file> showed high latency in the synchronization blocking profile, specifically this goroutine: github.com/hashicorp/raft.(*raftState).goFunc.func1. Under the synchronization blocking profile page, I noticed the graph showed edges with times greater than 60s and nodes showed 0 of <time> (<percent>%) where <time> and <percent> are non-zero.
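The trace capture described above can be sketched roughly as follows. This assumes VAULT_ADDR and VAULT_TOKEN are exported and that the token is allowed to read sys/pprof; the function name capture_trace and the output file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: capture a runtime trace from Vault's pprof trace endpoint.
# Assumes VAULT_ADDR and VAULT_TOKEN are exported; names here are illustrative only.
capture_trace() {
  local seconds="${1:-60}"
  local out="${2:-vault-trace.out}"
  # -k mirrors the reproduction steps (untrusted certificate); drop it where TLS is verified.
  curl -sS -k \
    -H "X-Vault-Token: ${VAULT_TOKEN}" \
    -o "${out}" \
    "${VAULT_ADDR}/v1/sys/pprof/trace?seconds=${seconds}"
}

# Usage (issue the generate-root request from another shell while the trace records):
#   capture_trace 60 vault-trace.out
#   go tool trace vault-trace.out
```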

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Vault in HA mode on Kubernetes using the Helm chart, with raft integrated storage and an awskms seal
  2. Run vault secrets enable pki
  3. Run for i in $(seq 1 60); do time curl -k -X POST "https://<vault_url>/v1/pki/root/generate/internal" -H "X-Vault-Token: <vault_token>" --data '{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}'; done
  4. Observe increasing delay in responses
  5. Run sleep 600; time curl -k -X POST "https://<vault_url>/v1/pki/root/generate/internal" -H "X-Vault-Token: <vault_token>" --data '{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}'
  6. Observe no decrease in delay in responses
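To make the growth in latency easier to compare across runs, the loop in step 3 could record one timing per request instead of relying on time output. This is a minimal sketch, assuming VAULT_ADDR and VAULT_TOKEN are exported and the engine is mounted at pki/; the function name measure_root_generate is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: time repeated root generations and print "iteration,seconds" as CSV.
# Assumes VAULT_ADDR and VAULT_TOKEN are exported and the engine is mounted at pki/.
measure_root_generate() {
  local count="${1:-60}"
  local payload='{"common_name": "TEST CURL ROOT", "key_type": "rsa", "key_bits": 2048}'
  local i elapsed
  for i in $(seq 1 "${count}"); do
    # -w '%{time_total}' prints curl's total request time in seconds;
    # the response body is discarded with -o /dev/null.
    elapsed="$(curl -sS -k -o /dev/null -w '%{time_total}' \
      -X POST "${VAULT_ADDR}/v1/pki/root/generate/internal" \
      -H "X-Vault-Token: ${VAULT_TOKEN}" \
      --data "${payload}")"
    printf '%s,%s\n' "${i}" "${elapsed}"
  done
}

# Usage:
#   measure_root_generate 60 > latencies.csv
```

Running it again after the sleep in step 5 with a count of 1 gives a directly comparable figure for step 6.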

Expected behavior

Vault PKI engine performance should not degrade, or should at least recover over time.

Environment

Vault server configuration file(s)

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = true
  address = "[::]:8200"
  cluster_address = "[::]:8201"
}

storage "raft" {
  path = "/vault/data"
}

seal "awskms" {
  region = "eu-west-2"
  kms_key_id = "<clipped>"
  endpoint     = "http://local-kms:8081"
  access_key   = "dummy"
  secret_key   = "dummy"
}

service_registration "kubernetes" {}

Additional context

Attempted remediations:

aescaler-raft commented 22 hours ago

I realize that the example of 60 CA/issuer rotations, performed as quickly as Vault is capable, is unrealistic; however, the fact that performance never recovers indicates that this will be encountered eventually in normal use.

aescaler-raft commented 21 hours ago

I've also tested this on a fresh EKS cluster with the ebs-csi-provisioner platform storage backend and an actual AWS KMS key, and observed the same effect.

stevendpclark commented 19 hours ago

Hi @aescaler-raft, thanks for filing the issue.

I could have sworn we had an open issue around this problem already, but my search turned up empty. This is a known issue with having many issuers: all the CRLs are rebuilt every time a new issuer is created.

This shouldn't have a huge impact on day-to-day operations if the issuer count is kept low, which we highly recommend for various reasons; see: https://developer.hashicorp.com/vault/docs/secrets/pki/considerations#one-ca-certificate-one-secrets-engine
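As a rough way to check how many issuers a mount has accumulated, something like the following could be used. It is a sketch only, assuming an authenticated vault CLI and its default table output for list operations; the function name count_issuers is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the issuers in a PKI mount.
# Assumes the vault CLI is authenticated; "pki" is the mount path used in the report.
count_issuers() {
  local mount="${1:-pki}"
  # Assumption: default table output of `vault list` is a "Keys" header,
  # a "----" rule, then one key per line. Skip the two header lines and
  # count the remaining non-empty lines.
  vault list "${mount}/issuers" | tail -n +3 | grep -c .
}

# Usage:
#   count_issuers pki
```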

I'll keep the issue open for visibility, and as another reminder that we need to make CRL building smarter and more efficient within the PKI engine.

aescaler-raft commented 1 hour ago

Hi @stevendpclark,

Thanks for validating my observations and providing a recommended path forward. Introducing additional engines/endpoints would dramatically increase the complexity of an app I'm building for a customer, so I'll have a conversation with them and recommend that we contribute to the PKI engine to resolve this issue. I'll take a look at the contributors to the PKI engine codebase and figure out who will need to review our proposed changes; I'd like to start a dialogue early in this effort. Can you tell me if there's anyone else we might need to involve from the HashiCorp side to sign off on this, assuming it becomes a series of architectural decisions? Do you all have a Slack channel that I (and possibly members of my team) can join?

Regarding the documentation, is there any specific reason why this isn't documented in the page linked? I'd be happy to submit a PR to do so in the meantime.