hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

High CPU usage after upgrade to version 1.15.5 #26036

Open vmaletic opened 3 months ago

vmaletic commented 3 months ago

Describe the bug After upgrading from Vault version 1.15.4 to 1.15.5, there is high CPU usage on Vault servers when transit operations are called, even with a relatively small number of requests per second (RPS), causing CPU core usage to reach 100%.

To Reproduce Steps to reproduce the behavior:

  1. Execute HTTP API calls to transit/encrypt/my-key and transit/decrypt/my-key
  2. Monitor CPU usage of Vault primary node
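For anyone trying to reproduce this, the steps above can be sketched as a small load script. This is only a minimal sketch and not the reporter's actual test harness: it assumes the transit engine is mounted at `transit/`, that a valid token is available, and that the server certificate is trusted by the client. The function names (`transit_request`, `call`, `run_load`) are hypothetical. It is single-threaded, so reaching the RPS figures mentioned in this issue would likely need several copies running in parallel.

```python
import base64
import json
import time
import urllib.request


def transit_request(addr, token, op, key, field, value):
    """Build a POST request for transit/<op>/<key>; op is 'encrypt' or 'decrypt'."""
    url = f"{addr.rstrip('/')}/v1/transit/{op}/{key}"
    body = json.dumps({field: value}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def call(req):
    """Send the request and return the 'data' object of the JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]


def run_load(addr, token, key="my-key", rps=300, seconds=300):
    """Alternate encrypt and decrypt calls at roughly `rps` requests per second."""
    # Transit expects the plaintext to be base64-encoded.
    plaintext = base64.b64encode(b"load-test").decode()
    # Each loop iteration issues two requests (one encrypt, one decrypt).
    interval = 2.0 / rps
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        started = time.monotonic()
        enc = call(transit_request(addr, token, "encrypt", key, "plaintext", plaintext))
        call(transit_request(addr, token, "decrypt", key, "ciphertext", enc["ciphertext"]))
        # Sleep off whatever is left of this iteration's time budget.
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
```

Calling `run_load("https://<vault-host>:8200", "<token>")` while watching CPU usage on the active node corresponds to steps 1 and 2.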

Expected behavior After upgrading from Vault version 1.15.4 to 1.15.5, the CPU usage during transit operations should remain within acceptable limits. Specifically, the CPU core usage should not spike to 100% under small RPS.

Environment:

Vault server configuration file(s):

backend "consul" {
    address="127.0.0.1:8500"
    path="vault-uat01"
    ha_enabled="true"
}

listener "tcp" {
    address="xxx:8200"
    tls_disable=0
    tls_min_version="tls12"
    tls_cert_file="/etc/vault/ssl/vault.crt"
    tls_key_file="/etc/vault/ssl/vault.key"
    tls_cipher_suites = "TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"

}

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

max_lease_ttl = "1500h" 
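With the telemetry stanza above, metrics can be read in Prometheus text format from `sys/metrics?format=prometheus`. The sketch below is one possible way to pull out the samples related to transit. It is an assumption that filtering on the substring "transit" is enough to find the relevant series, and that sample lines carry no trailing timestamp; the function names and the metric name used in the usage note are hypothetical.

```python
import urllib.request


def metrics_request(addr, token):
    """Build a GET request for the Prometheus-formatted metrics endpoint."""
    url = f"{addr.rstrip('/')}/v1/sys/metrics?format=prometheus"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def filter_samples(prometheus_text, needle):
    """Return (series, value) pairs for sample lines that contain `needle`."""
    samples = []
    for line in prometheus_text.splitlines():
        line = line.strip()
        # Skip blanks, '# HELP' / '# TYPE' comments, and unrelated series.
        if not line or line.startswith("#") or needle not in line:
            continue
        # Assumes 'series value' with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples
```

For example, `filter_samples(text, "transit")` on a scrape containing a line such as `some_transit_metric_count 3` would yield `[("some_transit_metric_count", 3.0)]`. Comparing these values between 1.15.4 and 1.15.5 under the same load might help narrow down where the time goes.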

Additional context Vault telemetry for version 1.15.5, with a maximum of 300 RPS to the transit backends over a 5-minute test window:

CPU usage image

Transit usage image

Vault telemetry for version 1.15.4, with a maximum of 2000 RPS to the transit backends over a 45-minute test window:

CPU usage image

Transit usage image

vmaletic commented 3 months ago

The same behaviour is observable in version 1.15.6.

cleclefibanity commented 3 weeks ago

We have the same problem. Did you find the cause? We moved from 1.12 to 1.16.

vmaletic commented 3 weeks ago

Unfortunately, no. We are sticking with version 1.15.4. We tested all subsequent versions (1.15.5 and later, including 1.16.x) and observed the same behavior.

hsimon-hashicorp commented 2 weeks ago

Thank you for testing this on 1.16 as well. I'll bring it up to our engineers. :)

cleclefibanity commented 2 weeks ago

Weird stuff: we rotated the transit key, and that solved the issue. We don't understand what the difference could be, as the old and the new keys both work. Only the old one causes high CPU usage.

1337Seeker commented 2 weeks ago

yeah this is really strange behavior for sure, but that's pretty good news and something we will test and report back on

1337Seeker commented 2 weeks ago

Yesterday, we performed transit key rotation on all our transit secret engines. Subsequently, we upgraded to the latest version of Vault (1.17.0) and initiated our standard load testing. Unfortunately, we encountered significant performance degradation, which we had previously reported. Specifically:

Interestingly, reverting to Vault 1.15.4 resolved the issue entirely. With this version, performance is optimal, reaching up to 50-60% CPU load at 1000 RPS.

We are keen to understand why this performance discrepancy appears in versions from 1.15.5 up to 1.17.0. Any insights would be greatly appreciated.

cleclefibanity commented 2 weeks ago

Could you rotate your key again to see whether that fixes the problem? That's how we solved it.

cleclefibanity commented 1 week ago

No other info? We're about to rotate our key to solve the problem, but that's a pretty odd solution without a clear root cause.

1337Seeker commented 1 day ago

Out of interest, did you rotate your transit keys while running the latest version of Vault, or did you complete the transit key rotation on a specific version of Vault and then upgrade to the latest version?

Please provide more detail on what worked for you, so that we can test whether we can replicate the success you've reported. Thank you in advance!

cleclefibanity commented 1 day ago

We upgraded first. Then we realised that there was an issue and decided to rotate the keys (still on the newest version). After that, the problem was solved.
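For anyone wanting to try the same workaround, the sketch below shows the rotation call (`transit/keys/<name>/rotate`) and a way to check which key version a given ciphertext was produced with, based on the `vault:v<N>:` prefix that transit puts on ciphertexts. It says nothing about the root cause; it only helps confirm whether the slow requests are tied to ciphertexts from the older key version. The helper names are hypothetical, and the mount path `transit/` is assumed.

```python
import re
import urllib.request

# Transit ciphertexts look like "vault:v<key version>:<base64 payload>".
_CIPHERTEXT_PREFIX = re.compile(r"^vault:v(\d+):")


def ciphertext_key_version(ciphertext):
    """Return the key version encoded in a transit ciphertext prefix."""
    match = _CIPHERTEXT_PREFIX.match(ciphertext)
    if match is None:
        raise ValueError("not a transit ciphertext")
    return int(match.group(1))


def rotate_request(addr, token, key):
    """Build the POST request that rotates the named transit key."""
    url = f"{addr.rstrip('/')}/v1/transit/keys/{key}/rotate"
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"X-Vault-Token": token}
    )
```

After sending `rotate_request(...)` with `urllib.request.urlopen`, newly encrypted values should carry a higher version number, while previously stored ciphertexts keep their old prefix and are still decrypted with the old key version.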