hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

High Goroutine Count and Latency Issue in Vault HA Cluster #25623

Closed dedsxc closed 9 months ago

dedsxc commented 9 months ago

Describe the bug
I deployed a HashiCorp Vault HA cluster with Raft storage approximately three months ago. Recently, I have noticed significant latency when accessing the Vault UI. Upon investigation, I observed that the number of goroutines has exceeded 270.

Despite the high goroutine count, there is no notable activity in terms of incoming connections. Additionally, I enabled debug mode in the Vault logs, but there are no error logs or other remarkable entries that could explain the latency.

This issue is impacting the usability and performance of the Vault cluster, and I am seeking assistance in diagnosing and resolving the underlying cause. Any insights or suggestions for troubleshooting would be greatly appreciated.

PS: I tried increasing the resource requests on the pod, with no effect:

resources:
  requests:
    memory: 2Gi
    cpu: 2

To Reproduce
Steps to reproduce the behavior: deploy a 3-replica Raft storage cluster with version 1.15.5.

Expected behavior
Accessing the UI with no latency.

Environment:

Vault server configuration file(s):

disable_mlock = true
ui = true

listener "tcp" {
  tls_disable = 1
  address = "[::]:8200"
  cluster_address = "[::]:8201"
  # Enable unauthenticated metrics access (necessary for Prometheus Operator)
  telemetry {
   unauthenticated_metrics_access = "true"
  }
}

telemetry {
  disable_hostname = true
  prometheus_retention_time = "12h"
}

seal "awskms" {
  region = "us-east-1"
  kms_key_id = "my-key"
}

storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

service_registration "kubernetes" {}
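
With this configuration in place, cluster membership and HA status can be sanity-checked from the CLI. This is a sketch, not part of the original report; it assumes VAULT_ADDR points at one of the pods and that the client is authenticated with a token permitted to read raft membership:

```shell
# Check seal state, HA mode, and whether this node is active or standby
vault status

# List raft peers; the output should match the three retry_join targets
# (vault-0/1/2.vault-internal) declared in the storage stanza above
vault operator raft list-peers
```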

Additional context
(screenshot attached)

debug logs

==> Vault server configuration:

Administrative Namespace: 
             Api Address: http://10.251.38.229:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
   Environment Variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DD_AGENT_HOST, DD_ENTITY_ID, DD_ENV, DD_INSTRUMENTATION_TELEMETRY_ENABLED, DD_INTEGRATION_HTTPURLCONNECTION_ENABLED, DD_INTEGRATION_NETTY-4_0_ENABLED, DD_INTEGRATION_NETTY-4_1_ENABLED, DD_INTEGRATION_NETTY_ENABLED, DD_LOGS_INJECTION, DD_SERVICE, DD_VERSION, GODEBUG, GRAFANA_PORT, GRAFANA_PORT_3000_TCP, GRAFANA_PORT_3000_TCP_ADDR, GRAFANA_PORT_3000_TCP_PORT, GRAFANA_PORT_3000_TCP_PROTO, GRAFANA_SERVICE_HOST, GRAFANA_SERVICE_PORT, GRAFANA_SERVICE_PORT_HTTP, HOME, HOSTNAME, HOST_IP, KUBERNETES_PORT, KUBERNETES_PORT_443_TCP, KUBERNETES_PORT_443_TCP_ADDR, KUBERNETES_PORT_443_TCP_PORT, KUBERNETES_PORT_443_TCP_PROTO, KUBERNETES_SERVICE_HOST, KUBERNETES_SERVICE_PORT, KUBERNETES_SERVICE_PORT_HTTPS, NAME, PATH, POD_IP, PROMETHEUS_SERVER_TMP_PORT, PROMETHEUS_SERVER_TMP_PORT_80_TCP, PROMETHEUS_SERVER_TMP_PORT_80_TCP_ADDR, PROMETHEUS_SERVER_TMP_PORT_80_TCP_PORT, PROMETHEUS_SERVER_TMP_PORT_80_TCP_PROTO, PROMETHEUS_SERVER_TMP_SERVICE_HOST, PROMETHEUS_SERVER_TMP_SERVICE_PORT, PROMETHEUS_SERVER_TMP_SERVICE_PORT_HTTP, PWD, SHLVL, SKIP_CHOWN, SKIP_SETCAP, VAULT_ACTIVE_PORT, VAULT_ACTIVE_PORT_8200_TCP, VAULT_ACTIVE_PORT_8200_TCP_ADDR, VAULT_ACTIVE_PORT_8200_TCP_PORT, VAULT_ACTIVE_PORT_8200_TCP_PROTO, VAULT_ACTIVE_PORT_8201_TCP, VAULT_ACTIVE_PORT_8201_TCP_ADDR, VAULT_ACTIVE_PORT_8201_TCP_PORT, VAULT_ACTIVE_PORT_8201_TCP_PROTO, VAULT_ACTIVE_SERVICE_HOST, VAULT_ACTIVE_SERVICE_PORT, VAULT_ACTIVE_SERVICE_PORT_HTTP, VAULT_ACTIVE_SERVICE_PORT_HTTPS_INTERNAL, VAULT_ADDR, VAULT_API_ADDR, VAULT_CLUSTER_ADDR, VAULT_K8S_NAMESPACE, VAULT_K8S_POD_NAME, VAULT_LOG_FORMAT, VAULT_LOG_LEVEL, VAULT_PORT, VAULT_PORT_8200_TCP, VAULT_PORT_8200_TCP_ADDR, VAULT_PORT_8200_TCP_PORT, VAULT_PORT_8200_TCP_PROTO, VAULT_PORT_8201_TCP, VAULT_PORT_8201_TCP_ADDR, VAULT_PORT_8201_TCP_PORT, VAULT_PORT_8201_TCP_PROTO, VAULT_SERVICE_HOST, VAULT_SERVICE_PORT, VAULT_SERVICE_PORT_HTTP, VAULT_SERVICE_PORT_HTTPS_INTERNAL, VAULT_STANDBY_PORT, 
VAULT_STANDBY_PORT_8200_TCP, VAULT_STANDBY_PORT_8200_TCP_ADDR, VAULT_STANDBY_PORT_8200_TCP_PORT, VAULT_STANDBY_PORT_8200_TCP_PROTO, VAULT_STANDBY_PORT_8201_TCP, VAULT_STANDBY_PORT_8201_TCP_ADDR, VAULT_STANDBY_PORT_8201_TCP_PORT, VAULT_STANDBY_PORT_8201_TCP_PROTO, VAULT_STANDBY_SERVICE_HOST, VAULT_STANDBY_SERVICE_PORT, VAULT_STANDBY_SERVICE_PORT_HTTP, VAULT_STANDBY_SERVICE_PORT_HTTPS_INTERNAL, VERSION
              Go Version: go1.21.5
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: debug
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: raft (HA available)
                 Version: Vault v1.15.5, built 2024-01-26T14:53:40Z
             Version Sha: 0d8b67ef63815f20421c11fe9152d435af3403e6

==> Vault server started! Log data will stream in below:

{"@level":"info","@message":"proxy environment","@timestamp":"2024-02-26T10:37:17.296916Z","http_proxy":"","https_proxy":"","no_proxy":""}
{"@level":"warn","@message":"raft FSM db file has wider permissions than needed","@module":"storage.raft.fsm","@timestamp":"2024-02-26T10:37:17.297804Z","existing":432,"needed":384}
{"@level":"debug","@message":"time to open database","@module":"storage.raft.fsm","@timestamp":"2024-02-26T10:37:17.363038Z","elapsed":65103461,"path":"/vault/data/vault.db"}
{"@level":"debug","@message":"\"namespace\": \"hc-vault\"","@module":"service_registration.kubernetes","@timestamp":"2024-02-26T10:37:17.599849Z"}
{"@level":"debug","@message":"\"pod_name\": \"vault-0\"","@module":"service_registration.kubernetes","@timestamp":"2024-02-26T10:37:17.599869Z"}
{"@level":"info","@message":"incrementing seal generation","@timestamp":"2024-02-26T10:37:17.632318Z","generation":1}
{"@level":"debug","@message":"set config","@module":"core","@timestamp":"2024-02-26T10:37:17.632509Z","sanitized config":"{\"administrative_namespace_path\":\"\",\"api_addr\":\"\",\"cache_size\":0,\"cluster_addr\":\"https://vault-0.vault-internal:8201\",\"cluster_cipher_suites\":\"\",\"cluster_name\":\"\",\"default_lease_ttl\":0,\"default_max_request_duration\":0,\"detect_deadlocks\":\"\",\"disable_cache\":false,\"disable_clustering\":false,\"disable_indexing\":false,\"disable_mlock\":true,\"disable_performance_standby\":false,\"disable_printable_check\":false,\"disable_sealwrap\":false,\"disable_sentinel_trace\":false,\"enable_response_header_hostname\":false,\"enable_response_header_raft_node_id\":false,\"enable_ui\":true,\"experiments\":null,\"imprecise_lease_role_tracking\":false,\"introspection_endpoint\":false,\"listeners\":[{\"config\":{\"address\":\"[::]:8200\",\"cluster_address\":\"[::]:8201\",\"telemetry\":[{\"unauthenticated_metrics_access\":\"true\"}],\"tls_disable\":1},\"type\":\"tcp\"}],\"log_format\":\"json\",\"log_level\":\"debug\",\"log_requests_level\":\"\",\"max_lease_ttl\":0,\"pid_file\":\"\",\"plugin_directory\":\"\",\"plugin_file_permissions\":0,\"plugin_file_uid\":0,\"raw_storage_endpoint\":false,\"seals\":[{\"disabled\":false,\"name\":\"awskms\",\"type\":\"awskms\"}],\"service_registration\":{\"type\":\"kubernetes\"},\"storage\":{\"cluster_addr\":\"\",\"disable_clustering\":false,\"raft\":{\"max_entry_size\":\"\"},\"redirect_addr\":\"\",\"type\":\"raft\"},\"telemetry\":{\"add_lease_metrics_namespace_labels\":false,\"add_mount_point_rollback_metrics\":false,\"circonus_api_app\":\"\",\"circonus_api_token\":\"\",\"circonus_api_url\":\"\",\"circonus_broker_id\":\"\",\"circonus_broker_select_tag\":\"\",\"circonus_check_display_name\":\"\",\"circonus_check_force_metric_activation\":\"\",\"circonus_check_id\":\"\",\"circonus_check_instance_id\":\"\",\"circonus_check_search_tag\":\"\",\"circonus_check_tags\":\"\",\"circonus_submission_interval\":\"\"
,\"circonus_submission_url\":\"\",\"disable_hostname\":true,\"dogstatsd_addr\":\"\",\"dogstatsd_tags\":null,\"lease_metrics_epsilon\":3600000000000,\"maximum_gauge_cardinality\":500,\"metrics_prefix\":\"\",\"num_lease_metrics_buckets\":168,\"prometheus_retention_time\":43200000000000,\"stackdriver_debug_logs\":false,\"stackdriver_location\":\"\",\"stackdriver_namespace\":\"\",\"stackdriver_project_id\":\"\",\"statsd_address\":\"\",\"statsite_address\":\"\",\"usage_gauge_period\":600000000000}}"}
{"@level":"debug","@message":"creating LRU cache","@module":"storage.cache","@timestamp":"2024-02-26T10:37:17.632542Z","size":0}
{"@level":"info","@message":"Initializing version history cache for core","@module":"core","@timestamp":"2024-02-26T10:37:17.632788Z"}
{"@level":"info","@message":"Starting event system","@module":"events","@timestamp":"2024-02-26T10:37:17.632799Z"}
{"@level":"debug","@message":"cluster listener addresses synthesized","@timestamp":"2024-02-26T10:37:17.632965Z","cluster_addresses":[{"IP":"::","Port":8201,"Zone":""}]}
{"@level":"info","@message":"stored unseal keys supported, attempting fetch","@module":"core","@timestamp":"2024-02-26T10:37:17.633328Z"}
{"@level":"debug","@message":"starting cluster listeners","@module":"core","@timestamp":"2024-02-26T10:37:17.658076Z"}
{"@level":"info","@message":"starting listener","@module":"core.cluster-listener.tcp","@timestamp":"2024-02-26T10:37:17.658092Z","listener_address":{"IP":"::","Port":8201,"Zone":""}}
{"@level":"info","@message":"serving cluster requests","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:17.658128Z","cluster_listen_address":{"IP":"::","Port":8201,"Zone":""}}
{"@level":"info","@message":"creating Raft","@module":"storage.raft","@timestamp":"2024-02-26T10:37:17.658352Z","config":"\u0026raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"6dedbcd8-556a-a469-1c26-fd720ea6d277\", NotifyCh:(chan\u003c- bool)(0xc002ecc1c0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc002e8e0c0), NoSnapshotRestoreOnStart:true, skipStartup:false}"}
{"@level":"info","@message":"initial configuration","@module":"storage.raft","@timestamp":"2024-02-26T10:37:17.660096Z","index":591047,"servers":"[{Suffrage:Voter ID:6dedbcd8-556a-a469-1c26-fd720ea6d277 Address:vault-0.vault-internal:8201} {Suffrage:Voter ID:946ce323-392b-e5a9-7737-45f6af64dfe6 Address:vault-2.vault-internal:8201} {Suffrage:Voter ID:4b8e03a9-16ca-666b-e26a-dd9fb08f3f13 Address:vault-1.vault-internal:8201}]"}
{"@level":"info","@message":"vault is unsealed","@module":"core","@timestamp":"2024-02-26T10:37:17.660197Z"}
{"@level":"info","@message":"entering standby mode","@module":"core","@timestamp":"2024-02-26T10:37:17.660220Z"}
{"@level":"info","@message":"entering follower state","@module":"storage.raft","@timestamp":"2024-02-26T10:37:17.660240Z","follower":{},"leader-address":"","leader-id":""}
{"@level":"debug","@message":"would have sent systemd notification (systemd not present)","@timestamp":"2024-02-26T10:37:17.706150Z","notification":"READY=1"}
{"@level":"info","@message":"unsealed with stored key","@module":"core","@timestamp":"2024-02-26T10:37:17.895075Z"}
{"@level":"debug","@message":"parsing information for new active node","@module":"core","@timestamp":"2024-02-26T10:37:20.160708Z","active_cluster_addr":"https://vault-1.vault-internal:8201","active_redirect_addr":"http://10.251.32.111:8200"}
{"@level":"debug","@message":"refreshing forwarding connection","@module":"core","@timestamp":"2024-02-26T10:37:20.160779Z","clusterAddr":"https://vault-1.vault-internal:8201"}
{"@level":"debug","@message":"clearing forwarding clients","@module":"core","@timestamp":"2024-02-26T10:37:20.160791Z"}
{"@level":"debug","@message":"done clearing forwarding clients","@module":"core","@timestamp":"2024-02-26T10:37:20.160798Z"}
{"@level":"debug","@message":"done refreshing forwarding connection","@module":"core","@timestamp":"2024-02-26T10:37:20.161253Z","clusterAddr":"https://vault-1.vault-internal:8201"}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:20.161290Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2024-02-26T10:37:20.163962Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:21.164634Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:22.707146Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2024-02-26T10:37:25.161968Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:25.378127Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:30.084432Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2024-02-26T10:37:30.161818Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2024-02-26T10:37:35.162218Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:35.869369Z","address":"vault-1.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-8911088e-eb2b-e7fe-52f3-1310e93fa987"}
{"@level":"debug","@message":"performing server cert lookup","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:38.633670Z"}
{"@level":"debug","@message":"accepted connection","@module":"storage.raft.raft-net","@timestamp":"2024-02-26T10:37:38.662186Z","local-address":"vault-0.vault-internal:8201","remote-address":"10.251.35.96:59438"}
{"@level":"debug","@message":"lost leadership because received a requestVote with a newer term","@module":"storage.raft","@timestamp":"2024-02-26T10:37:38.662742Z"}
{"@level":"warn","@message":"failed to get previous log","@module":"storage.raft","@timestamp":"2024-02-26T10:37:38.668719Z","error":"log not found","last-index":604840,"previous-index":604890}
{"@level":"debug","@message":"performing server cert lookup","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:39.388910Z"}
{"@level":"debug","@message":"accepted connection","@module":"storage.raft.raft-net","@timestamp":"2024-02-26T10:37:39.399108Z","local-address":"vault-0.vault-internal:8201","remote-address":"10.251.35.96:59452"}
{"@level":"debug","@message":"forwarding: error sending echo request to active node","@module":"core","@timestamp":"2024-02-26T10:37:40.161856Z","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: remote error: tls: internal error\""}
{"@level":"debug","@message":"parsing information for new active node","@module":"core","@timestamp":"2024-02-26T10:37:40.165117Z","active_cluster_addr":"https://vault-2.vault-internal:8201","active_redirect_addr":"http://10.251.35.96:8200"}
{"@level":"debug","@message":"refreshing forwarding connection","@module":"core","@timestamp":"2024-02-26T10:37:40.165200Z","clusterAddr":"https://vault-2.vault-internal:8201"}
{"@level":"debug","@message":"clearing forwarding clients","@module":"core","@timestamp":"2024-02-26T10:37:40.165213Z"}
{"@level":"debug","@message":"forwarding: stopping heartbeating","@module":"core","@timestamp":"2024-02-26T10:37:40.165248Z"}
{"@level":"debug","@message":"done clearing forwarding clients","@module":"core","@timestamp":"2024-02-26T10:37:40.165278Z"}
{"@level":"debug","@message":"done refreshing forwarding connection","@module":"core","@timestamp":"2024-02-26T10:37:40.165452Z","clusterAddr":"https://vault-2.vault-internal:8201"}
{"@level":"debug","@message":"creating rpc dialer","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:40.165515Z","address":"vault-2.vault-internal:8201","alpn":"req_fw_sb-act_v1","host":"fw-2e5a3692-2657-37a8-b6fa-df92e332a736"}
{"@level":"debug","@message":"performing client cert lookup","@module":"core.cluster-listener","@timestamp":"2024-02-26T10:37:40.174007Z"}

pprof output

go tool pprof goroutine_endpoint
File: vault
Type: goroutine
Time: Feb 26, 2024 at 3:52pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 282, 100% of 282 total
Showing top 10 nodes out of 138
      flat  flat%   sum%        cum   cum%
       280 99.29% 99.29%        280 99.29%  runtime.gopark
         1  0.35% 99.65%          1  0.35%  runtime.goroutineProfileWithLabels
         1  0.35%   100%          1  0.35%  runtime.notetsleepg
         0     0%   100%          2  0.71%  bufio.(*Reader).Peek
         0     0%   100%          2  0.71%  bufio.(*Reader).fill
         0     0%   100%          2  0.71%  bytes.(*Buffer).ReadFrom
         0     0%   100%          2  0.71%  crypto/tls.(*Conn).Read
         0     0%   100%          2  0.71%  crypto/tls.(*Conn).readFromUntil
         0     0%   100%          2  0.71%  crypto/tls.(*Conn).readRecord (inline)
         0     0%   100%          2  0.71%  crypto/tls.(*Conn).readRecordOrCCS

(three screenshots attached)

dedsxc commented 9 months ago

Update on this issue:

After disabling the socket audit device, Vault no longer exhibits the latency. I had enabled the socket audit device to integrate with an Elasticsearch SIEM, but that SIEM is no longer running on our K8s cluster, so I suspect the audit device was retrying the connection to the Elastic agent in a loop. I can't confirm this, as there are no logs about it, but disabling the socket audit device solved my issue.
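
The fix described above can be sketched with the standard audit CLI. The mount path "socket/" is an assumption (the default for a socket audit device); the actual path should be taken from the list output:

```shell
# List enabled audit devices and their options to find the socket device
vault audit list -detailed

# Disable it; "socket/" is the assumed mount path, substitute the real one
vault audit disable socket/
```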

Note: could you add logging for audit devices?