hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Unexpected behavior of vault_core_active metrics #21912

Open iveahugeship opened 1 year ago

iveahugeship commented 1 year ago

Describe the bug

Hello!

According to the documentation, vault.core.active returns 1 when the Vault node is active and 0 when the node is in standby. The problem is that it returns unexpected values: the series with the cluster label returns 1 on every node, and the series without that label returns 0 on every node.


| Element | Value |
| -- | -- |
| vault_core_active{cluster="vault-cluster-85407450",instance="X.X.X.X:8200",job="vault"} | 1 |
| vault_core_active{cluster="vault-cluster-85407450",instance="X.X.X.X:8200",job="vault"} | 1 |
| vault_core_active{cluster="vault-cluster-85407450",instance="X.X.X.X:8200",job="vault"} | 1 |
| vault_core_active{instance="X.X.X.X:8200",job="vault"} | 0 |
| vault_core_active{instance="X.X.X.X:8200",job="vault"} | 0 |
| vault_core_active{instance="X.X.X.X:8200",job="vault"} | 0 |

But the Consul tags show the real states: one node is active and the others are standby.
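To make the split easier to see, here is a minimal Python sketch that groups the samples of an instant query by whether they carry a `cluster` label. It assumes the input is the `data.result` list from the Prometheus HTTP API. The function name `split_by_cluster_label`, the `INSTANCES` placeholders, and the sample data are hypothetical and only mirror the table above.

```python
from collections import defaultdict

# Hypothetical placeholder instances; the real addresses are redacted in the report.
INSTANCES = ["node-a:8200", "node-b:8200", "node-c:8200"]


def split_by_cluster_label(result):
    """Group instant-vector samples by whether they carry a `cluster` label."""
    groups = defaultdict(list)
    for sample in result:
        labels = sample["metric"]
        # Prometheus encodes each sample as [timestamp, "value-as-string"].
        value = float(sample["value"][1])
        key = "with_cluster" if "cluster" in labels else "without_cluster"
        groups[key].append((labels.get("instance"), value))
    return dict(groups)


# Sample data shaped like the table above: every labelled series reports 1,
# every unlabelled series reports 0.
sample_result = [
    {
        "metric": {
            "__name__": "vault_core_active",
            "cluster": "vault-cluster-85407450",
            "instance": inst,
            "job": "vault",
        },
        "value": [0, "1"],
    }
    for inst in INSTANCES
] + [
    {
        "metric": {"__name__": "vault_core_active", "instance": inst, "job": "vault"},
        "value": [0, "0"],
    }
    for inst in INSTANCES
]

groups = split_by_cluster_label(sample_result)
for name, samples in groups.items():
    print(name, samples)
```

With data shaped like the table, neither group matches the expected "one active, two standby" picture, which is the mismatch this report describes.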

To Reproduce

Steps to reproduce the behavior:

  1. Setup and run a Raft cluster
  2. Enable Vault telemetry and configure Prometheus to scrape Vault metrics
  3. Open the Prometheus UI and run the PromQL query vault_core_active.
  4. See error
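Step 3 can also be run without the UI. The following is a rough Python sketch that sends the same query to the Prometheus HTTP API (`/api/v1/query`). The base URL `http://localhost:9090` and the helper names `build_query_url` and `run_instant_query` are assumptions for illustration, not part of the original setup.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here; adjust to your environment.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Build the instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def run_instant_query(base_url, promql, timeout=5.0):
    """Run an instant query and return the list of samples (`data.result`)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload)
    return payload["data"]["result"]


# Example usage (needs a running Prometheus, so it is left commented out):
# for sample in run_instant_query(PROMETHEUS_URL, "vault_core_active"):
#     print(sample["metric"], sample["value"][1])
```

Each returned sample carries its full label set, so series with and without the `cluster` label show up as separate entries.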

Expected behavior

A value of 0 for standby nodes and a value of 1 for the active node.

Environment:

Vault server configuration file(s):

ui = true
disable_mlock = true
api_addr = "https://{{ GetInterfaceIP \"eth0\" }}:8200"
cluster_addr = "https://{{ GetInterfaceIP \"eth0\" }}:8201"

listener "unix" {
  address = "/var/vault.d/vault.sock"
} 

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_cert_file = "/etc/tls/vault/server.crt"
  tls_key_file = "/etc/tls/vault/server.key"
}

storage "raft" {
  retry_join {
    auto_join = "provider=os auth_url=*** project_id=*** tag_key=*** tag_value=***"
    auto_join_scheme = "https"
  }

  path = "/var/vault.d"
}

telemetry {
  disable_hostname = true
  prometheus_retention_time = "12h"
}

service_registration "consul" {
  address = "127.0.0.1:8500"
}

Additional context

Prometheus version: 2.15.2

maxb commented 1 year ago

For clarity, I'd like to state up front that I'm not a HashiCorp employee, just a contributor who happens to have worked with Vault and Prometheus before.

This is a new presentation of the same underlying issue talked about in #11988.

Vault's default behaviour when using Prometheus metrics is really quite unhelpful, with many undocumented/underdocumented pitfalls.

HashiCorp people: Is there any way we could reignite the stalled conversations in hashicorp/go-metrics#136, hashicorp/consul#13495, hashicorp/consul#13498, and #11988 (yes, some of those are Consul issues, but it applies to both products)?

iveahugeship commented 1 year ago

@maxb thanks for your reply.

I've found that these metrics don't work properly: