hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Incorrect metric consul_memberlist_node_instances #19014

Open · Foxsa opened this issue 1 year ago

Foxsa commented 1 year ago

Overview of the Issue

When one of the nodes goes down (server crash), it stays in the 'dead' state in the Consul UI for 72 hours, as expected. However, the Prometheus metric consul_memberlist_node_instances{node_state="dead"} on the Consul server stays at 1 for only a couple of minutes. The metric consul_memberlist_node_instances{node_state="alive"} goes down by 1, as expected.
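For context on why the two views can diverge: a labeled gauge like this is typically produced by counting members per state on some internal member list. The sketch below is NOT Consul's actual implementation, just an illustration of the pattern using the armon/go-metrics API that Consul builds on; the `member` type and `emitNodeInstances` function are hypothetical. If the list being counted drops dead members early, the dead gauge drops with it, even while the UI (driven by Serf's 72h reap) still shows the node as failed.

```go
// Conceptual sketch, not Consul's source: emitting a per-state node gauge.
package main

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

// member is a hypothetical stand-in for whatever node record the real
// emitter iterates over.
type member struct {
	Name  string
	State string // "alive", "suspect", "dead", "left"
}

func emitNodeInstances(members []member) {
	counts := map[string]int{"alive": 0, "suspect": 0, "dead": 0, "left": 0}
	for _, m := range members {
		counts[m.State]++
	}
	for state, n := range counts {
		metrics.SetGaugeWithLabels(
			// Rendered as consul_memberlist_node_instances in Prometheus format.
			[]string{"memberlist", "node", "instances"},
			float32(n),
			[]metrics.Label{{Name: "node_state", Value: state}},
		)
	}
}

func main() {
	// In-memory sink just so the sketch runs standalone.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)
	metrics.NewGlobal(metrics.DefaultConfig("consul"), sink)

	emitNodeInstances([]member{
		{Name: "client-1", State: "alive"},
		{Name: "client-2", State: "dead"},
	})
}
```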


Reproduction Steps

1. Run two machines: one Consul server agent and one Consul client agent.

2. Power down the Consul client machine.

3. Monitor http://consul-server.lan:8500/ui/dc1/nodes and http://consul-server.lan:8500/v1/agent/metrics?format=prometheus (a polling sketch follows these steps).
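For step 3, a minimal polling loop can timestamp exactly when the dead gauge drops. This is a sketch, assuming the agent HTTP API is reachable at the hostname from the steps above with no ACL token:

```go
// Poll the agent's Prometheus endpoint and print only the gauge under
// investigation, with a timestamp, so the drop can be correlated with the UI.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const url = "http://consul-server.lan:8500/v1/agent/metrics?format=prometheus"
	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("fetch error:", err)
			time.Sleep(15 * time.Second)
			continue
		}
		sc := bufio.NewScanner(resp.Body)
		for sc.Scan() {
			line := sc.Text()
			if strings.HasPrefix(line, "consul_memberlist_node_instances") {
				fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), line)
			}
		}
		resp.Body.Close()
		time.Sleep(15 * time.Second)
	}
}
```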

Consul info for both Client and Server

Client info:
```
agent:
    check_monitors = 0
    check_ttls = 6
    checks = 9
    services = 9
build:
    prerelease =
    revision = 192df66a
    version = 1.16.0
    version_metadata =
consul:
    acl = disabled
    known_servers = 5
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 100
    max_procs = 4
    os = linux
    version = go1.20.4
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 120
    failed = 1
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 16066
    members = 61
    query_queue = 0
    query_time = 2
```
```
data_dir = "/opt/consul"
```
Server info:
```
agent:
    check_monitors = 0
    check_ttls = 5
    checks = 9
    services = 9
build:
    prerelease =
    revision = 192df66a
    version = 1.16.0
    version_metadata =
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 192.168.0.1:8300
    server = true
raft:
    applied_index = 3628061
    commit_index = 3628061
    fsm_pending = 0
    last_contact = 0
    last_log_index = 3628061
    last_log_term = 22
    last_snapshot_index = 3623276
    last_snapshot_term = 22
    latest_configuration = []
    latest_configuration_index = 0
    num_peers = 4
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 22
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 941
    max_procs = 4
    os = linux
    version = go1.20.4
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 120
    failed = 1
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 16066
    members = 61
    query_queue = 0
    query_time = 2
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 215
    members = 5
    query_queue = 0
    query_time = 1
```
```
data_dir = "/opt/consul"
```

Operating system and Environment details

Log Fragments

huikang commented 1 year ago

@Foxsa, thanks for reporting the issue. To help troubleshoot, could you provide more details:

1) How many nodes are in the cluster (server and client agents)?
2) The label key/values from the consul_memberlist_node_instances{node_state="alive"} and consul_memberlist_node_instances{node_state="dead"} metrics.
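For anyone gathering the requested label sets, one way is to parse the agent's Prometheus output with the standard text-format parser. This is a sketch, assuming the same metrics endpoint as in the reproduction steps:

```go
// Dump every label set (and gauge value) of consul_memberlist_node_instances
// using the Prometheus text-format parser.
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	resp, err := http.Get("http://consul-server.lan:8500/v1/agent/metrics?format=prometheus")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}
	mf, ok := families["consul_memberlist_node_instances"]
	if !ok {
		fmt.Println("metric not found")
		return
	}
	for _, m := range mf.GetMetric() {
		for _, lp := range m.GetLabel() {
			fmt.Printf("%s=%q ", lp.GetName(), lp.GetValue())
		}
		fmt.Printf("value=%v\n", m.GetGauge().GetValue())
	}
}
```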

Foxsa commented 1 year ago

Hi @huikang ,

Currently we have 56 clients and 5 servers. The label key/values of consul_memberlist_node_instances{node_state="alive"} are:

The label key/values of consul_memberlist_node_instances{node_state="dead"}: