Open Foxsa opened 1 year ago
@Foxsa, thanks for reporting the issue. To help troubleshoot, could you provide more details?
1) How many nodes are in the cluster (server and client agents)?
2) The key/value pairs of the labels from the consul_memberlist_node_instances and consul_memberlist_node_instances metrics.
Hi @huikang ,
Currently we have 56 clients and 5 servers. KV of consul_memberlist_node_instances(alive) are:
KV of consul_memberlist_node_instances(dead):
Overview of the Issue
When one of the nodes goes down (server crash), it stays in the 'dead' state in the Consul interface for 72 hours (as expected). But the Prometheus metric of the Consul server
consul_memberlist_node_instances{node_state="dead"}
stays at 1 for only a couple of minutes. The metric consul_memberlist_node_instances{node_state="alive"}
goes down by 1 as expected.
Reproduction Steps
2 machines running:
Power down Consul client machine
Monitor http://consul-server.lan:8500/ui/dc1/nodes and http://consul-server.lan:8500/v1/agent/metrics?format=prometheus
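To make the monitoring step repeatable, the metrics endpoint can be polled and the `consul_memberlist_node_instances` series extracted by `node_state`. The following is a minimal sketch, not part of the original report: the helper names `node_instances_by_state` and `fetch_metrics` are hypothetical, the URL is the one from the step above, and the sketch assumes the endpoint returns standard Prometheus text exposition format.

```python
import re
import urllib.request
from collections import defaultdict

# Metric name and URL are taken from the issue text above.
METRIC = "consul_memberlist_node_instances"
METRICS_URL = "http://consul-server.lan:8500/v1/agent/metrics?format=prometheus"

# One sample line in Prometheus text format: name{labels} value [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def node_instances_by_state(metrics_text):
    """Return {node_state: value} for the memberlist node-instances metric.

    Hypothetical helper. If several series share the same node_state
    (for example because of additional labels), their values are summed;
    that aggregation is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("node_state", "unknown")] += float(match.group("value"))
    return dict(totals)


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Fetch the raw Prometheus text from the Consul agent (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `node_instances_by_state(fetch_metrics())` every few seconds after powering down the client, and logging the result with a timestamp, would show when the `dead` count drops back while the node is still listed in the UI.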
Consul info for both Client and Server
Client info
```
agent:
	check_monitors = 0
	check_ttls = 6
	checks = 9
	services = 9
build:
	prerelease =
	revision = 192df66a
	version = 1.16.0
	version_metadata =
consul:
	acl = disabled
	known_servers = 5
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 100
	max_procs = 4
	os = linux
	version = go1.20.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 120
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 16066
	members = 61
	query_queue = 0
	query_time = 2
```
```
data_dir = "/opt/consul"
```
Server info
```
agent:
	check_monitors = 0
	check_ttls = 5
	checks = 9
	services = 9
build:
	prerelease =
	revision = 192df66a
	version = 1.16.0
	version_metadata =
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.0.1:8300
	server = true
raft:
	applied_index = 3628061
	commit_index = 3628061
	fsm_pending = 0
	last_contact = 0
	last_log_index = 3628061
	last_log_term = 22
	last_snapshot_index = 3623276
	last_snapshot_term = 22
	latest_configuration = [
```
Operating system and Environment details
Log Fragments