Abhimanyu-Jana opened 3 years ago
Copying additional info here from your post on the linked repo: https://github.com/prometheus/consul_exporter/issues/208
What did you do? Set up monitoring for the presence/absence of a cluster leader using the consul_raft_leader metric.
What did you expect to see? When external queries to the Consul cluster fail with HTTP 500 or "No cluster leader" errors, we expect the consul_raft_leader value to change from 1 to 0.
What did you see instead? Under which circumstances? The consul_raft_leader value remains 1 despite obvious issues with cluster health. We can confirm this from logs showing the "No cluster leader" errors, as well as with the "consul operator raft list-peers" command.
Environment: Linux
consul_exporter version: 0.7.1
Consul version: v1.8.3
Prometheus version: N/A
Prometheus configuration file: N/A
Logs:
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
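For context, the exporter's view can be inspected directly from its metrics endpoint. A sketch follows; 9107 is consul_exporter's default listen port, and the address is an assumption to adjust for your deployment:
$ curl -s http://127.0.0.1:9107/metrics | grep '^consul_raft_leader'
consul_raft_leader 1
# per the report above, the value stays 1 even while RPC queries fail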
Hi @Abhimanyu-Jana,
This seems like a bug - there shouldn't be a discrepancy between the leader status endpoint and what the rest of the cluster thinks.
To help us explore this, can you provide us with some additional information?
1. The output of consul info from the client agent and the server agent when this condition occurs
2. Logs from running with -log-level=TRACE on the client and server, to capture the maximum log detail
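For example, gathering this might look like the following (a sketch; the HTTP address and config path are assumptions):
$ consul info -http-addr=http://127.0.0.1:8500               # run on both a client agent and a server agent
$ consul agent -config-dir=/etc/consul.d -log-level=TRACE    # restart the agent with TRACE-level logging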
Thank you for your response. We'll look into getting this info ASAP.
I had a quick look into this. At first we thought it might have been fixed by #8408, but I now suspect the underlying issue that prompted #8404 is more likely the same as this one, and that change unfortunately probably does not fix it.
I wonder if this problem might have been fixed by #9487. That change was backported into Consul v1.8.8. Prior to that change, networking problems between a client and a server could cause "No cluster leader" errors for RPC requests even when raft still had a leader.
Does that seem like it might be the cause of the problem? Would upgrading to 1.8.8 be an option, to see if the errors change to "Raft leader not found in server lookup mapping"?
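To see the discrepancy directly, one can compare the local leader status with a request that must reach the leader over RPC (a sketch; the address is an assumption):
$ curl -s http://127.0.0.1:8500/v1/status/leader
# answered from the agent's view of the cluster; may still return a leader address
$ curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8500/v1/operator/raft/configuration
# forwarded to the leader; returns 500 (No cluster leader) when the RPC path is broken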
Our monitoring for cluster leader is based on consul_raft_leader, exported by the Prometheus consul_exporter: https://github.com/prometheus/consul_exporter
Why do you need an outdated third-party exporter when Consul supports metrics in Prometheus format natively?
@nahsi because of the issue described in https://github.com/hashicorp/consul/issues/5140
However, you raise a valid point: in theory, these native metrics are probably all that's needed for monitoring, without having to use the exporter.
Can you confirm if this behaviour described in #5140 was addressed in later versions?
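For reference, Consul's built-in Prometheus endpoint can be checked like this (a sketch; it requires prometheus_retention_time to be set in the agent's telemetry config, and the address is an assumption):
$ curl -s 'http://127.0.0.1:8500/v1/agent/metrics?format=prometheus' | grep -i raft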
BTW, I was able to reproduce this the other day on a broken cluster:
[consul_node1] $ /usr/sbin/consul operator raft list-peers -http-addr=http://$(hostname):8500
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
[consul_node1] $ curl -s http://$(hostname):8500/v1/status/leader | jq
"
@Abhimanyu-Jana : #5140 was resolved by PR #9198 (in Nov 2020).
Our monitoring for cluster leader is based on consul_raft_leader, exported by the Prometheus consul_exporter:
https://github.com/prometheus/consul_exporter
As I understand it, the exporter gets the leader using the /v1/status/leader endpoint.
However, there were many instances where queries to Consul (/v1/catalog/nodes or /v1/catalog/services) were failing with HTTP 500 / "No cluster leader", despite /v1/status/leader reporting a leader on each individual node.
Is consul_raft_leader or /v1/status/leader still a good way to monitor for leader presence/absence?
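One interim option (a sketch, not an official recommendation; the address matches the reproduction above) is to probe an endpoint that actually requires a leader RPC, such as the catalog:
$ curl -sf http://$(hostname):8500/v1/catalog/nodes > /dev/null && echo "catalog RPC ok" || echo "catalog RPC failed (possible leader loss)"
# curl -f exits non-zero on HTTP errors such as the 500 (No cluster leader) response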