hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Best way to monitor cluster leader presence/absence #10733

Open Abhimanyu-Jana opened 3 years ago

Abhimanyu-Jana commented 3 years ago

Our monitoring for the cluster leader is based on the consul_raft_leader metric exported by the Prometheus consul_exporter:

https://github.com/prometheus/consul_exporter

As I understand it, the exporter determines the leader using the /v1/status/leader endpoint.

However, there have been many instances where queries to Consul (/v1/catalog/nodes or /v1/catalog/services) failed with HTTP 500 / "No cluster leader", even though /v1/status/leader reported a leader on each individual node.

Is consul_raft_leader or /v1/status/leader still a good way to monitor for leader presence/absence?
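For context, a minimal way to observe the discrepancy described above (a sketch; the local agent address and default HTTP port 8500 are assumptions):

```sh
# Ask the local agent which address it believes is the raft leader.
curl -s http://localhost:8500/v1/status/leader

# Compare with a query that needs a leader to be answered; on an affected
# cluster this returns HTTP 500 "No cluster leader" even while the call
# above still prints a leader address.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8500/v1/catalog/nodes
```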

jkirschner-hashicorp commented 3 years ago

Copying additional info here from your post on the linked repo: https://github.com/prometheus/consul_exporter/issues/208


What did you do? Set up monitoring for the presence/absence of a cluster leader using the consul_raft_leader metric.

What did you expect to see? When external queries to the Consul cluster fail with HTTP 500 or a "No cluster leader" error, we expect the consul_raft_leader value to change from 1 to 0.

What did you see instead? Under which circumstances? The consul_raft_leader value remains 1 despite obvious issues with cluster health. We can confirm this from logs showing the "No cluster leader" errors, as well as with the "consul operator raft list-peers" command.

Environment: Linux

consul_exporter version: 0.7.1

Consul version: Consul v1.8.3

Prometheus version: N/A

Prometheus configuration file: N/A

Logs:

Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
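One way to cross-check the exporter's view against the cluster itself (a sketch; the exporter's default listen port 9107 and the local agent address are assumptions):

```sh
# What the exporter currently reports for leader presence (1 = leader exists).
curl -s http://localhost:9107/metrics | grep '^consul_raft_leader'

# What the servers themselves report; this is the call that fails with
# "No cluster leader" in the log line above.
consul operator raft list-peers -http-addr=http://localhost:8500
```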
jkirschner-hashicorp commented 3 years ago

Hi @Abhimanyu-Jana,

This seems like a bug - there shouldn't be a discrepancy between the leader status endpoint and what the rest of the cluster thinks.

To help us explore this, can you provide us with some additional information?

Abhimanyu-Jana commented 3 years ago

Thank you for your response. We'll look into getting this info ASAP.

dnephin commented 3 years ago

I had a quick look into this. At first we thought it might have been fixed by #8408, but now I suspect that the underlying issue that prompted #8404 is the same as this one, and that change unfortunately probably does not fix it.

I wonder if this problem might have been fixed by #9487. That change was backported into Consul v1.8.8. Prior to that change, networking problems between a client and a server could cause "No cluster leader" errors for RPC requests even when raft still had a leader.

Does that seem like it might be the cause of the problem? Would upgrading to 1.8.8 be an option, to see if the errors change to "Raft leader not found in server lookup mapping"?
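A quick way to check which behaviour a cluster exhibits after upgrading (a sketch; the systemd unit name `consul` is an assumption about how the agents are run):

```sh
# Confirm the running version (the change referenced above shipped in 1.8.8).
consul version

# Watch server logs for the newer, more specific error message instead of the
# generic "No cluster leader".
journalctl -u consul -f | grep -E 'No cluster leader|Raft leader not found in server lookup mapping'
```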

nahsi commented 3 years ago

> Our monitoring for cluster leader is based on consul_raft_leader exported by the prometheus consul exporter https://github.com/prometheus/consul_exporter

Why do you need an outdated third-party exporter when Consul supports metrics in Prometheus format natively?
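For reference, a minimal sketch of using the built-in endpoint (this assumes the agent's telemetry stanza sets a non-zero prometheus_retention_time, which is required for the Prometheus format to be available):

```sh
# The agent serves Prometheus-format metrics directly, no exporter needed.
curl -s 'http://localhost:8500/v1/agent/metrics?format=prometheus' | grep -i raft | head
```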

Abhimanyu-Jana commented 3 years ago

@nahsi because of the issue described in https://github.com/hashicorp/consul/issues/5140

However, you raise a valid point: in theory the native metrics are probably all that's needed for monitoring, without having to use the exporter.

Can you confirm whether the behaviour described in #5140 was addressed in later versions?

Abhimanyu-Jana commented 3 years ago

By the way, I was able to reproduce this the other day on a broken cluster:

```
[consul_node1] $ /usr/sbin/consul operator raft list-peers -http-addr=http://$(hostname):8500
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

[consul_node1] $ curl -s http://$(hostname):8500/v1/status/leader | jq
":8300"
```
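To see whether individual servers disagree, something like the following can help (a sketch; the node names consul_node1..consul_node3 are placeholders, not taken from the original report):

```sh
# Ask every server for its view of the leader; during this incident each node
# can report a leader here even though leader-dependent RPCs fail with 500.
for node in consul_node1 consul_node2 consul_node3; do
  printf '%s: ' "$node"
  curl -s "http://$node:8500/v1/status/leader"
  echo
done
```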

jkirschner-hashicorp commented 3 years ago

@Abhimanyu-Jana: #5140 was resolved by PR #9198 (in Nov 2020).