prometheus-community / elasticsearch_exporter

Elasticsearch stats exporter for Prometheus
Apache License 2.0

Cluster red state results in broken exporter #301

Closed a-nldisr closed 5 years ago

a-nldisr commented 5 years ago

We lost an Elasticsearch cluster in our acceptance environment during the AWS outage in Frankfurt availability zone C last Tuesday. We had a couple of alerts set up to monitor clusters for red state, but the exporter had issues and we did not receive an alert until after the dust had settled. We found that the exporter broke when the cluster turned red because we had lost all of our data nodes.

These are the logs from the exporter:

level=warn ts=2019-11-15T09:43:44.572809589Z caller=cluster_health.go:258 msg="failed to fetch and decode cluster health" err="json: cannot unmarshal string into Go struct field clusterHealthResponse.active_shards_percent_as_number of type float64"
level=warn ts=2019-11-15T09:43:44.573312711Z caller=indices.go:409 msg="failed to fetch and decode index stats" err="HTTP Request failed with code 503"

exporter version: 1.0.2

How to replicate:

When I query this exporter I still get metrics from the masters and one data node.

Some metrics are no longer available: elasticsearch_cluster_health_number_of_nodes, elasticsearch_indices_docs_primary, elasticsearch_cluster_health_status.

In our setup we use a Mesos / DC/OS cluster, in which we perform an HTTP check against all of our Prometheus exporters. We check the /metrics path for a 2xx return code. The /metrics path still returns 200 in this situation. Could this be a scenario where the return code changes?

zwopir commented 5 years ago

Hey @a-nldisr,

I'm sorry you had issues with AWS ES and the exporter. However, this problem can't be solved in the exporter itself. The logs complain about a 503 response code. This is the response code returned from the exporter-internal call to the ES API. If the API doesn't return valid JSON, there's no way of reading the cluster status (green, yellow, red) from it. In such a case the exporter will set the metrics elasticsearch_node_stats_up and/or elasticsearch_cluster_health_up to 0. Please note that Prometheus will continue to return the last known value of the metrics for at least some time (I don't know exactly how long; you can read about it if you dig into the Prometheus staleness handling). So my suggestion would be to at least also alert on those _up metrics.
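To make the failure mode concrete, here is a minimal Go sketch of how a decode error like the one in the first log line can turn into an "up" value of 0. It is an illustration only, not the exporter's actual code: the `healthResponse` type, the `decodeHealth` function and both sample payloads are hypothetical (the issue does not show the real response body), and the only thing taken from the logs is the field name `active_shards_percent_as_number`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// healthResponse is a hypothetical stand-in that models only the field
// named in the logged error message.
type healthResponse struct {
	ActiveShardsPercent float64 `json:"active_shards_percent_as_number"`
}

// decodeHealth decodes a cluster health body and returns an "up" value:
// 1 if the body decoded cleanly, 0 if it did not.
func decodeHealth(body []byte) (healthResponse, float64, error) {
	var resp healthResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return healthResponse{}, 0, err
	}
	return resp, 1, nil
}

func main() {
	// Hypothetical payloads: the first carries the field as a JSON number,
	// the second as a JSON string, which a float64 field cannot accept.
	good := []byte(`{"active_shards_percent_as_number": 100.0}`)
	bad := []byte(`{"active_shards_percent_as_number": "100.0"}`)

	for _, body := range [][]byte{good, bad} {
		resp, up, err := decodeHealth(body)
		var typeErr *json.UnmarshalTypeError
		if errors.As(err, &typeErr) {
			fmt.Printf("up=%v, type mismatch on field %q\n", up, typeErr.Field)
			continue
		}
		fmt.Printf("up=%v, active shards percent=%v\n", up, resp.ActiveShardsPercent)
	}
}
```

Once decoding fails there is no trustworthy status value left to export, which is why the `_up` metric is the signal to alert on.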

I know it is somewhat unsatisfying that you can't alert on a critical cluster state in such a scenario. But it's part of the problem: if you can't query the status, you must assume the worst :/

The exporter itself never returns an HTTP status code other than 200 on the /metrics path.
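Since the status code stays 200, a health check would have to look at the response body instead. The Go sketch below shows one way that could work, assuming the body is in the Prometheus text format: it scans for the two `_up` metrics named above and reports any whose value is 0. The `failedUpMetrics` function and the sample body are hypothetical, and the parsing is deliberately simplified rather than a full text-format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// failedUpMetrics scans a Prometheus text-format body and returns the names
// of the watched metrics that have at least one sample with value 0.
func failedUpMetrics(body string, watched []string) ([]string, error) {
	var failed []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		for _, name := range watched {
			if !strings.HasPrefix(line, name) {
				continue
			}
			rest := line[len(name):]
			// The name must be followed by labels or a space, otherwise
			// this is a different metric that shares the prefix.
			if rest == "" || (rest[0] != ' ' && rest[0] != '{') {
				continue
			}
			if rest[0] == '{' {
				end := strings.LastIndex(rest, "}")
				if end < 0 {
					return nil, fmt.Errorf("unterminated labels: %q", line)
				}
				rest = rest[end+1:]
			}
			fields := strings.Fields(rest)
			if len(fields) == 0 {
				return nil, fmt.Errorf("missing value: %q", line)
			}
			value, err := strconv.ParseFloat(fields[0], 64)
			if err != nil {
				return nil, fmt.Errorf("bad value in %q: %w", line, err)
			}
			if value == 0 {
				failed = append(failed, name)
			}
		}
	}
	return failed, sc.Err()
}

func main() {
	// Hypothetical /metrics excerpt for a cluster whose health call failed.
	body := "elasticsearch_cluster_health_up 0\nelasticsearch_node_stats_up 1\n"
	watched := []string{"elasticsearch_cluster_health_up", "elasticsearch_node_stats_up"}

	failed, err := failedUpMetrics(body, watched)
	if err != nil {
		fmt.Println("check error:", err)
		return
	}
	fmt.Println("metrics reporting down:", failed)
}
```

A check built this way should probably also treat a missing `_up` metric as a failure, in line with the "assume the worst" advice above.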

a-nldisr commented 5 years ago

Hey @zwopir, clear.

I expected at least one of elasticsearch_cluster_health_number_of_nodes or elasticsearch_cluster_health_status to keep reporting when everything fails. You're telling me these depend on a successful API call and that I should use other metrics.

I also see that I could have prevented this by looking at https://github.com/justwatchcom/elasticsearch_exporter/blob/master/examples/prometheus/elasticsearch.rules

So I'll close this ticket.

Thanks for the reply! :)