Closed a-nldisr closed 5 years ago
Hey @a-nldisr,
I'm sorry you had issues with AWS ES and the exporter. However, this problem can't be solved in the exporter itself. The logs complain about a 503 response code. This is the response code returned from the exporter's internal call to the ES API. If the API doesn't return valid JSON, there's no way of reading the cluster status (green, yellow, red) from it.
In such a case the exporter will set the metrics elasticsearch_node_stats_up and/or elasticsearch_cluster_health_up to 0.
Please note that Prometheus will continue to export the last known state of the metrics for at least some time (I don't know exactly for how long; you can read about it if you dig into the Prometheus staleness handling). So my suggestion would be to at least also alert on those _up metrics.
I know it is somewhat unsatisfying that you can't alert on a critical cluster state in such a scenario. But it's part of the problem: if you can't query the status, you must assume the worst :/
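The "assume the worst" reasoning could be sketched roughly as below. This is only an illustration, not code from the exporter: the function names and the flat name-to-value mapping are hypothetical, and it assumes that elasticsearch_cluster_health_status is exposed as one series per color label with the active color set to 1.

```python
from typing import Mapping


def effective_cluster_state(samples: Mapping[str, float]) -> str:
    """Return 'green', 'yellow', 'red' or 'unknown' for one scrape.

    `samples` is a hypothetical flat view of the scrape: series name
    (including its color label, if any) mapped to the sample value.
    """
    # If the exporter could not reach the ES API, the status series
    # cannot be trusted (or may be missing), so the state is unknown.
    if samples.get("elasticsearch_cluster_health_up", 0.0) != 1.0:
        return "unknown"
    # Assumption: one series per color, the active one has value 1.
    for color in ("red", "yellow", "green"):
        key = f'elasticsearch_cluster_health_status{{color="{color}"}}'
        if samples.get(key, 0.0) == 1.0:
            return color
    return "unknown"


def should_page(samples: Mapping[str, float]) -> bool:
    """Treat an unknown state exactly like a red cluster."""
    return effective_cluster_state(samples) in ("red", "unknown")
```

With this logic a scrape where elasticsearch_cluster_health_up is 0 pages just like a red cluster, which is the behaviour an alert on the _up metrics is meant to give.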
The exporter itself never returns an HTTP status code other than 200 on the /metrics path.
Hey @zwopir, clear.
So I expected at least either elasticsearch_cluster_health_number_of_nodes or elasticsearch_cluster_health_status to report when all else fails; you're telling me this has a dependency and I should use other metrics.
I also see that I could have prevented this by looking at the example rules: https://github.com/justwatchcom/elasticsearch_exporter/blob/master/examples/prometheus/elasticsearch.rules
So I'll close this ticket.
Thanks for the reply! :)
We lost an Elasticsearch cluster in our Acceptance environment during the AWS outage in Frankfurt availability zone C last Tuesday. We had a couple of alerts set up to monitor clusters for red state; however, the exporter had some issues and we never received an alert until after all the dust had settled. We found out that this exporter broke when the cluster turned red because we lost all of our data nodes.
These are the logs from the exporter:
exporter version:
1.0.2
How to replicate:
When I query this exporter I still get metrics, from the masters and one data node.
Some of these metrics are no longer available:
elasticsearch_cluster_health_number_of_nodes
elasticsearch_indices_docs_primary
elasticsearch_cluster_health_status
In our setup we use a Mesos / DCOS cluster, in which we perform an HTTP check against all of our Prometheus exporters. We check the /metrics path for a 2xx return code. The return code of the /metrics path is still 200. Could this be a scenario where the return code changes?
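Since a 2xx check only shows that the exporter process is answering, one option is a check that also inspects the response body. The following is a minimal sketch of that idea, not something the exporter ships: the function names are hypothetical, the parsing is a simplified reading of the Prometheus text format (label values containing "}" are not handled), and it assumes both _up metrics are expected to be present, so a missing one counts as a failure.

```python
import re
import urllib.request
from typing import Dict, List

UP_METRICS = ("elasticsearch_cluster_health_up", "elasticsearch_node_stats_up")

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def failed_up_metrics(body: str) -> List[str]:
    """Return the _up metrics that are missing or not equal to 1."""
    seen: Dict[str, List[float]] = {}
    for line in body.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) in UP_METRICS:
            seen.setdefault(match.group(1), []).append(float(match.group(2)))
    failed = []
    for name in UP_METRICS:
        values = seen.get(name)
        # Assumption: a missing metric is treated like a failed one.
        if not values or any(value != 1.0 for value in values):
            failed.append(name)
    return failed


def check(url: str) -> int:
    """Fetch a /metrics URL; return 0 if healthy, 1 otherwise."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    failed = failed_up_metrics(body)
    for name in failed:
        print(f"unhealthy: {name} is missing or not 1")
    return 1 if failed else 0
```

Calling check() with the exporter's /metrics URL and using the return value as the exit code of the health-check command would make the check fail when the exporter cannot reach Elasticsearch, even though the HTTP status is still 200.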