prometheus-community / elasticsearch_exporter

Elasticsearch stats exporter for Prometheus
Apache License 2.0

Opensearch cluster node health wrong reporting #612

Open ervikrant06 opened 2 years ago

ervikrant06 commented 2 years ago

We recently started using OpenSearch 2.1.0 in our environment as a replacement for Open Distro. AFAIU elasticsearch_exporter is not specific to an ES version, so it should work with OpenSearch without any issue.

Broadly, we are facing two issues.

1) Prometheus intermittently fails to scrape metrics from the ES nodes. We run the exporter on each ES node. We first hit this on version 1.3; updating to version 1.5 did not help much.

prometh+ 11495  0.1  0.2 714708 17036 ?        Ssl  08:28   0:01 /usr/bin/elasticsearch_exporter --es.uri=https://prometheus:prometheus@localhost:9200 --es.indices --es.ssl-skip-verify --web.listen-address=0.0.0.0:9108

elasticsearch_cluster_health_up is reported as 0 on a few nodes in the cluster (sometimes just one) when checking https://NODE_URL:9108/metrics. At the same time, the other nodes report 1. Checking cluster and node health through the ES API shows everything in a healthy state. Could it be an operator issue?

2) The exporter keeps logging the messages below with OpenSearch; with Open Distro this was never an issue. The 500 indicates a server error, but the ES API itself works fine. The 403 is a permissions error; after seeing it, we mapped the prometheus user to the monitoring role, but the error keeps occurring.

Aug 10 08:42:48 elasticsearch-master-1 elasticsearch_exporter[11495]: level=warn ts=2022-08-10T08:42:48.930801946Z caller=cluster_health.go:286 msg="failed to fetch and decode cluster health" err="HTTP Request failed with code 500"
Aug 10 08:43:10 elasticsearch-master-1 elasticsearch_exporter[11495]: level=warn ts=2022-08-10T08:43:10.96161973Z caller=cluster_health.go:286 msg="failed to fetch and decode cluster health" err="HTTP Request failed with code 500"

Aug 10 08:52:18 elasticsearch-master-1 elasticsearch_exporter[11495]: level=warn ts=2022-08-10T08:52:18.928680081Z caller=indices.go:1207 msg="failed to fetch and decode index stats" err="HTTP Request failed with code 403"
Aug 10 08:52:40 elasticsearch-master-1 elasticsearch_exporter[11495]: level=error ts=2022-08-10T08:52:40.919025987Z caller=indices.go:1126 err="HTTP Request failed with code 403"
Aug 10 08:52:40 elasticsearch-master-1 elasticsearch_exporter[11495]: level=warn ts=2022-08-10T08:52:40.919084396Z caller=indices.go:1207 msg="failed to fetch and decode index stats" err="HTTP Request failed with code 403"
Aug 10 08:52:48 elasticsearch-master-1 elasticsearch_exporter[11495]: level=error ts=2022-08-10T08:52:48.932423852Z caller=indices.go:1126 err="HTTP Request failed with code 403"
Aug 10 08:52:48 elasticsearch-master-1 elasticsearch_exporter[11495]: level=warn ts=2022-08-10T08:52:48.932494568Z caller=indices.go:1207 msg="failed to fetch and decode index stats" err="HTTP Request failed with code 403"
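
To see which collector fails with which status code, and how often, the journal lines above can be tallied. This is only a rough sketch: `count_failures` is a hypothetical helper, and it assumes every line of interest carries both a `caller=` field and the `err="HTTP Request failed with code N"` text shown in the logs above.

```python
import re
from collections import Counter

# Patterns for the two fields used in the exporter log lines quoted above.
CALLER_RE = re.compile(r"caller=(\S+)")
CODE_RE = re.compile(r'err="HTTP Request failed with code (\d+)"')


def count_failures(lines):
    """Return a Counter keyed by (source file, HTTP status code)."""
    counts = Counter()
    for line in lines:
        caller = CALLER_RE.search(line)
        code = CODE_RE.search(line)
        if caller and code:
            source_file = caller.group(1).split(":")[0]  # drop the line number
            counts[(source_file, int(code.group(1)))] += 1
    return counts


# Three lines copied from the log excerpt above (journald prefix removed).
sample = [
    'level=warn ts=2022-08-10T08:42:48.930801946Z caller=cluster_health.go:286 msg="failed to fetch and decode cluster health" err="HTTP Request failed with code 500"',
    'level=error ts=2022-08-10T08:52:40.919025987Z caller=indices.go:1126 err="HTTP Request failed with code 403"',
    'level=warn ts=2022-08-10T08:52:40.919084396Z caller=indices.go:1207 msg="failed to fetch and decode index stats" err="HTTP Request failed with code 403"',
]

for key, n in sorted(count_failures(sample).items()):
    print(key, n)
```

Feeding it the full journal output for the exporter unit would show whether the 500s and 403s cluster around the same timestamps.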

These two issues seem to be related, but I couldn't find out why the exporter sometimes starts failing to decode cluster health.
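
One way to narrow this down is to poll the cluster directly with the same credentials the exporter uses and record the status codes, to see whether the 500 and 403 responses appear outside the exporter as well. The sketch below rests on assumptions: that the cluster health collector queries `/_cluster/health` and the indices collector queries `/_all/_stats` (not confirmed in this thread), and it reuses the `prometheus:prometheus` credentials and `localhost:9200` address from the command line above. `probe_status` and `tally` are hypothetical helper names.

```python
import base64
import ssl
import time
import urllib.error
import urllib.request
from collections import Counter


def probe_status(url, user, password, timeout=5.0):
    """Return the HTTP status code for one GET, or None on a transport error."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    # Mirrors --es.ssl-skip-verify: certificate checks are disabled.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, TLS failure, ...


def tally(codes):
    """Count how often each status code (or None) was observed."""
    return Counter(codes)


if __name__ == "__main__":
    # Endpoint paths are assumptions about what the exporter requests.
    endpoints = [
        "https://localhost:9200/_cluster/health",
        "https://localhost:9200/_all/_stats",
    ]
    for url in endpoints:
        codes = []
        for _ in range(30):
            codes.append(probe_status(url, "prometheus", "prometheus"))
            time.sleep(2)
        print(url, dict(tally(codes)))
```

If the direct probes return only 200 while the exporter still logs failures, the problem is more likely on the exporter side; a mix of 200 and 500 or 403 would point at the cluster or its security configuration.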

ervikrant06 commented 2 years ago

elasticsearch_clusterinfo_up reports 1 while elasticsearch_cluster_health_up reports 0. There are no JSON parse failures.

# HELP elasticsearch_cluster_health_json_parse_failures Number of errors while parsing JSON.
# TYPE elasticsearch_cluster_health_json_parse_failures counter
elasticsearch_cluster_health_json_parse_failures 0
# HELP elasticsearch_cluster_health_total_scrapes Current total ElasticSearch cluster health scrapes.
# TYPE elasticsearch_cluster_health_total_scrapes counter
elasticsearch_cluster_health_total_scrapes 44917
# HELP elasticsearch_cluster_health_up Was the last scrape of the ElasticSearch cluster health endpoint successful.
# TYPE elasticsearch_cluster_health_up gauge
elasticsearch_cluster_health_up 0
# HELP elasticsearch_clusterinfo_last_retrieval_failure_ts Timestamp of the last failed cluster info retrieval
# TYPE elasticsearch_clusterinfo_last_retrieval_failure_ts gauge
elasticsearch_clusterinfo_last_retrieval_failure_ts{url="https://localhost:9200"} 1.659449338e+09
# HELP elasticsearch_clusterinfo_last_retrieval_success_ts Timestamp of the last successful cluster info retrieval
# TYPE elasticsearch_clusterinfo_last_retrieval_success_ts gauge
elasticsearch_clusterinfo_last_retrieval_success_ts{url="https://localhost:9200"} 1.660122238e+09
# HELP elasticsearch_clusterinfo_up Up metric for the cluster info collector
# TYPE elasticsearch_clusterinfo_up gauge
elasticsearch_clusterinfo_up{url="https://localhost:9200"} 1
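
To compare the `_up` gauges across nodes without reading the full output by hand, the /metrics text can be parsed and filtered. This is a minimal sketch with hypothetical helper names (`parse_gauges`, `failing_collectors`); it assumes each sample line ends in its value with no trailing timestamp, as in the output above, and it discards labels.

```python
def parse_gauges(text):
    """Map each sample name (labels stripped) to its float value."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def failing_collectors(values):
    """Return the names of *_up gauges that currently report 0."""
    return sorted(n for n, v in values.items() if n.endswith("_up") and v == 0)


# A few sample lines copied from the metrics output above.
sample = """\
# TYPE elasticsearch_cluster_health_json_parse_failures counter
elasticsearch_cluster_health_json_parse_failures 0
# TYPE elasticsearch_cluster_health_up gauge
elasticsearch_cluster_health_up 0
elasticsearch_clusterinfo_up{url="https://localhost:9200"} 1
"""

print(failing_collectors(parse_gauges(sample)))
```

Running this against the output of each node's https://NODE_URL:9108/metrics would list, per node, which collectors failed on their last scrape.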