prometheus-community / elasticsearch_exporter

Elasticsearch stats exporter for Prometheus
Apache License 2.0

[helm][es-exporter] Metrics are delayed with a single node ES cluster with low Prometheus scrape interval #466

Open LaurentDumont opened 3 years ago

LaurentDumont commented 3 years ago

Hey everyone,

I'm trying to leverage the exporter to get some "real-time" statistics regarding our ELK cluster.

It's a single node and the usage is pretty low, so I know that it's quick to answer (looking at the request time in the Dev Tools for /_nodes/stats and /_all/_stats shows that each takes about 100 ms to answer).

I've set up Prometheus to scrape the exporter every 10 seconds, thinking that it would let the metrics refresh in between scrapes and give me enough granularity.

But I can see that the retrieval metric is still around 2 minutes, using elasticsearch_clusterinfo_last_retrieval_success_ts (if it means what I think it does).

image

Prometheus config

- job_name: elasticsearch-exporter
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  static_configs:
  - targets:
    - URL_HERE:9108

I can see that Prometheus does scrape the exporter every 10 seconds in the Targets page of the server.

But if I look at the stats actually collected, it can take up to 20 minutes for the data to update (for metrics that I know should move faster).

This is using elasticsearch_indices_docs_total for an index that I know is actively receiving docs.

image

Any ideas?

sysadmind commented 3 years ago

It's hard to tell exactly what the second graph is showing because the time axis is not shown. I need to take a look at the docs_total metric and how Elasticsearch handles it internally; the delay could be related to a refresh (the Elasticsearch term), but I'm not positive. The elasticsearch_clusterinfo_last_retrieval_success_ts metric actually records the timestamp of the most recent success, so it will go up forever (as time always moves forward). That metric actually looks correct to me. The cluster info is not updated on every scrape. See the --es.clusterinfo.interval flag on the executable.
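To make the point about the timestamp metric concrete, here is a minimal sketch of how its value could be turned into an "age". It assumes the metric value is a Unix timestamp in seconds, which I have not verified against the exporter source; the function name is made up for illustration.

```python
import time


def retrieval_age_seconds(last_success_ts, now=None):
    """Return how many seconds ago the last successful retrieval happened.

    Assumption: last_success_ts is a Unix timestamp in seconds, as read from
    elasticsearch_clusterinfo_last_retrieval_success_ts.
    """
    if now is None:
        now = time.time()  # current Unix time in seconds
    return now - last_success_ts


# Example with fixed numbers: a retrieval that succeeded 120 seconds ago.
print(retrieval_age_seconds(1_000.0, now=1_120.0))
```

The raw metric keeps climbing, but the age should stay bounded by roughly the cluster info interval. In PromQL, the same idea would be `time() - elasticsearch_clusterinfo_last_retrieval_success_ts`.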

LaurentDumont commented 3 years ago

Ah, got it for clusterinfo. I do see the update logged by the exporter itself.

That said, looking at other metrics, there always seems to be an interval between increases that I cannot match to the scrape interval. You can ignore the small bump at 15:20; this was us restarting the exporter.

Assuming the flow is HttpRequest to /metrics for exporter --> Exporter --> ElasticSearch --> Fetch current counters from /_nodes/stats and /_all/_stats, I can't explain the slow updates of the metrics in Prometheus.

image

I guess it's possible that there is some hidden aggregation done inside Elasticsearch. But if the output is parsed from /_nodes/stats and /_all/_stats, I can see it clearly change in the ELK Dev Tools tab.

image

LaurentDumont commented 3 years ago

So there could also be something else in play. Looking at another metric + switching to a 1 minute interval, I can clearly see it updates every 60 seconds.

image
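For anyone trying to reproduce this, one way to measure the effective update cadence instead of eyeballing graphs is to look at the gaps between value changes in the scraped samples. This is only a rough sketch: the `change_intervals` helper and the sample data are hypothetical, and the samples would have to be exported from Prometheus by hand.

```python
def change_intervals(samples):
    """Return the gaps, in seconds, between consecutive value changes.

    samples: list of (timestamp_seconds, value) tuples sorted by time,
    e.g. one tuple per Prometheus scrape of a single series.
    """
    change_times = []
    previous = None
    for ts, value in samples:
        # Record the scrape time at which the value differs from the last one.
        if previous is not None and value != previous:
            change_times.append(ts)
        previous = value
    return [later - earlier for earlier, later in zip(change_times, change_times[1:])]


# Made-up data: scraped every 10 seconds, but the value only moves every 60.
samples = [(t, 100 + t // 60) for t in range(0, 190, 10)]
print(change_intervals(samples))
```

If the gaps cluster around a fixed value that is larger than the scrape interval, the delay is coming from somewhere upstream of Prometheus (the exporter or Elasticsearch itself) rather than from the scrape configuration.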