prometheus-community / elasticsearch_exporter

Elasticsearch stats exporter for Prometheus
Apache License 2.0

`elasticsearch_node_shards_total` dynamically adds metrics with new labels #663

Open Sebbo94BY opened 1 year ago

Sebbo94BY commented 1 year ago

Version 1.4.0 of this exporter introduced the new metric elasticsearch_node_shards_total, which can be enabled if required. It was added in https://github.com/prometheus-community/elasticsearch_exporter/pull/535.

I've enabled this in our Elasticsearch setup as we've built some monitoring alerts based on it:

--es.uri='http://127.0.0.1:9200' \
    --web.listen-address=':9112' \
    --es.shards \
    --es.indices_settings

When a node restarts or crashes and a shard is reallocated or moved, this causes the following Prometheus expression...

sum(elasticsearch_node_shards_total{hostname_short=~".*-01"}) by (node)

...to show something like this, for example:

metric | value
------------------
{node="elasticsearch-01-01"} | 297
{node="elasticsearch-01-02"} | 291
{node="elasticsearch-01-03"} | 298
{node="elasticsearch-01-04"} | 297
{node="elasticsearch-01-05"} | 297
{node="elasticsearch-01-06"} | 298
{node="elasticsearch-01-06 -> 192.168.2.13 WGCtl2PHSTG-NVXziiUETQ elasticsearch-01-09"} | 1
{node="elasticsearch-01-07"} | 101
{node="elasticsearch-01-08"} | 99
{node="elasticsearch-01-09"} | 107

This dynamically creates new metrics (time series) with a unique label value such as node="elasticsearch-01-06 -> 192.168.2.13 WGCtl2PHSTG-NVXziiUETQ elasticsearch-01-09".

If the cluster has to reallocate a lot of shards for whatever reason, this will result in a lot of new (temporary) metrics, which could lead to a metric/label explosion in Prometheus.

It would be great if these reallocating-shard metrics could be turned off, or had to be explicitly enabled, so that they can be avoided entirely.
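As an interim, query-side workaround (a sketch only, not a fix in the exporter), the relocation entries could be excluded with a negative regex matcher on the node label. This assumes that every relocating entry contains the " -> " separator seen in the output above. It only hides those series at query time; Prometheus still ingests them, so the cardinality concern remains.

```promql
# Same aggregation as above, but skip node values that describe a relocation
# ("<source> -> <ip> <id> <target>"). The " -> " pattern is an assumption
# based on the example output in this issue.
sum(elasticsearch_node_shards_total{hostname_short=~".*-01", node!~".* -> .*"}) by (node)
```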

kvadaliyasalesforce commented 10 months ago

Can we also have index information embedded in this metric?

evheniyt commented 10 months ago

Also, I can see that the cluster label is missing for this metric.