prometheus-community / elasticsearch_exporter

Elasticsearch stats exporter for Prometheus
Apache License 2.0

elasticsearchv8 cluster_settings unmarshal error #840

Open Jjungs7 opened 10 months ago

Jjungs7 commented 10 months ago

Related issue: #509

I was testing elasticsearch_exporter against elasticsearch-8.x and hit an unmarshal error when the --collector.clustersettings flag is enabled.

  1. Environment
    • elasticsearch_exporter: v1.7.0
    • elasticsearch: 8.11.3 (and all versions that include the new setting "cluster.routing.allocation.disk.watermark.low.max_headroom")
    • golang: 1.21.3
  2. Steps to reproduce
    
    # Prepare elasticsearch_exporter
    go build .
    ./elasticsearch_exporter --es.uri http://localhost:9200 --log.level debug --collector.clustersettings --es.all --es.indices_settings --es.shards

    # Run an Elasticsearch v8 instance
    docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" -e"xpack.security.enabled=false" elasticsearch:8.11.3

    # Change the default settings. This removes the "cluster.routing.allocation.disk.watermark.low" field from the defaults section
    curl -XPUT -H"Content-Type: application/json" http://localhost:9200/_cluster/settings -d'{"transient":{"cluster.routing.allocation.disk.watermark.low": "90%"}}'

3. Expected result: "elasticsearch_clustersettings_allocation_watermark_low_ratio" metric in /metrics

# HELP elasticsearch_clustersettings_allocation_watermark_low_ratio Low watermark for disk usage as a ratio.
# TYPE elasticsearch_clustersettings_allocation_watermark_low_ratio gauge
elasticsearch_clustersettings_allocation_watermark_low_ratio 0.9

4. Actual result: an unmarshal error appears in the elasticsearch_exporter logs, and the expected metric is not exported in /metrics

level=error ts=2023-12-25T09:56:11.492789Z caller=collector.go:189 msg="collector failed" name=clustersettings duration_seconds=0.015642208 err="json: cannot unmarshal object into Go struct field clusterSettingsWatermark.defaults.cluster.routing.allocation.disk.watermark.low of type string"

5. Possible solution: apply `flat_settings: true` when invoking "GET _cluster/settings" and refactor the struct to match the flat_settings format

Example: 
https://github.com/prometheus-community/elasticsearch_exporter/blob/b24d0ace72603cbcb99d83bf97b532a43889fcac/collector/cluster_settings.go#L113-L147

// clusterSettingsResponse is a representation of a Elasticsearch Cluster Settings
type clusterSettingsResponse struct {
	Defaults   clusterSettingsSection `json:"defaults"`
	Persistent clusterSettingsSection `json:"persistent"`
	Transient  clusterSettingsSection `json:"transient"`
}

// clusterSettingsSection is a representation of a Elasticsearch Cluster Settings
type clusterSettingsSection struct {
	ClusterMaxShardsPerNode                          string `json:"cluster.max_shards_per_node"`
	ClusterRoutingAllocationBalanceDiskUsage         string `json:"cluster.routing.allocation.balance.disk_usage"`
	ClusterRoutingAllocationBalanceIndex             string `json:"cluster.routing.allocation.balance.index"`
	ClusterRoutingAllocationBalanceShard             string `json:"cluster.routing.allocation.balance.shard"`
	ClusterRoutingAllocationBalanceThreshold         string `json:"cluster.routing.allocation.balance.threshold"`
	ClusterRoutingAllocationBalanceWriteLoad         string `json:"cluster.routing.allocation.balance.write_load"`
	ClusterRoutingAllocationEnable                   string `json:"cluster.routing.allocation.enable"`
	ClusterRoutingAllocationDiskThresholdEnabled     string `json:"cluster.routing.allocation.disk.threshold_enabled"`
	ClusterRoutingAllocationDiskWatermarkFloodStage  string `json:"cluster.routing.allocation.disk.watermark.flood_stage"`
	ClusterRoutingAllocationDiskWatermarkHigh        string `json:"cluster.routing.allocation.disk.watermark.high"`
	ClusterRoutingAllocationDiskWatermarkLow         string `json:"cluster.routing.allocation.disk.watermark.low"`
}


https://github.com/prometheus-community/elasticsearch_exporter/blob/b24d0ace72603cbcb99d83bf97b532a43889fcac/collector/cluster_settings.go#L152

...
u := c.u.ResolveReference(&url.URL{Path: "_cluster/settings"})
q := u.Query()
q.Set("flat_settings", "true")
q.Set("include_defaults", "true")
...
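
To illustrate why the flat format sidesteps the error, here is a minimal sketch, not the exporter's actual code. It assumes a trimmed-down struct (`flatSection` and `flatResponse` are hypothetical names) and a hand-written sample body in the shape `flat_settings=true` is expected to produce. Because `...watermark.low` and `...watermark.low.max_headroom` are separate top-level keys in that shape, the string field never has to absorb an object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatSection is a hypothetical, trimmed-down section struct: with
// flat_settings=true each setting is a single dotted key holding a string.
type flatSection struct {
	WatermarkLow string `json:"cluster.routing.allocation.disk.watermark.low"`
}

// flatResponse is a hypothetical wrapper for the three settings sections.
type flatResponse struct {
	Defaults   flatSection `json:"defaults"`
	Persistent flatSection `json:"persistent"`
	Transient  flatSection `json:"transient"`
}

// decodeFlat decodes a flat-format cluster settings body.
func decodeFlat(data []byte) (flatResponse, error) {
	var r flatResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Hand-written sample (an assumption, not captured from a real cluster).
	body := []byte(`{
	  "persistent": {},
	  "transient": {"cluster.routing.allocation.disk.watermark.low": "90%"},
	  "defaults": {"cluster.routing.allocation.disk.watermark.low.max_headroom": "-1"}
	}`)

	r, err := decodeFlat(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// The unknown ".max_headroom" key is ignored; it does not clash with ".low".
	fmt.Printf("transient low=%q defaults low=%q\n", r.Transient.WatermarkLow, r.Defaults.WatermarkLow)
}
```

Unknown keys are silently ignored by encoding/json, so new sibling settings added by later Elasticsearch versions would not break this shape.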

Skunnyk commented 9 months ago

I can reproduce the problem on elasticsearch_exporter 1.7.0, with both ES 8.12.0 and ES 7.17.

Setting only cluster.routing.allocation.disk.watermark.low triggers the unmarshal error on one cluster, but the same setting is fine on another cluster, which fails on cluster.routing.allocation.disk.watermark.flood_stage instead :thinking:

sysadmind commented 8 months ago

It would be really helpful to have an example of the API response from Elasticsearch to use in our tests. cluster_settings_test.go has tests that cover the cluster settings endpoint, including the watermark metrics. From what I understand, based on the provided error message, the problem is that a nested JSON object is now being returned for the watermark settings instead of a string.
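
The failure mode described by the error message can be reproduced in isolation. The sketch below uses a simplified, hypothetical struct (`watermark`), not the exporter's real `clusterSettingsWatermark` type, and two hand-written payloads: one where `low` is a string and one where it is an object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// watermark is a simplified, hypothetical stand-in for a struct that
// expects each watermark level to be a plain string.
type watermark struct {
	Low string `json:"low"`
}

// decodeWatermark decodes a "watermark" JSON object into the struct.
func decodeWatermark(data []byte) (watermark, error) {
	var w watermark
	err := json.Unmarshal(data, &w)
	return w, err
}

func main() {
	// "low" as a plain string decodes without error.
	_, err := decodeWatermark([]byte(`{"low": "90%"}`))
	fmt.Println("string value:", err)

	// "low" as an object cannot be stored in a string field,
	// so encoding/json reports a type error.
	_, err = decodeWatermark([]byte(`{"low": {"max_headroom": "-1"}}`))
	fmt.Println("object value:", err)
}
```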

This could be a scenario where we need to use different structs based on elasticsearch version or customize the unmarshal.
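
As one possible shape for the "customize the unmarshal" option, here is a rough sketch. `stringOrObject`, `watermarkSettings`, and `decodeTolerant` are hypothetical names that do not exist in the exporter. The assumption is that an object in place of a string means the setting itself is unset and only child keys are present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stringOrObject is a hypothetical helper type: it keeps the value when the
// JSON is a string and stays empty when the JSON is an object.
type stringOrObject string

// UnmarshalJSON accepts either a JSON string or a JSON object.
func (s *stringOrObject) UnmarshalJSON(data []byte) error {
	var str string
	if err := json.Unmarshal(data, &str); err == nil {
		*s = stringOrObject(str)
		return nil
	}
	// Not a string: accept any JSON object and treat the setting as unset.
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(data, &obj); err != nil {
		return err
	}
	*s = ""
	return nil
}

// watermarkSettings is a hypothetical struct using the tolerant type.
type watermarkSettings struct {
	Low  stringOrObject `json:"low"`
	High stringOrObject `json:"high"`
}

// decodeTolerant decodes a "watermark" object with mixed value shapes.
func decodeTolerant(data []byte) (watermarkSettings, error) {
	var w watermarkSettings
	err := json.Unmarshal(data, &w)
	return w, err
}

func main() {
	w, err := decodeTolerant([]byte(`{"low": {"max_headroom": "-1"}, "high": "93%"}`))
	fmt.Printf("low=%q high=%q err=%v\n", w.Low, w.High, err)
}
```

This keeps a single struct for all Elasticsearch versions, at the cost of silently dropping the nested child values.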

Skunnyk commented 8 months ago

Hello! Here is an extract from a simple _cluster/settings request on an ES 7.17 cluster where I have the problem:

{                        
  "persistent" : {                     
    "cluster" : {
      "routing" : {
        "allocation" : {
          "disk" : {                  
            "watermark" : {               
              "low" : "88%",
              "flood_stage" : "100%",
              "high" : "93%"
            }                
          }
        }
      }
    }
  }
}

AFAIK, the format/hierarchy is the same in ES 8 and hasn't changed for ages :thinking:

Skunnyk commented 8 months ago

Ooooh, by querying it with include_defaults=true (like the exporter does), we can see the following in the defaults section:

ES 7.17:

          "disk" : {
            "threshold_enabled" : "true",
            "watermark" : {
              "enable_for_single_data_node" : "false",
              "flood_stage" : {
                "frozen" : "95%",
                "frozen.max_headroom" : "20GB"
              }
            },
            "include_relocations" : "true",
            "reroute_interval" : "60s"

ES 8.12:

          "disk" : {
            "threshold_enabled" : "true",
            "reroute_interval" : "60s",
            "watermark" : {
              "flood_stage" : {
                "frozen" : "95%",
                "frozen.max_headroom" : "20GB",
                "max_headroom" : "-1"
              },
              "high" : {
                "max_headroom" : "-1"
              },
              "low" : {
                "max_headroom" : "-1"
              },
              "enable_for_single_data_node" : "true"
            }
          },

So we can have an entry in persistent named cluster.routing.allocation.disk.watermark.low, and an entry in defaults named cluster.routing.allocation.disk.watermark.low.max_headroom. Could the problem come from here?
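
To make the key relationship concrete, the following sketch flattens a nested settings fragment into dotted paths. `flatten` and `flattenJSON` are hypothetical helpers written for illustration, and the input is trimmed from the ES 8.12 defaults fragment quoted above. In the nested form, `low` has to be an object to hold `max_headroom`, so no string value for `watermark.low` exists in that section.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten walks a nested settings object and records every string leaf
// under its dotted path.
func flatten(prefix string, node map[string]interface{}, out map[string]string) {
	for key, value := range node {
		path := key
		if prefix != "" {
			path = prefix + "." + key
		}
		switch v := value.(type) {
		case map[string]interface{}:
			flatten(path, v, out) // descend into nested objects
		case string:
			out[path] = v // record leaf values
		}
	}
}

// flattenJSON decodes a nested JSON object and returns its flattened form.
func flattenJSON(data []byte) (map[string]string, error) {
	var root map[string]interface{}
	if err := json.Unmarshal(data, &root); err != nil {
		return nil, err
	}
	out := make(map[string]string)
	flatten("", root, out)
	return out, nil
}

func main() {
	// Trimmed from the ES 8.12 defaults fragment in this thread.
	defaults := []byte(`{"watermark": {"low": {"max_headroom": "-1"}, "enable_for_single_data_node": "true"}}`)

	flat, err := flattenJSON(defaults)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	keys := make([]string, 0, len(flat))
	for k := range flat {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, flat[k])
	}
}
```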

neilschelly commented 4 months ago

I suspect this is the same bug here, but we are seeing it with another metric in the clustersettings collector:

json: cannot unmarshal object into Go struct field clusterSettingsWatermark.defaults.cluster.routing.allocation.disk.watermark.flood_stage of type string

We are on ES 7.17.6 with exporter version 1.7.0. This is the JSON response from /_cluster/settings:

{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "disk" : {
            "watermark" : {
              "low" : "92%",
              "flood_stage" : "97%",
              "high" : "95%"
            }
          }
        }
      },
      "max_shards_per_node" : "2000"
    },
    "xpack" : {
      "monitoring" : {
        "collection" : {
          "enabled" : "true"
        }
      }
    }
  },
  "transient" : { }
}