prometheus-community / node-exporter-textfile-collector-scripts

Scripts for node-exporter's textfile collector
Apache License 2.0

nvme_metrics.sh invalidly quotes numbers. #130

Status: Closed (closed by isomer 1 year ago)

isomer commented 1 year ago

On my system (Debian sid), jq outputs some numbers in a quoted format, e.g.:

# HELP nvme_host_write_commands_total SMART metric host_write_commands_total
# TYPE nvme_host_write_commands_total counter
nvme_host_write_commands_total{device="nvme0n1"} "432007"

This causes errors in the journal:

Dec 03 20:28:25 windy prometheus-node-exporter[771]: ts=2022-12-03T20:28:25.693Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=nvme.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/nvme.prom\": text format parsing error in line 12: expected float as value, got \"\\\"1\\\"\""

Patching the script to call "jq -r" (raw output) resolves this problem.

The cause appears to be that nvme smart-log -o json, for some reason, emits some of the numbers as quoted strings.
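
A minimal sketch of the difference, assuming jq is installed. The sample JSON is a hypothetical, cut-down imitation of the smart-log output (one quoted counter, one plain number), not real device data:

```shell
#!/bin/sh
# Hypothetical sample: one quoted counter and one unquoted number.
sample='{"host_write_commands":"432007","critical_warning":0}'

# Default jq output: a JSON string keeps its surrounding quotes.
quoted=$(printf '%s' "$sample" | jq '.host_write_commands')

# Raw output (-r): strings are printed without quotes; numbers are unaffected.
raw=$(printf '%s' "$sample" | jq -r '.host_write_commands')
plain=$(printf '%s' "$sample" | jq -r '.critical_warning')

echo "without -r: nvme_host_write_commands_total{device=\"nvme0n1\"} $quoted"
echo "with -r:    nvme_host_write_commands_total{device=\"nvme0n1\"} $raw"
echo "plain number with -r: $plain"
```

Only the second metric line carries a bare number, which is what the textfile collector expects as a sample value.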

dswarbrick commented 1 year ago

That's a bit curious. On a system running Debian testing (i.e., bookworm pre-release), all numerical values are unquoted:

# nvme --version
nvme version 1.16
# nvme smart-log -o json /dev/nvme0n1
{
  "critical_warning" : 0,
  "temperature" : 310,
  "avail_spare" : 100,
  "spare_thresh" : 10,
  "percent_used" : 1,
  "endurance_grp_critical_warning_summary" : 0,
  "data_units_read" : 439253,
  "data_units_written" : 6715929,
  "host_read_commands" : 5670057,
  "host_write_commands" : 333564229,
  "controller_busy_time" : 2255,
  "power_cycles" : 74,
  "power_on_hours" : 30803,
  "unsafe_shutdowns" : 52,
  "media_errors" : 0,
  "num_err_log_entries" : 183,
  "warning_temp_time" : 0,
  "critical_comp_time" : 0,
  "thm_temp1_trans_count" : 0,
  "thm_temp2_trans_count" : 0,
  "thm_temp1_total_time" : 0,
  "thm_temp2_total_time" : 0
}

It appears that the version of nvme-cli in sid has changed this behaviour:

# nvme --version
nvme version 2.2.1 (git 2.2.1)
libnvme version 1.2 (git 1.2)
# nvme smart-log -o json /dev/nvme0n1
{
  "critical_warning":0,
  "temperature":310,
  "avail_spare":100,
  "spare_thresh":10,
  "percent_used":1,
  "endurance_grp_critical_warning_summary":0,
  "data_units_read":"439254",
  "data_units_written":"6716149",
  "host_read_commands":"5670075",
  "host_write_commands":"333566765",
  "controller_busy_time":"2255",
  "power_cycles":"74",
  "power_on_hours":"30803",
  "unsafe_shutdowns":"52",
  "media_errors":"0",
  "num_err_log_entries":"183",
  "warning_temp_time":0,
  "critical_comp_time":0,
  "thm_temp1_trans_count":0,
  "thm_temp2_trans_count":0,
  "thm_temp1_total_time":0,
  "thm_temp2_total_time":0
}

dswarbrick commented 1 year ago

Many of the NVMe counters are 128-bit. It appears that newer versions of nvme-cli preserve the full 128 bits, whereas previously the values were cast to double (i.e. float64). Since JSON numbers can't reliably carry 128-bit integers (most parsers store numbers as doubles or 64-bit integers), the counters get added to the JSON object as strings:

struct json_object *util_json_object_new_uint128(nvme_uint128_t  val)
{
    struct json_object *obj;
    obj = json_object_new_string(uint128_t_to_string(val));
    return obj;
}
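
As a rough illustration of why a cast to double cannot preserve such counters, here is a small sketch using awk, which stores numbers as doubles in the common implementations (gawk with -M would behave differently). The value 2^64 + 1 is chosen purely for illustration and is not taken from a real device:

```shell
#!/bin/sh
# 2^64 + 1: too wide for 64 bits, and not exactly representable as a double.
big=18446744073709551617

# Forcing a numeric conversion (n + 0) rounds the value to the nearest double.
as_double=$(awk -v n="$big" 'BEGIN { printf "%.0f\n", n + 0 }')

echo "as string: $big"
echo "as double: $as_double"
```

The two lines differ in their low-order digits, which is the loss that emitting the counter as a string avoids.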

@isomer Care to open a PR to fix this in the script?