prometheus-community / smartctl_exporter

Export smartctl statistics to prometheus
Apache License 2.0

Incorrectly reports smartctl_device_smart_status=0 on drives with passing status #229

Open antifuchs opened 4 months ago

antifuchs commented 4 months ago

I've just upgraded to 2cc2249821d6417fcfff8ef8d302205d7b37b44c from 0768a400a1378872eb940b45c5e0cedf0c213402, and something is wrong in the reporting of SMART status of SATA-connected SSDs. It reports smartctl_device_smart_status=0 on these, but I believe the values should be 1, according to what smartctl's JSON output reports for the drives.

Best to show an example:

:;    curl -s http://100.87.138.39:9633/metrics | grep smartctl_device_smart_status
# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="nvme0"} 1
smartctl_device_smart_status{device="sda"} 0
smartctl_device_smart_status{device="sdb"} 1
smartctl_device_smart_status{device="sdc"} 1
smartctl_device_smart_status{device="sdd"} 1
smartctl_device_smart_status{device="sde"} 1
smartctl_device_smart_status{device="sdf"} 1
smartctl_device_smart_status{device="sdg"} 1
smartctl_device_smart_status{device="sdh"} 1
smartctl_device_smart_status{device="sdi"} 1
smartctl_device_smart_status{device="sdj"} 0
smartctl_device_smart_status{device="sdk"} 0
smartctl_device_smart_status{device="sdl"} 1
:;    for d in sda sdj sdk ; do echo -n "$d: " ; sudo smartctl --json -a /dev/$d | jq .smart_status.passed ; done
sda: true
sdj: true
sdk: true
:;    for d in sda sdj sdk ; do echo -n "$d: " ; sudo smartctl --json -a /dev/$d | jq .model_name ; done
sda: "SuperMicro SSD"
sdj: "Samsung SSD 870 EVO 2TB"
sdk: "Samsung SSD 870 EVO 2TB"

I'm not sure what's going on there, but something is wrong and it's making my disk badness monitoring go off spuriously /:
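For anyone wanting to repeat this cross-check on their own machine, here is a rough sketch of it as a small shell helper. The function name `failing_devices` is made up for illustration and is not part of the exporter; it only parses the metrics text shown above, so it assumes the exposition format stays as in that output.

```shell
# Hypothetical helper (not part of smartctl_exporter): read Prometheus
# metrics text on stdin and print the device label of every
# smartctl_device_smart_status sample whose value is 0.
failing_devices() {
  awk 'index($0, "smartctl_device_smart_status{") == 1 && $NF == 0 {
         # Pull the value out of device="..."; the prefix device=" is 8 chars.
         if (match($0, /device="[^"]*"/))
           print substr($0, RSTART + 8, RLENGTH - 9)
       }'
}

# Usage sketch (address and smartctl/jq invocation as in the report above):
#   curl -s http://127.0.0.1:9633/metrics | failing_devices |
#     while read -r d; do
#       printf '%s: ' "$d"
#       sudo smartctl --json -a "/dev/$d" | jq .smart_status.passed
#     done
```

Any device that is printed with `true` next to it is one where the exporter and smartctl disagree, which is the situation described here for sda, sdj and sdk.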

k0ste commented 4 months ago

@antifuchs, the problem may be less mysterious if you show the debug log

antifuchs commented 4 months ago

Running:

smartctl_exporter \
     --log.level=debug \
     --smartctl.path=/nix/store/whfmc5r1irm9j3n9glzxc77cl50241y2-smartmontools-7.4/bin/smartctl \
     --smartctl.interval=10m \
     --web.listen-address=127.0.0.1:9633 2>&1 | tee ~mess/debug-log

yields this (which doesn't look particularly enlightening tbh):

ts=2024-05-11T19:34:38.019Z caller=main.go:167 level=info msg="Starting smartctl_exporter" version="(version=, branch=, revision=unknown)"
ts=2024-05-11T19:34:38.019Z caller=main.go:168 level=info msg="Build context" build_context="(go=go1.22.2, platform=linux/amd64, user=, date=, tags=unknown)"
ts=2024-05-11T19:34:38.020Z caller=readjson.go:79 level=debug msg="Scanning for devices"
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sda
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdb
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdc
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdd
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sde
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdf
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdg
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdh
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdi
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdj
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdk
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdl
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=nvme0
ts=2024-05-11T19:34:38.046Z caller=main.go:172 level=info msg="Number of devices found" count=13
ts=2024-05-11T19:34:38.046Z caller=main.go:185 level=info msg="Start background scan process"
ts=2024-05-11T19:34:38.047Z caller=main.go:186 level=info msg="Rescanning for devices every" rescanInterval=10m0s
ts=2024-05-11T19:34:38.069Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sda duration=21.995655ms
ts=2024-05-11T19:34:38.069Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sda family=unknown model=unknown
ts=2024-05-11T19:34:38.094Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdb duration=24.664627ms
ts=2024-05-11T19:34:38.094Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdb family=unknown model=unknown
ts=2024-05-11T19:34:38.129Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdc duration=34.2836ms
ts=2024-05-11T19:34:38.130Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdc family=unknown model=unknown
ts=2024-05-11T19:34:38.157Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdd duration=26.83563ms
ts=2024-05-11T19:34:38.157Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdd family=unknown model=unknown
ts=2024-05-11T19:34:38.183Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sde duration=25.518334ms
ts=2024-05-11T19:34:38.184Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sde family=unknown model=unknown
ts=2024-05-11T19:34:38.212Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdf duration=27.646302ms
ts=2024-05-11T19:34:38.212Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdf family=unknown model=unknown
ts=2024-05-11T19:34:38.247Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdg duration=34.147328ms
ts=2024-05-11T19:34:38.247Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdg family=unknown model=unknown
ts=2024-05-11T19:34:38.275Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdh duration=27.762252ms
ts=2024-05-11T19:34:38.275Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdh family=unknown model=unknown
ts=2024-05-11T19:34:38.309Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdi duration=33.025595ms
ts=2024-05-11T19:34:38.309Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdi family=unknown model=unknown
ts=2024-05-11T19:34:38.333Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdj duration=23.642763ms
ts=2024-05-11T19:34:38.333Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdj family=unknown model=unknown
ts=2024-05-11T19:34:38.354Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdk duration=20.821869ms
ts=2024-05-11T19:34:38.355Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdk family=unknown model=unknown
ts=2024-05-11T19:34:38.388Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdl duration=32.682864ms
ts=2024-05-11T19:34:38.388Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdl family=unknown model=unknown
ts=2024-05-11T19:34:38.415Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=nvme0 duration=25.873639ms
ts=2024-05-11T19:34:38.415Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=nvme0 family=unknown model="Samsung SSD 980 PRO 2TB"
ts=2024-05-11T19:34:38.417Z caller=tls_config.go:313 level=info msg="Listening on" address=127.0.0.1:9633
ts=2024-05-11T19:34:38.417Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9633
ts=2024-05-11T19:34:41.304Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sda family=unknown model=unknown
ts=2024-05-11T19:34:41.304Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdb family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdc family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdd family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sde family=unknown model=unknown
ts=2024-05-11T19:34:41.306Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdf family=unknown model=unknown
ts=2024-05-11T19:34:41.306Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdg family=unknown model=unknown
ts=2024-05-11T19:34:41.307Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdh family=unknown model=unknown
ts=2024-05-11T19:34:41.307Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdi family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdj family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdk family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdl family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=nvme0 family=unknown model="Samsung SSD 980 PRO 2TB"

k0ste commented 4 months ago

It seems your system is also affected by #205, because your NVMe device metrics were read correctly. Do you use packages from your distro? It would be better if the distro used the release tarball instead of the development repo.
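One pattern that is visible in the debug log earlier in this thread: every sdX device is logged with model=unknown, while nvme0 is logged with its model name. A minimal sketch for pulling that out of a saved debug log follows. The function name `unknown_model_devices` is hypothetical, and it assumes the log lines keep the `msg="Collecting metrics from" device=... model=...` layout shown above.

```shell
# Hypothetical helper: read exporter debug log lines on stdin and print each
# device (once) whose "Collecting metrics from" line carries model=unknown.
unknown_model_devices() {
  awk '/msg="Collecting metrics from"/ {
         dev = ""; model = ""
         for (i = 1; i <= NF; i++) {
           if ($i ~ /^device=/) dev = substr($i, 8)    # skip "device="
           if ($i ~ /^model=/)  model = substr($i, 7)  # skip "model="
         }
         if (model == "unknown" && !(dev in seen)) { seen[dev] = 1; print dev }
       }'
}

# Usage sketch, assuming the log was saved to a file named debug-log:
#   unknown_model_devices < debug-log
```

On the log posted above this would list sda through sdl and leave out nvme0.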

antifuchs commented 4 months ago

yeah, I have been building from source - that worked while the repo was semi-maintained (and I had pull reqs outstanding), but doesn't anymore. I will reconsider.