prometheus-community / smartctl_exporter

Export smartctl statistics to prometheus
Apache License 2.0

incorrect smartctl_devices count with hardware HBAs #236

Open aieri opened 1 month ago

aieri commented 1 month ago

I have a server in which smartctl_exporter reports an incorrect number of devices:

root@server:~# curl -s localhost:10201/metrics | grep devices
# HELP smartctl_devices Number of devices configured or dynamically discovered
# TYPE smartctl_devices gauge
smartctl_devices 6

root@server:~# lsblk -o NAME,MODEL,SERIAL -d | grep -v loop
NAME    MODEL            SERIAL
sda     MTFDDAV480TDS-1A 324EC57C
sdb     SSDSC2KB240G8L   PHYF207300QZ240AGN
sdc     SSDSC2KB240G8L   PHYF20740177240AGN
nvme0n1 SSDPF2KX019T9L   PHAB123403KS1P9SGN

iirc the exporter uses smartctl --scan in the readSMARTctlDevices function to collect the list of devices. Indeed, smartctl returns some duplicates:

root@server:~# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
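To make the duplication easier to see, here is a minimal Python sketch that parses the text output above into (name, type) pairs. It is only an illustration and not the exporter's code (the exporter is written in Go, and how it parses the scan is not shown in this thread); `ScanEntry` and `parse_scan` are hypothetical names.

```python
from typing import NamedTuple


class ScanEntry(NamedTuple):
    name: str         # device path, e.g. /dev/bus/0
    device_type: str  # value passed to -d, e.g. megaraid,0


def parse_scan(output: str) -> list[ScanEntry]:
    """Parse the text output of `smartctl --scan` into (name, type) pairs."""
    entries = []
    for line in output.splitlines():
        # Everything after '#' is a human-readable comment.
        spec = line.split("#", 1)[0].split()
        # Expected shape: <name> -d <type>
        if len(spec) == 3 and spec[1] == "-d":
            entries.append(ScanEntry(spec[0], spec[2]))
    return entries


SCAN_OUTPUT = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
"""

entries = parse_scan(SCAN_OUTPUT)
print(len(entries))  # 6 scan entries, versus the 4 disks that lsblk lists
```

The scan yields six entries because the two drives behind the controller are listed both as /dev/sdb and /dev/sdc and as megaraid,0 and megaraid,1.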

root@server:~# smartctl -i /dev/sdb | grep -i serial
Serial Number:    PHYF207300QZ240AGN
root@server:~# smartctl -i /dev/bus/0 -d megaraid,0 | grep -i serial
Serial Number:    PHYF207300QZ240AGN
root@server:~# smartctl -i /dev/sdc | grep -i serial
Serial Number:    PHYF20740177240AGN
root@server:~# smartctl -i /dev/bus/0 -d megaraid,1 | grep -i serial
Serial Number:    PHYF20740177240AGN

The exporter should probably have some extra logic to deduplicate devices that can be accessed in multiple ways. This should be possible by using either the serial number or the WWN as a unique identifier, e.g.:

root@server:~# smartctl -i /dev/bus/0 -d megaraid,1 --json | jq -r '.serial_number, .wwn.id'
PHYF20740177240AGN
5724752636
root@server:~# smartctl -i /dev/sdc --json | jq -r '.serial_number, .wwn.id'
PHYF20740177240AGN
5724752636
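
A rough sketch of that deduplication, in Python for brevity rather than the exporter's Go. It assumes the serial number has already been read for every scan entry (for example from the `serial_number` field of the JSON output shown above); the `dedupe` helper and the dictionary layout are hypothetical. The WWN could be used as the key in the same way.

```python
def dedupe(devices):
    """Keep the first access path seen for each serial number."""
    unique = {}
    for dev in devices:
        serial = dev.get("serial")
        # Without a serial there is nothing to compare, so key on the access path.
        key = serial if serial else (dev["name"], dev["type"])
        unique.setdefault(key, dev)
    return list(unique.values())


# Scan entries and serial numbers as reported in this issue.
devices = [
    {"name": "/dev/sda", "type": "scsi", "serial": "324EC57C"},
    {"name": "/dev/sdb", "type": "scsi", "serial": "PHYF207300QZ240AGN"},
    {"name": "/dev/sdc", "type": "scsi", "serial": "PHYF20740177240AGN"},
    {"name": "/dev/bus/0", "type": "megaraid,0", "serial": "PHYF207300QZ240AGN"},
    {"name": "/dev/bus/0", "type": "megaraid,1", "serial": "PHYF20740177240AGN"},
    {"name": "/dev/nvme0", "type": "nvme", "serial": "PHAB123403KS1P9SGN"},
]

unique_devices = dedupe(devices)
print(len(unique_devices))  # 4, matching the lsblk output
```

Keeping the first path seen is the simplest possible policy; it happens to prefer the direct /dev/sdX paths here only because the scan lists them first.
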

k0ste commented 1 month ago

> I have a server in which smartctl_exporter reports an incorrect number of devices:

Add the exclude regex --smartctl.device-exclude=^/dev/bus/[0-9]+$ to avoid scanning megaraid devices.
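
For illustration, this is the effect that pattern would have on the scan output from this issue, sketched in Python. It assumes the expression is matched against the device name only (the first column of smartctl --scan); how the exporter applies it internally is not shown in this thread.

```python
import re

# The pattern suggested above.
exclude = re.compile(r"^/dev/bus/[0-9]+$")

# Device names from the `smartctl --scan` output in this issue.
scanned = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/bus/0", "/dev/bus/0", "/dev/nvme0"]

kept = [name for name in scanned if not exclude.search(name)]
print(kept)  # ['/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/nvme0']
```

Both megaraid entries share the name /dev/bus/0, so one pattern removes them and leaves the four directly attached paths.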

aieri commented 1 month ago

thanks, this would indeed work given the specifics of this one server. I am however working on mass deployments of smartctl_exporter in which manual configuration is not feasible. While our automation could provide an autoconfiguration layer, it'd effectively duplicate what the exporter is already doing. I think solving this at the lowest layer would be preferable.

k0ste commented 1 month ago

> thanks, this would indeed work given the specifics of this one server. I am however working on mass deployments of smartctl_exporter in which manual configuration is not feasible. While our automation could provide an autoconfiguration layer, it'd effectively duplicate what the exporter is already doing. I think solving this at the lowest layer would be preferable.

Unfortunately, a magic solution at the lowest level is impossible. The administrator will still have to choose which polling protocol (sata or megaraid) takes priority. I offered you an option that you can put in your IaC. It is not a solution for a specific server; it is a solution for the megaraid controller. I do not think you will find any other controllers in your fleet; if you do, share the regular expression.

P.S.: see #205

aieri commented 1 month ago

I quite disagree that deduplicating via a unique key and applying some heuristic to choose the polling protocol is magic, and I also don't appreciate the bitterness. But sure, if this suggestion is unwelcome, I'll figure something out in a higher layer.
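
One possible shape for such a heuristic, sketched in Python under stated assumptions: devices are grouped by serial number, and an ordered preference list of device types decides which access path wins. The default order, the `rank` and `choose_paths` helpers, and the dictionary layout are all hypothetical and not part of smartctl_exporter.

```python
def rank(device_type, preference):
    """Lower rank wins; types missing from the preference list sort last."""
    family = device_type.split(",", 1)[0]  # "megaraid,0" -> "megaraid"
    return preference.index(family) if family in preference else len(preference)


def choose_paths(devices, preference=("nvme", "scsi", "megaraid")):
    """For each serial number, keep the access path with the best-ranked type."""
    best = {}
    for dev in devices:
        current = best.get(dev["serial"])
        if current is None or rank(dev["type"], preference) < rank(current["type"], preference):
            best[dev["serial"]] = dev
    return list(best.values())


devices = [
    {"name": "/dev/bus/0", "type": "megaraid,0", "serial": "PHYF207300QZ240AGN"},
    {"name": "/dev/sdb", "type": "scsi", "serial": "PHYF207300QZ240AGN"},
    {"name": "/dev/nvme0", "type": "nvme", "serial": "PHAB123403KS1P9SGN"},
]

direct_first = [d["name"] for d in choose_paths(devices)]
raid_first = [d["name"] for d in choose_paths(devices, preference=("megaraid", "scsi", "nvme"))]
print(direct_first)
print(raid_first)
```

Making the preference order a parameter keeps the administrator's choice of polling protocol explicit, while the default covers hosts where nothing is configured.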