I tried out this project on our servers and noticed that a disk failure is not handled correctly. The smartctl tool detects the failure and exits with code 8. On line 24 of the Python script, this non-zero exit code is treated as an error: an exception is raised and the script stops. As a result, we get no Prometheus metrics at all for the defective disk.
If I delete the statement that raises the exception, the metrics show up in Prometheus correctly and I am able to detect the disk failure.
I checked a similar project, https://github.com/PhilipMay/smart-prom-next/blob/main/smart_prom_next/smart_prom_next.py, and there a non-zero exit code is handled by logging a warning instead of raising an exception.
Exception: Command returned code 8. Stdout: '{"json_format_version":[1,0],"smartctl":{"version":[7,3],"svn_revision":"5338","platform_info":"x86_64-linux-4.18.0-305.49.1.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","-A","-H","-d","scsi","--json=c","/dev/sdaq"],"exit_status":8},"local_time":{"time_t":1662530983,"asctime":"Wed Sep 7 06:09:43 2022 UTC"},"device":{"name":"/dev/sdaq","info_name":"/dev/sdaq","type":"scsi","protocol":"SCSI"},"smart_status":{"passed":false,"scsi":{"asc":93,"ascq":50,"ie_string":"DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"}},"temperature":{"current":38,"drive_trip":60},"power_on_time":{"hours":8585,"minutes":43},"scsi_start_stop_cycle_counter":{"year_of_manufacture":"2021","week_of_manufacture":"25","specified_cycle_count_over_device_lifetime":50000,"accumulated_start_stop_cycles":54,"specified_load_unload_count_over_device_lifetime":600000,"accumulated_load_unload_cycles":402},"scsi_grown_defect_list":29083}' Stderr: ''
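A minimal sketch of the behaviour I have in mind, not the project's actual code: the function names `parse_smartctl_result` and `run_smartctl` and the constant `FATAL_MASK` are hypothetical. It relies on the smartctl exit status being a bitmask, where bits 0 and 1 mean the command line did not parse or the device could not be opened, and bit 3 (value 8) means the SMART status check reported a failing disk. Only the first two are treated as fatal here; any other non-zero status is logged as a warning and the JSON output is still parsed.

```python
import json
import logging
import subprocess

logger = logging.getLogger(__name__)

# Hypothetical constant: bits 0 and 1 of smartctl's exit status signal that
# the command line did not parse or that the device could not be opened.
FATAL_MASK = 0b11


def parse_smartctl_result(returncode, stdout, stderr=""):
    """Return the parsed JSON output unless the exit status is fatal."""
    if returncode & FATAL_MASK:
        raise RuntimeError(
            f"smartctl failed with code {returncode}. Stderr: {stderr!r}"
        )
    if returncode != 0:
        # Other bits (e.g. 8 = disk failing) describe the disk's health,
        # so the output is still valid and should be exported.
        logger.warning("smartctl returned code %d", returncode)
    return json.loads(stdout)


def run_smartctl(args):
    """Run smartctl with the given arguments and parse its JSON output."""
    proc = subprocess.run(
        ["smartctl", *args], capture_output=True, text=True, check=False
    )
    return parse_smartctl_result(proc.returncode, proc.stdout, proc.stderr)
```

With the output above, `run_smartctl(["-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"])` would log a warning for code 8 and still return the data, so `smart_status.passed` being `false` could be exported as a metric.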