Do not raise exception in case of non-zero exit code of smartctl

enrico2828 commented 2 years ago

I tried this project on our servers and noticed that a disk failure is not handled correctly. When smartctl detects a failing disk, it exits with code 8. The Python script treats this as an exception (line 24) and stops, so we get no Prometheus metrics at all for the defective disk. If I delete the statement that raises the exception, the metrics appear in Prometheus correctly and I can detect the disk failure. I checked a similar project, https://github.com/PhilipMay/smart-prom-next/blob/main/smart_prom_next/smart_prom_next.py, and there a non-zero exit code is handled by showing a warning instead of raising an exception.

Exception: Command returned code 8. Stdout: '{"json_format_version":[1,0],"smartctl":{"version":[7,3],"svn_revision":"5338","platform_info":"x86_64-linux-4.18.0-305.49.1.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","-A","-H","-d","scsi","--json=c","/dev/sdaq"],"exit_status":8},"local_time":{"time_t":1662530983,"asctime":"Wed Sep  7 06:09:43 2022 UTC"},"device":{"name":"/dev/sdaq","info_name":"/dev/sdaq","type":"scsi","protocol":"SCSI"},"smart_status":{"passed":false,"scsi":{"asc":93,"ascq":50,"ie_string":"DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"}},"temperature":{"current":38,"drive_trip":60},"power_on_time":{"hours":8585,"minutes":43},"scsi_start_stop_cycle_counter":{"year_of_manufacture":"2021","week_of_manufacture":"25","specified_cycle_count_over_device_lifetime":50000,"accumulated_start_stop_cycles":54,"specified_load_unload_count_over_device_lifetime":600000,"accumulated_load_unload_cycles":402},"scsi_grown_defect_list":29083}' Stderr: ''

{
    "json_format_version": [1, 0],
    "smartctl": {
        "version": [7, 3],
        "svn_revision": "5338",
        "platform_info": "x86_64-linux-4.18.0-305.49.1.el8_4.x86_64",
        "build_info": "(local build)",
        "argv": ["smartctl", "-A", "-H", "-d", "scsi", "--json=c", "/dev/sdaq"],
        "exit_status": 8
    },
    "local_time": {
        "time_t": 1662539260,
        "asctime": "Wed Sep  7 08:27:40 2022 UTC"
    },
    "device": {
        "name": "/dev/sdaq",
        "info_name": "/dev/sdaq",
        "type": "scsi",
        "protocol": "SCSI"
    },
    "smart_status": {
        "passed": false,
        "scsi": {
            "asc": 93,
            "ascq": 50,
            "ie_string": "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH"
        }
    },
    "temperature": {
        "current": 38,
        "drive_trip": 60
    },
    "power_on_time": {
        "hours": 8588,
        "minutes": 1
    },
    "scsi_start_stop_cycle_counter": {
        "year_of_manufacture": "2021",
        "week_of_manufacture": "25",
        "specified_cycle_count_over_device_lifetime": 50000,
        "accumulated_start_stop_cycles": 54,
        "specified_load_unload_count_over_device_lifetime": 600000,
        "accumulated_load_unload_cycles": 403
    },
    "scsi_grown_defect_list": 29083
}
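
For illustration, here is a minimal sketch of the tolerant handling requested above. It is not the project's actual code. smartctl's exit status is a bitmask, and per its man page bit 3 (value 8) means the SMART status check returned "DISK FAILING", so the JSON on stdout is still valid and worth exporting. The names `parse_smartctl_result`, `smart_status_passed` and `FATAL_BITS` are hypothetical, and treating only bits 0 and 1 (command line did not parse, device open failed) as fatal is an assumption of this sketch.

```python
import json
import logging

logger = logging.getLogger("smartctl_sketch")

# Assumption of this sketch: only exit-status bits 0 and 1 (command line did
# not parse, device open failed) mean that no usable device data was produced.
FATAL_BITS = 0b00000011


def parse_smartctl_result(returncode: int, stdout: str, stderr: str = "") -> dict:
    """Return the parsed smartctl JSON, tolerating non-fatal exit codes."""
    if returncode & FATAL_BITS:
        raise RuntimeError(
            f"smartctl returned code {returncode}. Stderr: {stderr!r}"
        )
    if returncode != 0:
        # e.g. code 8: the disk is failing, but the report itself is complete.
        logger.warning("smartctl returned code %d; parsing output anyway", returncode)
    return json.loads(stdout)


def smart_status_passed(result: dict) -> int:
    """Map smart_status.passed to a gauge-style value: 1 healthy, 0 failing."""
    return 1 if result.get("smart_status", {}).get("passed") else 0


# Trimmed-down version of the output shown in this issue.
sample = (
    '{"smartctl": {"exit_status": 8},'
    ' "smart_status": {"passed": false},'
    ' "temperature": {"current": 38}}'
)
result = parse_smartctl_result(8, sample)
print(smart_status_passed(result), result["temperature"]["current"])
```

With this shape, the failing disk still yields a health value of 0 and its temperature reading, which is exactly the information that was lost when the exception aborted the script.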

matt400 commented 2 years ago

Do we need better error handling? The change from #43 will spam this long stdout every time some command executes. Or maybe we could add a debug mode.
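
One possible shape for the debug-mode idea, sketched with the standard `logging` module: a short one-line warning is always emitted, while the full stdout/stderr dump only appears when the logger is set to DEBUG. The function name `report_nonzero_exit` and the logger name are hypothetical and not taken from the project.

```python
import logging
from typing import Sequence

logger = logging.getLogger("smartctl_debug_sketch")


def report_nonzero_exit(
    command: Sequence[str], returncode: int, stdout: str, stderr: str
) -> None:
    """Log a one-line warning; keep the full output for DEBUG level only."""
    logger.warning("Command %s returned code %d", " ".join(command), returncode)
    # Dropped unless the logger's effective level is DEBUG.
    logger.debug("Stdout: %r Stderr: %r", stdout, stderr)
```

Calling `logging.basicConfig(level=logging.DEBUG)` at startup would then restore the verbose output when it is needed for troubleshooting.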

ngosang commented 2 years ago

Fixed in https://github.com/matusnovak/prometheus-smartctl/commit/4807aeac41a9e82d09317616d36cd58346fcf214 v2.1.1

matusnovak / prometheus-smartctl

Do not raise exception in case of non-zero exit code of smartctl #42