prometheus-community / node-exporter-textfile-collector-scripts

Scripts for node-exporter's textfile collector
Apache License 2.0

smartmon.sh can't collect Drive_Life_Remaining% #9

Open Liamlu28 opened 5 years ago

Liamlu28 commented 5 years ago

Attribute names that contain special characters cannot be parsed. Examples: Drive_Life_Remaining%, SSD_LifeLeft(0.01%)
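To illustrate the kind of handling such names need, here is a minimal Python sketch that turns a raw attribute name into a lowercase snake_case label value. The function name `normalize_attr_name` and the choice to spell out `%` as `_percent` are assumptions made for illustration; this is not how smartmon.sh or smartmon.py is known to behave.

```python
import re


def normalize_attr_name(raw: str) -> str:
    """Turn a raw SMART attribute name into a lowercase snake_case label value."""
    # Spell out '%' so its meaning survives the cleanup (an illustrative choice).
    name = raw.replace("%", "_percent")
    # Collapse every run of non-alphanumeric characters into a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


for raw in ("Drive_Life_Remaining%", "SSD_LifeLeft(0.01%)", "Power_On_Hours"):
    print(raw, "->", normalize_attr_name(raw))
```

With this approach, parentheses, dots and percent signs no longer reach the metric output, at the cost of two differently spelled raw names possibly mapping to the same label.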

laa88rf commented 3 years ago

+1

dswarbrick commented 3 years ago

Are you able to use the smartmon.py script instead?

laa88rf commented 3 years ago

Hello. Thank you for your reply. Of course, we tried smartmon.py:

# ./smartmon.py_NEWEST | grep -v "#"
smartmon_smartctl_version{version="7.2"} 1
smartmon_attr_raw_value{name="raw_read_error_rate",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="power_on_hours",device="/dev/sda",disk="0"} 30079
smartmon_attr_raw_value{name="power_cycle_count",device="/dev/sda",disk="0"} 6
smartmon_attr_raw_value{name="program_fail_count",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="reported_uncorrect",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="temperature_celsius",device="/dev/sda",disk="0"} 38
smartmon_attr_raw_value{name="reallocated_event_count",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="offline_uncorrectable",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="udma_crc_error_count",device="/dev/sda",disk="0"} 0
smartmon_attr_raw_value{name="total_lbas_written",device="/dev/sda",disk="0"} 72636214020
smartmon_attr_raw_value{name="raw_read_error_rate",device="/dev/sdb",disk="0"} 2817580
smartmon_attr_raw_value{name="power_on_hours",device="/dev/sdb",disk="0"} 56476
smartmon_attr_raw_value{name="power_cycle_count",device="/dev/sdb",disk="0"} 15
smartmon_attr_raw_value{name="program_fail_count",device="/dev/sdb",disk="0"} 0
smartmon_attr_raw_value{name="reported_uncorrect",device="/dev/sdb",disk="0"} 1851
smartmon_attr_raw_value{name="temperature_celsius",device="/dev/sdb",disk="0"} 35
smartmon_attr_raw_value{name="reallocated_event_count",device="/dev/sdb",disk="0"} 2
smartmon_attr_raw_value{name="offline_uncorrectable",device="/dev/sdb",disk="0"} 0
smartmon_attr_raw_value{name="udma_crc_error_count",device="/dev/sdb",disk="0"} 1
smartmon_attr_raw_value{name="total_lbas_written",device="/dev/sdb",disk="0"} 134617775539
smartmon_attr_threshold{name="raw_read_error_rate",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="power_on_hours",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="power_cycle_count",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="program_fail_count",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="reported_uncorrect",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="temperature_celsius",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="reallocated_event_count",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="offline_uncorrectable",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="udma_crc_error_count",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="total_lbas_written",device="/dev/sda",disk="0"} 0
smartmon_attr_threshold{name="raw_read_error_rate",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="power_on_hours",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="power_cycle_count",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="program_fail_count",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="reported_uncorrect",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="temperature_celsius",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="reallocated_event_count",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="offline_uncorrectable",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="udma_crc_error_count",device="/dev/sdb",disk="0"} 0
smartmon_attr_threshold{name="total_lbas_written",device="/dev/sdb",disk="0"} 0
smartmon_attr_value{name="raw_read_error_rate",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="power_on_hours",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="power_cycle_count",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="program_fail_count",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="reported_uncorrect",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="temperature_celsius",device="/dev/sda",disk="0"} 62
smartmon_attr_value{name="reallocated_event_count",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="offline_uncorrectable",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="udma_crc_error_count",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="total_lbas_written",device="/dev/sda",disk="0"} 100
smartmon_attr_value{name="raw_read_error_rate",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="power_on_hours",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="power_cycle_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="program_fail_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="reported_uncorrect",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="temperature_celsius",device="/dev/sdb",disk="0"} 65
smartmon_attr_value{name="reallocated_event_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="offline_uncorrectable",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="udma_crc_error_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_value{name="total_lbas_written",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="raw_read_error_rate",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="power_on_hours",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="power_cycle_count",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="program_fail_count",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="reported_uncorrect",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="temperature_celsius",device="/dev/sda",disk="0"} 49
smartmon_attr_worst{name="reallocated_event_count",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="offline_uncorrectable",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="udma_crc_error_count",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="total_lbas_written",device="/dev/sda",disk="0"} 100
smartmon_attr_worst{name="raw_read_error_rate",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="power_on_hours",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="power_cycle_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="program_fail_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="reported_uncorrect",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="temperature_celsius",device="/dev/sdb",disk="0"} 49
smartmon_attr_worst{name="reallocated_event_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="offline_uncorrectable",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="udma_crc_error_count",device="/dev/sdb",disk="0"} 100
smartmon_attr_worst{name="total_lbas_written",device="/dev/sdb",disk="0"} 100
smartmon_device_active{device="/dev/sda",disk="0"} 1
smartmon_device_active{device="/dev/sdb",disk="0"} 1
smartmon_device_errors{device="/dev/sda",disk="0"} 0
smartmon_device_errors{device="/dev/sdb",disk="0"} 1851
smartmon_device_info{device="/dev/sda",disk="0",model_family="Crucial/Micron Client SSDs",device_model="Micron_1100_MTFDDAK256TBN",serial_number="17******3",firmware_version="M0MU031"} 1
smartmon_device_info{device="/dev/sdb",disk="0",model_family="Crucial/Micron Client SSDs",device_model="Crucial_CT256MX100SSD1",serial_number="14******5",firmware_version="MU03"} 1
smartmon_device_smart_available{device="/dev/sda",disk="0"} 1
smartmon_device_smart_available{device="/dev/sdb",disk="0"} 1
smartmon_device_smart_enabled{device="/dev/sda",disk="0"} 1
smartmon_device_smart_enabled{device="/dev/sdb",disk="0"} 1
smartmon_device_smart_healthy{device="/dev/sda",disk="0"} 0
smartmon_device_smart_healthy{device="/dev/sdb",disk="0"} 1
smartmon_smartctl_run{device="/dev/sda",disk="0"} 1616051028
smartmon_smartctl_run{device="/dev/sdb",disk="0"} 1616051028

# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-6-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     Micron_1100_MTFDDAK256TBN
Serial Number:    17*****3
LU WWN Device Id: 5 00a075 115acfd33
Firmware Version: M0MU031
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 18 07:05:02 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

General SMART Values:
Offline data collection status:  (0x06) Offline data collection activity
                    was aborted by the device with a fatal error.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  654) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (   4) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x0035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       30079
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       1597
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       2
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   049   000    Old_age   Always       -       38 (Min/Max 22/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       72636216364
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       2308805595
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       13846706729
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       2056
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)       Completed without error       00%     29688         -
# 2  Vendor (0xff)       Completed without error       00%     29605         -
# 3  Vendor (0xff)       Completed without error       00%     28706         -
# 4  Vendor (0xff)       Completed without error       00%     27823         -
# 5  Vendor (0xff)       Completed without error       00%     26820         -
# 6  Vendor (0xff)       Completed without error       00%     25728         -
# 7  Vendor (0xff)       Completed without error       00%     24587         -
# 8  Vendor (0xff)       Completed without error       00%     23338         -
# 9  Vendor (0xff)       Completed without error       00%     22109         -
#10  Vendor (0xff)       Completed without error       00%     21130         -
#11  Vendor (0xff)       Completed without error       00%     20294         -
#12  Vendor (0xff)       Completed without error       00%     19360         -
#13  Vendor (0xff)       Completed without error       00%     18370         -
#14  Vendor (0xff)       Completed without error       00%     17719         -
#15  Extended offline    Completed without error       00%     17592         -
#16  Extended offline    Completed without error       00%     17580         -
#17  Vendor (0xff)       Completed without error       00%     17543         -
#18  Vendor (0xff)       Completed without error       00%     17423         -
#19  Vendor (0xff)       Completed without error       00%     17303         -
#20  Extended offline    Completed without error       00%     17195         -
#21  Short offline       Completed without error       00%     17194         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


202 Percent_Lifetime_Remain 0x0030 000 000 001 Old_age Offline FAILING_NOW 100
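The row for attribute 202 is present in the smartctl table but has no counterpart in the smartmon.py output quoted earlier. As a rough sketch of how a single table row could be turned into metrics regardless of its name, the Python below matches the ten columns of the table and emits four lines in the same `smartmon_attr_*` naming seen in this thread. The regular expression, the function name `attr_metrics` and the reduced label set (no `disk` label) are assumptions for illustration, not the actual smartmon.py implementation.

```python
import re

# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def attr_metrics(line: str, device: str) -> list:
    """Return metric lines for one attribute row, or [] if the line is not a row."""
    match = ATTR_ROW.match(line)
    if match is None:
        return []
    name = match.group("name").lower()
    labels = 'name="%s",device="%s"' % (name, device)
    # int() drops the zero padding smartctl uses for VALUE, WORST and THRESH.
    return [
        "smartmon_attr_value{%s} %d" % (labels, int(match.group("value"))),
        "smartmon_attr_worst{%s} %d" % (labels, int(match.group("worst"))),
        "smartmon_attr_threshold{%s} %d" % (labels, int(match.group("thresh"))),
        "smartmon_attr_raw_value{%s} %d" % (labels, int(match.group("raw"))),
    ]


row = "202 Percent_Lifetime_Remain 0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100"
print("\n".join(attr_metrics(row, "/dev/sda")))
```

Only the leading integer of RAW_VALUE is captured, so a row such as Temperature_Celsius with `38 (Min/Max 22/51)` yields 38.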

laa88rf commented 3 years ago

There is also another project, https://github.com/micha37-martins/S.M.A.R.T-disk-monitoring-for-Prometheus , and my pull request for it (pre-alpha): https://github.com/laa88rf/S.M.A.R.T-disk-monitoring-for-Prometheus/blob/master/smartmon.sh

Could you please have a look?

hansmi commented 3 years ago

Some SSDs also report health information via the so-called "ATA Device Statistics". #68 implements support for those in smartmon.py.
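For readers unfamiliar with those statistics, smartctl can print them with `-l devstat`, and recent versions can emit JSON with `--json`. The Python sketch below flattens such a report into metric lines. It is not taken from #68: the JSON layout (`ata_device_statistics` → `pages` → `table` entries with `name` and `value`), the metric name `smartmon_devstat_value` and the sample data are all assumptions made for illustration and should be checked against real `smartctl --json -l devstat` output.

```python
import re


def devstat_metrics(report: dict, device: str) -> list:
    """Flatten the assumed 'ata_device_statistics' JSON section into metric lines."""
    lines = []
    pages = report.get("ata_device_statistics", {}).get("pages", [])
    for page in pages:
        for entry in page.get("table", []):
            name, value = entry.get("name"), entry.get("value")
            # Skip entries without a usable name or a numeric value.
            if not name or isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            # Statistic names contain spaces, so reduce them to snake_case.
            label = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
            lines.append(
                'smartmon_devstat_value{name="%s",device="%s"} %s' % (label, device, value)
            )
    return lines


# Hand-written sample in the assumed layout; not output from a real drive.
sample = {
    "ata_device_statistics": {
        "pages": [
            {
                "number": 7,
                "name": "Solid State Device Statistics",
                "table": [{"name": "Percentage Used Endurance Indicator", "value": 100}],
            }
        ]
    }
}
print("\n".join(devstat_metrics(sample, "/dev/sda")))
```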