librenms / librenms-agent

LibreNMS Agent & Scripts
GNU General Public License v2.0

Smart agent reports only null and zero values for SAS drives #390

Open snowsnoot opened 2 years ago

snowsnoot commented 2 years ago

The smart script prints only 'null' and '0' values for my SAS drives. I get a few more values for my SATA SSD, but still several nulls:

sda is the SSD; sdb through sdi are SAS drives (HP MB3000FBUCN).

# ./smart -c /etc/snmp/smart.config
sda,null,null,0,null,null,null,0,null,null,22,0,null,null,0,98,3649,0,0,0,0,0,0,0,0,9319
sdb,0,null,null,null,null,null,null,null,null,38,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdc,0,null,null,null,null,null,null,null,null,39,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdd,47,null,null,null,null,null,null,null,null,37,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sde,244,null,null,null,null,null,null,null,null,40,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdf,0,null,null,null,null,null,null,null,null,39,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdg,105,null,null,null,null,null,null,null,null,38,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdh,1,null,null,null,null,null,null,null,null,39,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null
sdi,63,null,null,null,null,null,null,null,null,37,null,null,null,null,null,null,0,0,0,0,0,0,0,0,null

Config file:

# cat /etc/snmp/smart.config
useSN=0
smartctl=/usr/sbin/smartctl
cache=/var/cache/smart/cache
sda /dev/sda -d sat
sdb /dev/sdb -d scsi
sdc /dev/sdc -d scsi
sdd /dev/sdd -d scsi
sde /dev/sde -d scsi
sdf /dev/sdf -d scsi
sdg /dev/sdg -d scsi
sdh /dev/sdh -d scsi
sdi /dev/sdi -d scsi

napaster commented 2 years ago

Same story with SAS disks.

JvGinkel commented 2 years ago

The smart script parses the smartctl output, picks out specific attribute IDs, and prints those, as you can see here: https://github.com/librenms/librenms-agent/blob/master/snmp/smart#L442-L444

$toReturn=$toReturn.$disk_id.','.$IDs{'5'}.','.$IDs{'10'}.','.$IDs{'173'}.','.$IDs{'177'}.','.$IDs{'183'}.','.$IDs{'184'}.','.$IDs{'187'}.','.$IDs{'188'}
        .','.$IDs{'190'} .','.$IDs{'194'}.','.$IDs{'196'}.','.$IDs{'197'}.','.$IDs{'198'}.','.$IDs{'199'}.','.$IDs{'231'}.','.$IDs{'233'}.','.
        $completed.','.$interrupted.','.$read_failure.','.$unknown_failure.','.$extended.','.$short.','.$conveyance.','.$selective.','.$IDs{'9'}."\n";
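
To make the CSV rows in this thread easier to read, the field order in the Perl line above can be written out and zipped against a row. This is only a sketch in Python (the agent itself is Perl); `FIELDS` and `decode_row` are names made up for illustration and are not part of the agent.

```python
# Field order copied from the Perl concatenation above: the disk id,
# sixteen SMART attribute IDs, eight self-test counters, then ID 9.
# FIELDS and decode_row are hypothetical helper names, not part of the agent.
FIELDS = (
    ["disk"]
    + ["id%s" % n for n in (5, 10, 173, 177, 183, 184, 187, 188,
                            190, 194, 196, 197, 198, 199, 231, 233)]
    + ["completed", "interrupted", "read_failure", "unknown_failure",
       "extended", "short", "conveyance", "selective"]
    + ["id9"]
)


def decode_row(line):
    """Map one CSV row printed by the smart script onto the field names."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    # Attributes the script could not find show up as the literal string "null".
    return {name: (None if v == "null" else v) for name, v in zip(FIELDS, values)}


if __name__ == "__main__":
    # The sdb row from the original report.
    row = decode_row("sdb,0,null,null,null,null,null,null,null,null,38,"
                     "null,null,null,null,null,null,0,0,0,0,0,0,0,0,null")
    # Show only the fields that actually carry a value.
    print({k: v for k, v in row.items() if v is not None})
```

Run against the sdb row, only the ID 5 slot, the ID 194 slot, and the eight self-test counters carry a value; every other attribute slot is null.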

So you could run smartctl on the CLI, see which IDs it returns and with what values, and check whether they match what the script produces.

For example, one of my disks gives:

smartctl -a /dev/sda 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       98683
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       53
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   056   048   000    Old_age   Always       -       44
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       27
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       30079114404

You can see that, for example, IDs 10 and 173 are missing, so they come out as null in the script output because there is nothing to parse.
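
The effect described here can be reproduced with a simplified parser. The sketch below is Python rather than the agent's Perl, and it is not the script's actual regex: `ATTR_ROW`, `WANTED_IDS`, and `collect_ids` are hypothetical names, and it always takes the first token of RAW_VALUE, whereas the real script may read a different column for some attributes.

```python
import re

# One row of the ATA attribute table: ID, name, flag, value, worst,
# threshold, type, updated, when_failed, then the raw value.
# Only the ID and the first token of the raw value are captured.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)"
)

# The attribute IDs that appear in the script's output line (assumed list,
# read off the Perl snippet quoted earlier in this comment).
WANTED_IDS = ("5", "10", "173", "177", "183", "184", "187", "188",
              "190", "194", "196", "197", "198", "199", "231", "233", "9")


def collect_ids(smartctl_output):
    """Return {id: raw_value}, with "null" for every ID that never appears."""
    ids = {wanted: "null" for wanted in WANTED_IDS}
    for line in smartctl_output.splitlines():
        match = ATTR_ROW.match(line)
        if match and match.group(1) in ids:
            ids[match.group(1)] = match.group(2)
    return ids


if __name__ == "__main__":
    sample = (
        "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0\n"
        "  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       98683\n"
    )
    print(collect_ids(sample))
```

Any output without rows in this table layout leaves every entry at "null", because no line matches the row pattern.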

rci-kmccolm commented 2 years ago

I think the issue is with how the script parses the smartctl output. Here is what smartctl prints for my SAS drive. As you can see, it is very different from what you posted.

# smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.16-200.fc36.x86_64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB3000FBUCN
Revision:             HPD2
Compliance:           SPC-3
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca01a7830fc
Serial number:        YHJ4342D
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Nov  3 09:39:26 2022 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     40 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 52796:53
Manufactured in week 12 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  136
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2183
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   446238         0         0          0     146784.484           0
write:         0 49852631         0  49852631          0     209954.084           0
verify:        0       18         0        18          0        437.992           0

Non-medium error count:     2204

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   27867                 - [-   -    -]
# 2  Background short  Completed                   -   25967                 - [-   -    -]
# 3  Background short  Completed                   -   25967                 - [-   -    -]
# 4  Background short  Completed                   -   25898                 - [-   -    -]
# 5  Background long   Completed                   -   25898                 - [-   -    -]
# 6  Background short  Completed                   -      24                 - [-   -    -]
# 7  Background short  Completed                   -      21                 - [-   -    -]

Long (extended) Self-test duration: 27182 seconds [7.6 hours]