v-zhuravlev / zbx-smartctl

Templates and scripts for monitoring disks health with Zabbix and smartmontools
https://share.zabbix.com/storage-devices/smartmontools/smart-monitoring-with-smartmontools-lld
GNU General Public License v3.0

Needs rewrite for SAS #24

Closed rkarlsba closed 5 years ago

rkarlsba commented 7 years ago

Thanks for the software. It works well with SATA drives, but not with SAS drives, because smartctl prints their data in a different format. See the output below.

root@cctvfs:~/src/git/zbx-smartctl/discovery-scripts/nix# smartctl -x /dev/sdag
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0023
Revision:             0004
Compliance:           SPC-4
User Capacity:        4 000 787 030 016 bytes [4,00 TB]
Logical block size:   512 bytes
LB provisioning type: unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005844896b
Serial number:        Z1Z3EE9700009430THUT
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Feb 14 02:28:31 2017 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     46 C
Drive Trip Temperature:        60 C

Manufactured in week 08 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  136
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1061
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2812943521
  Blocks received from initiator = 1862020047
  Blocks read from cache and sent to initiator = 1967388513
  Number of read and write commands whose size <= segment size = 1104067037
  Number of read and write commands whose size > segment size = 98874

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 22332,68
  number of minutes until next internal SMART test = 17

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2959553965        0         0  2959553965          0      87202,134           0
write:         0        0         0         0          0     296033,679           0

Non-medium error count:        1

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 22332:41 [1339961 minutes]
    Number of background scans performed: 250,  scan progress: 5,23%
    Number of background medium scans performed: 250

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c50058448969
    attached SAS address = 0x50030480013e6bbf
    attached phy identifier = 18
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c5005844896a
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

root@cctvfs:~/src/git/zbx-smartctl/discovery-scripts/nix#

v-zhuravlev commented 7 years ago

Hi, do you mean that it can't parse the items for the UserParameters? The UserParameters use smartctl -A, not -x.

In other words, can you show the smartctl -A output and point out what exactly is different?

Thank you

rkarlsba commented 7 years ago

-A is only a subset of -x. From the manual:

   -a, --all
          Prints all SMART information about the disk, or TapeAlert information about the tape drive or changer.  For ATA devices this is equivalent to
          ´-H -i -c -A -l error -l selftest -l selective´
          and for SCSI, this is equivalent to
          ´-H -i -A -l error -l selftest´.
          Note that for ATA disks this does not enable the non-SMART options and the SMART options which require support for 48-bit ATA commands.

   -x, --xall
          Prints all SMART and non-SMART information about the device. For ATA devices this is equivalent to
          ´-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest
          -l selective -l directory -l scttemp -l scterc -l devstat -l sataphy´.
          and for SCSI, this is equivalent to
          ´-H -i -A -l error -l selftest -l background -l sasphy´.
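The equivalences quoted from the manual can be written down as data, which makes it easy to see what -x adds on top of -a for a SCSI (SAS) device. This is only a small Python sketch: the option strings are copied from the excerpt above, while `EXPANSIONS`, `option_pairs` and `extra_in_xall` are names made up for illustration and are not part of smartmontools or of this repository.

```python
# Option expansions copied from the smartctl manual excerpt quoted above.
# EXPANSIONS, option_pairs and extra_in_xall are illustrative names only.
EXPANSIONS = {
    ("-a", "ata"): "-H -i -c -A -l error -l selftest -l selective",
    ("-a", "scsi"): "-H -i -A -l error -l selftest",
    ("-x", "ata"): (
        "-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest "
        "-l selective -l directory -l scttemp -l scterc -l devstat -l sataphy"
    ),
    ("-x", "scsi"): "-H -i -A -l error -l selftest -l background -l sasphy",
}


def option_pairs(expansion):
    """Split an expansion into options, keeping arguments such as '-l error' attached."""
    options = []
    for token in expansion.split():
        if token.startswith("-") or not options:
            options.append(token)
        else:
            options[-1] += " " + token
    return options


def extra_in_xall(device_type):
    """Return the options that -x adds on top of -a for 'ata' or 'scsi'."""
    base = set(option_pairs(EXPANSIONS[("-a", device_type)]))
    return [o for o in option_pairs(EXPANSIONS[("-x", device_type)]) if o not in base]


print(extra_in_xall("scsi"))  # ['-l background', '-l sasphy']
```

Both expansions contain -A, so on a SAS drive everything that -A prints also shows up in the -x output; the background scan log and the SAS phy log are the additions.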

For reference, here is the -A output:

root@cctvfs:~# smartctl -A /dev/sdag
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     46 C
Drive Trip Temperature:        60 C

Manufactured in week 08 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  136
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1061
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2959818316
  Blocks received from initiator = 2062697542
  Blocks read from cache and sent to initiator = 1967835229
  Number of read and write commands whose size <= segment size = 1106135878
  Number of read and write commands whose size > segment size = 98910

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 22342.05
  number of minutes until next internal SMART test = 56

root@cctvfs:~#
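Unlike the ATA attribute table, the SAS output above is a set of free-form "label: value" lines, so each value has to be matched by its label. A minimal Python sketch of that idea follows. It is not the parser used by this repository; the key names (`temperature_c`, `grown_defects`, and so on) are invented here, and the sample text is taken from the paste above.

```python
import re

# Labels as they appear in the SAS `smartctl -A` output pasted above.
# The dictionary keys are invented for this sketch.
SCSI_PATTERNS = {
    "temperature_c": re.compile(r"Current Drive Temperature:\s+(\d+)\s+C"),
    "trip_temperature_c": re.compile(r"Drive Trip Temperature:\s+(\d+)\s+C"),
    "start_stop_cycles": re.compile(r"Accumulated start-stop cycles:\s+(\d+)"),
    "load_unload_cycles": re.compile(r"Accumulated load-unload cycles:\s+(\d+)"),
    "grown_defects": re.compile(r"Elements in grown defect list:\s+(\d+)"),
}

# The decimal separator depends on the locale (22332,68 vs 22342.05 above).
HOURS_PATTERN = re.compile(r"number of hours powered up = (\d+)[.,](\d+)")


def parse_scsi_attributes(text):
    """Extract the integer values that are present; absent labels are skipped."""
    result = {}
    for key, pattern in SCSI_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[key] = int(match.group(1))
    hours = HOURS_PATTERN.search(text)
    if hours:
        result["power_on_hours"] = float(hours.group(1) + "." + hours.group(2))
    return result


SAMPLE = """\
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     46 C
Drive Trip Temperature:        60 C

Accumulated start-stop cycles:  136
Accumulated load-unload cycles:  1061
Elements in grown defect list: 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 22332,68
"""

print(parse_scsi_attributes(SAMPLE))
```

Matching by label rather than by position means a drive that omits a line (the SSD output later in this thread has no factory information block, for example) simply yields fewer keys instead of shifting values.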

v-zhuravlev commented 7 years ago

Interesting... Thanks for letting me know, I'll try to find time to include that.

leezout commented 7 years ago

Any updates on that? I am facing the exact same issue, though under Zabbix 3.2 (using the 3.0 template).

colttt commented 6 years ago

My output looks a little bit different:

normal SAS3-Harddisk:

root@zfs-serv1:/home/skrueger# smartctl -A /dev/disk/by-partlabel/j1d20-hdd 
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 27 of year 2017
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  56
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  3067
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 51448
  Blocks received from initiator = 43184
  Blocks read from cache and sent to initiator = 334419
  Number of read and write commands whose size <= segment size = 214
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 1668.58
  number of minutes until next internal SMART test = 59
smartctl -x /dev/disk/by-partlabel/j1d20-hdd  
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0095
Revision:             E002
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50094425db3
Serial number:        ZAD1W0YX0000C75038DY
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Nov  6 10:22:27 2017 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 27 of year 2017
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  56
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  3067
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 51448
  Blocks received from initiator = 43184
  Blocks read from cache and sent to initiator = 334419
  Number of read and write commands whose size <= segment size = 214
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 1668.58
  number of minutes until next internal SMART test = 59

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      56244        0         0     56244          0          0.026           0
write:         0        0         0         0          0          0.187           0

Non-medium error count:        0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 1668:35 [100115 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c50094425db1
    attached SAS address = 0x500304800910187f
    attached phy identifier = 36
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 1
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 1
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c50094425db2
    attached SAS address = 0x50030480091018ff
    attached phy identifier = 36
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0

SAS3 SSD:

smartctl -A /dev/disk/by-partlabel/j1d01-ssd 
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     28 C
Drive Trip Temperature:        70 C

Manufactured in week 30 of year 2017
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
defect list format 6 unknown
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3400674575108
smartctl -x /dev/disk/by-partlabel/j1d01-ssd 
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMM1640ASS200
Revision:             A360
Compliance:           SPC-4
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca04ec26364
Serial number:        0QYEX3AA
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Nov  6 10:20:33 2017 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     28 C
Drive Trip Temperature:        70 C

Manufactured in week 30 of year 2017
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
defect list format 6 unknown
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3400674574336

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0         63.824           0
write:         0        0         0         0          0         64.403           0

Non-medium error count:        0

No self-tests have been logged

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 1667:36 [100056 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 6
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca04ec26365
    attached SAS address = 0x500304800910187f
    attached phy identifier = 1
    Invalid DWORD count = 4
    Running disparity error count = 4
    Loss of DWORD synchronization = 3
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 4
     Running disparity error count: 4
     Loss of dword synchronization count: 3
     Phy reset problem count: 0
relative target port id = 2
  generation code = 6
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca04ec26366
    attached SAS address = 0x50030480091018ff
    attached phy identifier = 1
    Invalid DWORD count = 4
    Running disparity error count = 4
    Loss of DWORD synchronization = 3
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 4
     Running disparity error count: 4
     Loss of dword synchronization count: 3
     Phy reset problem count: 0

I think the important ones are:

Maybe that's all?!

sqwit commented 6 years ago

Any update on that?

v-zhuravlev commented 6 years ago

As a start: https://github.com/v-zhuravlev/zbx-smartctl/tree/sas. This branch has the following changes: a template for 3.4 and new scripts that should catch:

colttt commented 6 years ago

Just for information: the next version of smartmontools will have a JSON output format, which makes it a lot easier to parse. https://www.smartmontools.org/ticket/766

@v-zhuravlev I will try it next week
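To illustrate why JSON output would simplify things, here is a hedged Python sketch that reads a few fields from an already parsed report instead of matching text lines. Several things are assumptions not confirmed in this thread: that the report is produced with something like `smartctl --json -a /dev/sdX`, and that it uses keys such as `device.protocol`, `smart_status.passed` and `temperature.current`. The sample document below is hand-written for the example, not real smartctl output, so check the key names against your own smartmontools version.

```python
import json


def dig(node, *path):
    """Walk nested dictionaries; return None as soon as a key is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def summarize_smart_json(report):
    """Pick a few fields out of a parsed report. Key names are assumptions."""
    return {
        "protocol": dig(report, "device", "protocol"),
        "passed": dig(report, "smart_status", "passed"),
        "temperature_c": dig(report, "temperature", "current"),
    }


# Hand-written sample, NOT captured from a real drive.
SAMPLE_JSON = """
{
  "device": {"name": "/dev/sdag", "protocol": "SCSI"},
  "smart_status": {"passed": true},
  "temperature": {"current": 46}
}
"""

print(summarize_smart_json(json.loads(SAMPLE_JSON)))
```

The same lookup code would serve SATA and SAS drives alike; fields a drive does not report come back as None rather than breaking the parser.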

v-zhuravlev commented 6 years ago

Thank you, that’s interesting


yanky83 commented 5 years ago

Currently struggling with the same issue.

Was wondering: it seems SATA provides a lot more information than SAS.

How would one distinguish between SAS and SATA in terms of created items? Guess one does not want to have a bunch of unsupported items for SAS?

Is it possible to define a condition, which decides if an item (prototype) gets created or not?

I could not find such an option in Zabbix (maybe I'm missing something).

The only way I could think of so far is to create separate discovery rules, but that would mean duplicating a lot of the item prototypes, triggers, etc.

Thoughts?
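One way to approach the question above is to have the discovery script report the transport of each disk in its low-level discovery JSON, so that discovery rules can filter on it. The Python sketch below is only an illustration of that idea, not what this repository's scripts do. The macro names `{#DISKNAME}` and `{#SMART_TRANSPORT}` are hypothetical, the "Transport protocol:" line is taken from the SAS output earlier in this thread, and the "SATA Version is:" line is an assumption about what `smartctl -i` prints for SATA drives.

```python
import json
import re


def detect_transport(info_text):
    """Guess the transport from `smartctl -i` text; 'UNKNOWN' if neither line is found."""
    if re.search(r"^Transport protocol:\s+SAS", info_text, re.MULTILINE):
        return "SAS"
    # Assumption: SATA drives report a line starting with "SATA Version is:".
    if re.search(r"^SATA Version is:", info_text, re.MULTILINE):
        return "SATA"
    return "UNKNOWN"


def build_lld(disks):
    """Build Zabbix low-level discovery JSON from a {device: info text} mapping.

    The macro names are hypothetical and chosen for this example only.
    """
    rows = [
        {"{#DISKNAME}": name, "{#SMART_TRANSPORT}": detect_transport(text)}
        for name, text in sorted(disks.items())
    ]
    return json.dumps({"data": rows}, indent=2)


SAMPLE_DISKS = {
    "/dev/sdag": "Vendor:               SEAGATE\nTransport protocol:   SAS (SPL-3)\n",
    "/dev/sda": "Device Model:     EXAMPLE DISK\nSATA Version is:  SATA 3.0\n",
}

print(build_lld(SAMPLE_DISKS))
```

With a macro like this, a discovery rule filter could keep SAS-only and SATA-only prototypes in separate rules. That still duplicates the prototypes common to both, so it addresses the unsupported items rather than the duplication, and it is untested against this template.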

v-zhuravlev commented 5 years ago

@yanky83, @sqwit, @colttt, @rkarlsba: I added SAS (and also NVMe) support in the latest version of the template (3.4+) and the discovery scripts. Please test it and tell me what you think. Thank you!

TidalWave123 commented 4 years ago

Hi v-zhuravlev, I think you are doing great work and I love this tool! I am also trying to get the uncorrectable errors from my SAS drives, as those are the best indication of early failure, so our clients don't experience data corruption/sluggish speeds.

However, some of the items for the SAS drives are not working, namely the uncorrectable errors. I see a note about parsing read+write with JS, but I don't know how to do that.

Could you please give me some advice?
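For reference, the uncorrected error totals live in the last column of the "Error counter log" table shown in the `smartctl -x` output earlier in this thread. The Python sketch below shows one way to pull the read and write rows apart; it only illustrates the parsing logic and is not the template's own preprocessing step. The column names are invented here, the sample rows come from the SAS hard disk output above, and the "verify" row is an assumption that does not appear in this thread.

```python
import re

# Column order follows the table header in the output above; names are invented.
COLUMNS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "correction_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

# "verify" is included as an assumption; only read and write rows appear in this thread.
ROW = re.compile(r"^(read|write|verify):[ \t]+(.*)$", re.MULTILINE)


def parse_error_counter_log(text):
    """Return {'read': {...}, 'write': {...}} for rows that have all seven columns."""
    result = {}
    for kind, rest in ROW.findall(text):
        fields = rest.split()
        if len(fields) != len(COLUMNS):
            continue  # unexpected layout, skip rather than guess
        row = {}
        for name, field in zip(COLUMNS, fields):
            if name == "gigabytes_processed":
                # Some locales print a decimal comma (87202,134).
                row[name] = float(field.replace(",", "."))
            else:
                row[name] = int(field)
        result[kind] = row
    return result


SAMPLE = """\
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      56244        0         0     56244          0          0.026           0
write:         0        0         0         0          0          0.187           0
"""

log = parse_error_counter_log(SAMPLE)
print(log["read"]["total_uncorrected"], log["write"]["total_uncorrected"])
```

Summing `total_uncorrected` over the read and write rows gives a single value per drive that an item or trigger could watch.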
