influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Smart can't get some disks status report #5740

Status: Closed (vvershkov closed this 5 years ago)

vvershkov commented 5 years ago

Feature Request

Smart input plugin can't read some disks

Proposal:

Use smartctl -H for disk status

Current behavior:

No information is reported for Hitachi disks at all.

Desired behavior:

At a minimum, I want the overall SMART health status.

The SMART output looks like this:

# smartctl -a /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-46-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721212AL5204
Revision:             C3Q1
Compliance:           SPC-4
User Capacity:        12,000,138,625,024 bytes [12.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2705ad68c
Serial number:        8HHLYPDH
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Apr 17 19:03:32 2019 MSK
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     26 C
Drive Trip Temperature:        85 C

Manufactured in week 35 of year 2018
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  6
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  37
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2887258210304

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0        222      13598.283           0
write:         0        0         0         0        522          9.783           0
verify:        0        0         0         0       1375          0.000           0

Non-medium error count:        0

No self-tests have been logged

And with -H I get the standard output:

# smartctl -H /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-46-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
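
To make the proposal concrete, here is a minimal sketch of reading the overall health from `smartctl -H` output. It covers the two health lines that appear in this thread: the SCSI/SAS form ("SMART Health Status: OK") and the ATA/SATA form ("SMART overall-health self-assessment test result: PASSED"). The function name and the patterns are assumptions made for illustration; this is not the plugin's code.

```python
import re

# Hypothetical patterns for the two health lines seen in this thread.
# ATA/SATA: "SMART overall-health self-assessment test result: PASSED"
# SCSI/SAS: "SMART Health Status: OK"
HEALTH_PATTERNS = (
    re.compile(r"^SMART overall-health self-assessment test result:\s+(\w+)", re.M),
    re.compile(r"^SMART Health Status:\s+(\w+)", re.M),
)


def parse_health_ok(smartctl_output):
    """Return True or False for a recognised health line, None if none is found."""
    for pattern in HEALTH_PATTERNS:
        match = pattern.search(smartctl_output)
        if match:
            return match.group(1) in ("PASSED", "OK")
    return None
```

Returning None when no health line is present keeps "unknown" distinct from "failed", which matters for drives that print neither form.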

glinton commented 5 years ago

What telegraf version are you using?

chrishoage commented 5 years ago

I am also having a problem with a disk not appearing in the telegraf output.

› sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
› sudo smartctl --info --attributes --health -n standby --format=brief /dev/sdg -d scsi
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-145-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LB provisioning type: unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca250ec3c9c
Serial number:        PK1334PEK49SBS
Device type:          disk
Local Time is:        Wed Apr 17 12:21:23 2019 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     34 C
› sudo telegraf --test --input-filter smart
2019-04-17T19:21:38Z I! Starting Telegraf 1.10.2
2019-04-17T19:21:38Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_device,capacity=525112713216,device=sdj,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=16431465A85A,wwn=500a07511465a85a exit_status=0i,health_ok=true,read_error_rate=2i,temp_c=36i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=525112713216,device=sdk,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=1651150FA577,wwn=500a0751150fa577 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sde,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EM0WN624,wwn=50014ee2b51b9d7f exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=30i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sdn,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EECRN58H,wwn=50014ee20a98bd99 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=37i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sdl,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4E4FKJ5DV,wwn=50014ee25fc65114 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=29i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sdb,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EK8ZSK37,wwn=50014ee2b51c8ebd exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=31i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sdm,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4E4FKJH1X,wwn=50014ee20a70d5a0 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=29i,udma_crc_errors=0i 1555528899000000000
> smart_device,capacity=4000787030016,device=sdf,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEJM9B3T,wwn=5000cca250e4f530 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=34i,udma_crc_errors=0i 1555528900000000000
> smart_device,capacity=4000787030016,device=sdc,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEK4AXTT,wwn=5000cca250ec4105 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=34i,udma_crc_errors=0i 1555528900000000000
> smart_device,capacity=4000787030016,device=sda,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDXVTS,wwn=5000cca250f02751 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=32i,udma_crc_errors=0i 1555528900000000000
> smart_device,capacity=4000787030016,device=sdh,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEJLL6NS,wwn=5000cca250e4a210 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=31i,udma_crc_errors=0i 1555528900000000000
> smart_device,capacity=4000787030016,device=sdd,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDNZ0S,wwn=5000cca250f009ad exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=35i,udma_crc_errors=0i 1555528900000000000
> smart_device,capacity=32017047552,device=sdi,enabled=Enabled,host=cortex,model=SATA\ SSD,serial_no=AF3407621C2400203590 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=30i 1555528903000000000

Note that /dev/sdg is listed by smartctl --scan and reports data with sudo smartctl --info --attributes --health -n standby --format=brief /dev/sdg -d scsi, but it does not appear in the output of sudo telegraf --test --input-filter smart (Telegraf 1.10.2, git: HEAD 3303f5c3).
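
With 14 devices, spotting the missing one by eye is error-prone. The comparison above can be sketched as a small script that diffs the device names from `smartctl --scan` against the `device=` tags in the `telegraf --test` output. The function names are hypothetical, and the parsing assumes the two output shapes pasted in this comment.

```python
import re


def scanned_devices(scan_output):
    """Device names (for example 'sdg') from `smartctl --scan` lines."""
    names = []
    for line in scan_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            names.append(fields[0][len("/dev/"):])
    return names


def reported_devices(telegraf_output):
    """Values of the device= tag on smart_device lines of `telegraf --test` output."""
    names = set()
    for line in telegraf_output.splitlines():
        line = line.lstrip("> ")  # drop the leading "> " marker, if any
        if line.startswith("smart_device,"):
            match = re.search(r",device=([^,\s]+)", line)
            if match:
                names.add(match.group(1))
    return names


def missing_devices(scan_output, telegraf_output):
    """Scanned devices that never show up as a smart_device metric."""
    reported = reported_devices(telegraf_output)
    return [name for name in scanned_devices(scan_output) if name not in reported]
```

Fed the two outputs above, this kind of check would be expected to flag sdg, the one scanned device with no smart_device line.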

ddimick commented 5 years ago

In my environment, I'm seeing this behavior specifically with SAS drives. SATA drives on the same HBA are fine.

glinton commented 5 years ago

Can you try this linux amd64 build and run it with --debug? I'd love to find where the failure is. Thanks!

ddimick commented 5 years ago

Telegraf unknown (git: bugfix/5740 85b8a490)

2019-04-17T22:01:28Z I! Starting Telegraf
2019-04-17T22:01:28Z I! Using config file: /etc/telegraf/telegraf.conf
2019-04-17T22:01:28Z D! [inputs.smart] adding device: []string{"/dev/sda", "-d", "scsi", "#", "/dev/sda,", "SCSI", "device"}
2019-04-17T22:01:28Z D! [inputs.smart] adding device: []string{"/dev/sdb", "-d", "scsi", "#", "/dev/sdb,", "SCSI", "device"}
2019-04-17T22:01:28Z D! [inputs.smart] adding device: []string{"/dev/sdc", "-d", "scsi", "#", "/dev/sdc,", "SCSI", "device"}
2019-04-17T22:01:28Z D! [inputs.smart] skipping device: []string{""}
2019-04-17T22:01:28Z D! [inputs.smart] devices: []string{"/dev/sda", "/dev/sdb", "/dev/sdc"}
2019-04-17T22:01:28Z D! [inputs.smart] gatherDisk '/dev/sdb' output: "smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-12-pve] (local build)\nCopyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org\n\n=== START OF INFORMATION SECTION ===\nVendor:               HITACHI\nProduct:              HUC103030CSS600\nRevision:             J350\nCompliance:           SPC-4\nUser Capacity:        300,000,000,000 bytes [300 GB]\nLogical block size:   512 bytes\nRotation Rate:        10020 rpm\nForm Factor:          2.5 inches\nLogical Unit id:      0x5000cca00a4f91bc\nSerial number:        PDWDSKNE\nDevice type:          disk\nTransport protocol:   SAS (SPL-3)\nLocal Time is:        Wed Apr 17 15:01:28 2019 PDT\nSMART support is:     Available - device has SMART capability.\nSMART support is:     Enabled\nTemperature Warning:  Disabled or Not Supported\n\n=== START OF READ SMART DATA SECTION ===\nSMART Health Status: OK\n\nCurrent Drive Temperature:     35 C\nDrive Trip Temperature:        85 C\n\nManufactured in week 52 of year 2009\nSpecified cycle count over device lifetime:  50000\nAccumulated start-stop cycles:  47\nElements in grown defect list: 0\n\nVendor (Seagate) cache information\n  Blocks sent to initiator = 7601969522802688\n\n"
> smart_device,capacity=300000000000,device=sdb,enabled=Enabled,host=pve-1 exit_status=0i 1555538488000000000
2019-04-17T22:01:28Z D! [inputs.smart] gatherDisk '/dev/sda' output: "smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-12-pve] (local build)\nCopyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org\n\n=== START OF INFORMATION SECTION ===\nVendor:               HITACHI\nProduct:              HUC103030CSS600\nRevision:             J350\nCompliance:           SPC-4\nUser Capacity:        300,000,000,000 bytes [300 GB]\nLogical block size:   512 bytes\nRotation Rate:        10020 rpm\nForm Factor:          2.5 inches\nLogical Unit id:      0x5000cca00a4bdbc8\nSerial number:        PDWAR9GE\nDevice type:          disk\nTransport protocol:   SAS (SPL-3)\nLocal Time is:        Wed Apr 17 15:01:28 2019 PDT\nSMART support is:     Available - device has SMART capability.\nSMART support is:     Enabled\nTemperature Warning:  Disabled or Not Supported\n\n=== START OF READ SMART DATA SECTION ===\nSMART Health Status: OK\n\nCurrent Drive Temperature:     36 C\nDrive Trip Temperature:        85 C\n\nManufactured in week 52 of year 2009\nSpecified cycle count over device lifetime:  50000\nAccumulated start-stop cycles:  47\nElements in grown defect list: 0\n\nVendor (Seagate) cache information\n  Blocks sent to initiator = 7270983270400000\n\n"
> smart_device,capacity=300000000000,device=sda,enabled=Enabled,host=pve-1 exit_status=0i 1555538488000000000
2019-04-17T22:01:28Z D! [inputs.smart] gatherDisk '/dev/sdc' output: "smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-12-pve] (local build)\nCopyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org\n\n=== START OF INFORMATION SECTION ===\nModel Family:     Samsung based SSDs\nDevice Model:     Samsung SSD 850 PRO 256GB\nSerial Number:    S39KNX0J718036J\nLU WWN Device Id: 5 002538 d4218a3df\nFirmware Version: EXM04B6Q\nUser Capacity:    256,060,514,304 bytes [256 GB]\nSector Size:      512 bytes logical/physical\nRotation Rate:    Solid State Device\nForm Factor:      2.5 inches\nDevice is:        In smartctl database [for details use: -P show]\nATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c\nSATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)\nLocal Time is:    Wed Apr 17 15:01:28 2019 PDT\nSMART support is: Available - device has SMART capability.\nSMART support is: Enabled\nPower mode is:    ACTIVE or IDLE\n\n=== START OF READ SMART DATA SECTION ===\nSMART overall-health self-assessment test result: PASSED\n\nSMART Attributes Data Structure revision number: 1\nVendor Specific SMART Attributes with Thresholds:\nID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE\n  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0\n  9 Power_On_Hours          -O--CK   097   097   000    -    14738\n 12 Power_Cycle_Count       -O--CK   099   099   000    -    47\n177 Wear_Leveling_Count     PO--C-   086   086   000    -    893\n179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0\n181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0\n182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0\n183 Runtime_Bad_Block       PO--C-   100   100   010    -    0\n187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0\n190 Airflow_Temperature_Cel -O--CK   069   052   000    -    31\n195 ECC_Error_Rate          -O-RC-   200   200   000    -    0\n199 CRC_Error_Count         -OSRCK   100   100   000    -    0\n235 POR_Recovery_Count      -O--C-   099   099   000    -    35\n241 Total_LBAs_Written      -O--CK   099   099   000    -    36452846103\n                            ||||||_ K auto-keep\n                            |||||__ C event count\n                            ||||___ R error rate\n                            |||____ S speed/performance\n                            ||_____ O updated online\n                            |______ P prefailure warning\n\n"
> smart_attribute,device=sdc,fail=-,flags=PO--CK,host=pve-1,id=5,name=Reallocated_Sector_Ct,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=10i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=9,name=Power_On_Hours,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=14738i,threshold=0i,value=97i,worst=97i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=12,name=Power_Cycle_Count,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=47i,threshold=0i,value=99i,worst=99i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=PO--C-,host=pve-1,id=177,name=Wear_Leveling_Count,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=893i,threshold=0i,value=86i,worst=86i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=PO--C-,host=pve-1,id=179,name=Used_Rsvd_Blk_Cnt_Tot,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=10i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=181,name=Program_Fail_Cnt_Total,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=10i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=182,name=Erase_Fail_Count_Total,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=10i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=PO--C-,host=pve-1,id=183,name=Runtime_Bad_Block,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=10i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=187,name=Uncorrectable_Error_Cnt,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=0i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=190,name=Airflow_Temperature_Cel,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=31i,threshold=0i,value=69i,worst=52i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O-RC-,host=pve-1,id=195,name=ECC_Error_Rate,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=0i,value=200i,worst=200i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-OSRCK,host=pve-1,id=199,name=CRC_Error_Count,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=0i,threshold=0i,value=100i,worst=100i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--C-,host=pve-1,id=235,name=POR_Recovery_Count,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=35i,threshold=0i,value=99i,worst=99i 1555538488000000000
> smart_attribute,device=sdc,fail=-,flags=-O--CK,host=pve-1,id=241,name=Total_LBAs_Written,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,raw_value=36452846103i,threshold=0i,value=99i,worst=99i 1555538488000000000
> smart_device,capacity=256060514304,device=sdc,enabled=Enabled,host=pve-1,model=Samsung\ SSD\ 850\ PRO\ 256GB,serial_no=S39KNX0J718036J,wwn=5002538d4218a3df exit_status=0i,health_ok=true,udma_crc_errors=0i 1555538488000000000
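
One thing stands out in the debug output above: the two SAS drives produce a smart_device line with only capacity, device, and enabled tags and no health_ok field, while the SATA SSD gets model, serial_no, wwn, and health_ok. The SAS output labels those values differently ("Product:", "Serial number:" with a lowercase n, "SMART Health Status:") than the ATA output ("Device Model:", "Serial Number:"). One possible explanation, not confirmed in this thread, is that the matching is specific to the ATA labels. The sketch below only illustrates that effect; the patterns are assumptions and are not copied from the plugin source.

```python
import re

# Illustrative patterns only; they are NOT taken from the Telegraf plugin.
ATA_ONLY = {
    "model": re.compile(r"^Device Model:\s+(.*)$", re.M),
    "serial_no": re.compile(r"^Serial Number:\s+(.*)$", re.M),
}
# A more tolerant variant that also accepts the SCSI/SAS labels.
ATA_AND_SCSI = {
    "model": re.compile(r"^(?:Device Model|Product):\s+(.*)$", re.M),
    "serial_no": re.compile(r"^Serial [Nn]umber:\s+(.*)$", re.M),
}


def extract_tags(smartctl_output, patterns):
    """Return the tags whose pattern matches somewhere in the output."""
    tags = {}
    for name, pattern in patterns.items():
        match = pattern.search(smartctl_output)
        if match:
            tags[name] = match.group(1).strip()
    return tags
```

Run against the SAS output for /dev/sdb, the ATA-only patterns find nothing, whereas the tolerant ones pick up the product and serial number.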

glinton commented 5 years ago

@chrishoage Can you paste the output of

sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sdg

@vvershkov Can you paste the output of the same, but with /dev/sdc instead of /dev/sdg?

Thanks @ddimick. I assume a and b are your SAS drives?

chrishoage commented 5 years ago

› sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sdg
[sudo] password for chris:
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-145-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PEK49SBS
LU WWN Device Id: 5 000cca 250ec3c9c
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 17 15:14:27 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   135   135   054    -    84
  3 Spin_Up_Time            POS---   125   125   024    -    621 (Average 619)
  4 Start_Stop_Count        -O--C-   100   100   000    -    33
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   119   119   020    -    35
  9 Power_On_Hours          -O--C-   098   098   000    -    19371
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    33
192 Power-Off_Retract_Count -O--CK   100   100   000    -    764
193 Load_Cycle_Count        -O--C-   100   100   000    -    764
194 Temperature_Celsius     -O----   176   176   000    -    34 (Min/Max 21/53)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

ddimick commented 5 years ago

I assume a and b are your SAS drives?

Yes, that's correct.

glinton commented 5 years ago

Thanks. @chrishoage, can you also paste the output of the same command, but with a disk that is being collected (anything other than /dev/sdg)?

chrishoage commented 5 years ago

› sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sdh
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-145-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PEJLL6NS
LU WWN Device Id: 5 000cca 250e4a210
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 17 16:27:58 2019 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   136   136   054    -    83
  3 Spin_Up_Time            POS---   125   125   024    -    621 (Average 617)
  4 Start_Stop_Count        -O--C-   100   100   000    -    28
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   124   124   020    -    33
  9 Power_On_Hours          -O--C-   098   098   000    -    19322
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    28
192 Power-Off_Retract_Count -O--CK   100   100   000    -    30
193 Load_Cycle_Count        -O--C-   100   100   000    -    30
194 Temperature_Celsius     -O----   187   187   000    -    32 (Min/Max 23/55)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

vvershkov commented 5 years ago

Hm, I hadn't thought about it, but yes, those are SAS drives.

My telegraf version is 1.10.0-1, but I can update it to 1.10.3 (I am using Ubuntu 18.04 and the influxdata repo).

smartctl output:

# smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sdg
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-46-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721212AL5204
Revision:             C3Q1
Compliance:           SPC-4
User Capacity:        12,000,138,625,024 bytes [12.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca27076bfe8
Serial number:        8HJ39K3H
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Apr 18 13:25:03 2019 MSK
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     34 C
Drive Trip Temperature:        85 C

Manufactured in week 35 of year 2018
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  7
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  39
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 544135446528
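
The SAS output above has no attribute table, but it still carries a few numeric values as plain "label: value" lines. A minimal sketch of pulling those out follows. `temp_c` mirrors the field name already visible in the telegraf output earlier in this thread; the other field names and the function name are made up for illustration.

```python
import re

# Labels as printed in the SAS output above. Field names other than temp_c
# are hypothetical.
SCSI_FIELDS = {
    "temp_c": re.compile(r"^Current Drive Temperature:\s+(\d+)\s+C", re.M),
    "start_stop_cycles": re.compile(r"^Accumulated start-stop cycles:\s+(\d+)", re.M),
    "load_unload_cycles": re.compile(r"^Accumulated load-unload cycles:\s+(\d+)", re.M),
    "grown_defects": re.compile(r"^Elements in grown defect list:\s+(\d+)", re.M),
}


def parse_scsi_fields(smartctl_output):
    """Collect the integer values whose label appears in the output."""
    fields = {}
    for name, pattern in SCSI_FIELDS.items():
        match = pattern.search(smartctl_output)
        if match:
            fields[name] = int(match.group(1))
    return fields
```

Labels that a given drive does not print are simply left out of the result, so older SAS drives without load-unload counters still parse cleanly.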

(The output is the same for sdc; the host has 60 such drives, from sdc to sdbj.) sda and sdb are SATA drives, and I can get their status via telegraf.

# smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-46-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Travelstar Z7K500
Device Model:     HGST HTE725050A7E630
Serial Number:    RCE50G20G81S9S
LU WWN Device Id: 5 000cca 90bc3a98b
Firmware Version: GS2OA3E0
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 18 13:27:51 2019 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   062    -    0
  2 Throughput_Performance  P-S---   100   100   040    -    0
  3 Spin_Up_Time            POS---   100   100   033    -    1
  4 Start_Stop_Count        -O--C-   100   100   000    -    4
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   100   100   040    -    0
  9 Power_On_Hours          -O--C-   099   099   000    -    743
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    4
191 G-Sense_Error_Rate      -O-R--   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    2
193 Load_Cycle_Count        -O--C-   100   100   000    -    13
194 Temperature_Celsius     -O----   250   250   000    -    24 (Min/Max 15/29)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
223 Load_Retry_Count        -O-R--   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning