influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
MIT License

Attributes flag does not have expected function #6619

Closed: ryan-peck closed this issue 4 years ago

ryan-peck commented 4 years ago

Relevant telegraf.conf:

  use_sudo = true

System info:

Telegraf version 1.12.4, CentOS 7, smartctl 7.0

Steps to reproduce:

  1. Run telegraf with attributes flag set to false
  2. Observe that NVMe devices do not record temperature, while non-NVMe devices do, even though the non-NVMe devices report it only in the vendor-specific attributes section
  3. Run telegraf with attributes flag set to true
  4. Observe that NVMe and non-NVMe devices now record temperature

Expected behavior:

smartctl --info --health --attributes --tolerance=verypermissive --nocheck standby --format=brief /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-862.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke,

Model Number:                       SAMSUNG MZQLB3T8HALS-00007
Serial Number:                      S438NF0M304843
Firmware Version:                   EDA5202Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,840,755,982,336 [3.84 TB]
Namespace 1 Utilization:            60,272,201,728 [60.2 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue Nov  5 17:01:43 2019 PST

SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius   

^The above line should be recorded with attributes set to false

Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    7,994,886 [4.09 TB]
Data Units Written:                 333,054 [170 GB]
Host Read Commands:                 17,607,817
Host Write Commands:                1,411,082
Controller Busy Time:               44
Power Cycles:                       52
Power On Hours:                     506
Unsafe Shutdowns:                   34
Media and Data Integrity Errors:    0
Error Information Log Entries:      5
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               31 Celsius
Temperature Sensor 3:               36 Celsius

smartctl --info --health --attributes --tolerance=verypermissive --nocheck standby --format=brief /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-862.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke,

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM3T8HMLP-00005
Serial Number:    S2TYNX0J702931
LU WWN Device Id: 5 002538 c406fe884
Firmware Version: GXT5404Q
User Capacity:    3,840,755,982,336 bytes [3.84 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Nov  5 16:49:04 2019 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode was:   IDLE

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   098   098   000    -    5758
 12 Power_Cycle_Count       -O--CK   098   098   000    -    1487
177 Wear_Leveling_Count     PO--C-   099   099   005    -    62
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
180 Unused_Rsvd_Blk_Cnt_Tot PO--C-   100   100   010    -    13078
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
184 End-to-End_Error        PO--CK   100   100   097    -    0
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   073   046   000    -    27
194 Temperature_Celsius     -O---K   073   046   000    -    27 (Min/Max 20/54) 

^The above line should not record temperature with attributes set to false

195 ECC_Error_Rate          -O-RC-   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0
199 CRC_Error_Count         -OSRCK   099   099   000    -    4
202 Exception_Mode_Status   PO--CK   100   100   010    -    0
235 POR_Recovery_Count      -O--C-   099   099   000    -    1474
241 Total_LBAs_Written      -O--CK   099   099   000    -    264081103488
242 Total_LBAs_Read         -O--CK   099   099   000    -    210403669236
243 SATA_Downshift_Ct       -O--CK   100   100   000    -    0
244 Thermal_Throttle_St     -O--CK   100   100   000    -    0
245 Timed_Workld_Media_Wear -O--CK   100   100   000    -    65535
246 Timed_Workld_RdWr_Ratio -O--CK   100   100   000    -    65535
247 Timed_Workld_Timer      -O--CK   100   100   000    -    65535
251 NAND_Writes             -O--CK   100   100   000    -    531474149184
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Actual behavior:

The opposite. When attributes is false, the non-NVMe temperature is recorded from the attributes section, while the NVMe temperature is not. When attributes is true, all temperatures are recorded.

Additional info:

I believe the issue comes from a misplaced if collectAttributes check in smart.go. I also believe that smart_test.go should be amended to check not only that all required fields are present, but also that all fields that should be excluded are absent.
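
To make the intended behavior concrete, here is a minimal standalone Go sketch. It is not the actual smart.go code: the regular expressions, the parseTemperature and checkFields helpers, and the use of temp_c as the field key are all assumptions made for illustration. It records the NVMe health-log temperature unconditionally, reads attribute 194 only when collectAttributes is true, and includes the kind of check that also flags fields which should have been excluded.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Hypothetical patterns for this sketch; they are not copied from smart.go.
var (
	// Matches the NVMe health-log line, e.g. "Temperature:    27 Celsius".
	// It does not match "Temperature Sensor 1: ..." because the colon must
	// directly follow the word "Temperature".
	nvmeTempLine = regexp.MustCompile(`^Temperature:\s+(\d+)\s+Celsius`)
	// Matches attribute 194 in --format=brief output and captures the raw value.
	ataTempLine = regexp.MustCompile(`^\s*194\s+Temperature_Celsius\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)`)
)

// parseTemperature returns the fields a device-level metric would carry.
// The health-log temperature is always recorded; the attribute-table
// temperature is recorded only when collectAttributes is true.
func parseTemperature(output string, collectAttributes bool) map[string]int64 {
	fields := map[string]int64{}
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := scanner.Text()
		if m := nvmeTempLine.FindStringSubmatch(line); m != nil {
			if v, err := strconv.ParseInt(m[1], 10, 64); err == nil {
				fields["temp_c"] = v // assumed field key
			}
			continue
		}
		// Everything below this gate belongs to the attributes section.
		if !collectAttributes {
			continue
		}
		if m := ataTempLine.FindStringSubmatch(line); m != nil {
			if v, err := strconv.ParseInt(m[1], 10, 64); err == nil {
				fields["temp_c"] = v
			}
		}
	}
	return fields
}

// checkFields reports required keys that are absent and excluded keys that
// are present, so a test can fail on either kind of mistake.
func checkFields(fields map[string]int64, required, excluded []string) []string {
	var problems []string
	for _, k := range required {
		if _, ok := fields[k]; !ok {
			problems = append(problems, "missing required field: "+k)
		}
	}
	for _, k := range excluded {
		if _, ok := fields[k]; ok {
			problems = append(problems, "unexpected field present: "+k)
		}
	}
	return problems
}

func main() {
	nvme := "Critical Warning:                   0x00\nTemperature:                        27 Celsius\n"
	ata := "194 Temperature_Celsius     -O---K   073   046   000    -    27 (Min/Max 20/54)\n"

	fmt.Println("nvme, attributes=false:", parseTemperature(nvme, false))
	fmt.Println("ata,  attributes=false:", parseTemperature(ata, false))
	fmt.Println("ata,  attributes=true: ", parseTemperature(ata, true))
	fmt.Println("problems:", checkFields(parseTemperature(ata, false), nil, []string{"temp_c"}))
}
```

The two sample strings are taken from the smartctl output above. In a real test, the excluded list passed to checkFields would name every field that must not appear when attributes is false.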

danielnelson commented 4 years ago

Thanks for the detailed report. I think you are right about what needs to be done. Any chance you could make your changes and open a pull request?

ryan-peck commented 4 years ago

Sure, I can do that.