v-zhuravlev / zbx-smartctl

Templates and scripts for monitoring disks health with Zabbix and smartmontools
https://share.zabbix.com/storage-devices/smartmontools/smart-monitoring-with-smartmontools-lld
GNU General Public License v3.0

NVMe Wearout Incorrect #129

Open reedacus25 opened 4 years ago

reedacus25 commented 4 years ago

For NVMe, the 'Available Spare:' value is used, where 100% indicates a new drive and 0% indicates that 100% of the expected lifetime has been used.

I think this value should be replaced with the "Percentage Used" value, or at least presented as a separate value.

I have multiple Samsung, Micron, and Intel NVMe disks, and all report 100% Available Spare, but all have varying levels of Percentage Used, which appears to be a more accurate measurement. For the sake of data points, 43 of 43 drives I monitor report 100% Available Spare, but some disks report over 10% of Percentage Used.

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEDME400G4
Firmware Version:                   8DV10171
PCI Vendor/Subsystem ID:            0x8086
Namespace 1 Size/Capacity:          400,088,457,216 [400 GB]

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        50 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    12%
Data Units Read:                    35,239,374 [18.0 TB]
Data Units Written:                 682,594,316 [349 TB]
Host Read Commands:                 1,594,912,009
Host Write Commands:                55,835,107,695
Controller Busy Time:               3,690
Power Cycles:                       22
Power On Hours:                     32,708
Unsafe Shutdowns:                   15
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

It is reporting 349 TB written of the rated 2.19 PBW, which equates to roughly 16% of the quoted write endurance.

So while the Percentage Used value is under-reporting, it is still a great deal more accurate than the Available Spare value reported at 100%.
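
The arithmetic behind that comparison can be checked with a short sketch. It assumes the NVMe convention that one data unit is 1,000 blocks of 512 bytes (which matches the 349 TB that smartctl prints) and takes the 2.19 PBW rating from the text above. The function names are illustrative only.

```javascript
// One NVMe "data unit" is 1000 * 512 bytes (assumption stated in the lead-in).
var BYTES_PER_DATA_UNIT = 1000 * 512;

// Convert the raw "Data Units Written" counter to decimal terabytes.
function dataUnitsToTB(dataUnits) {
  return (dataUnits * BYTES_PER_DATA_UNIT) / 1e12;
}

// Share of the rated write endurance consumed, as a percentage.
function enduranceUsedPercent(writtenTB, ratedPBW) {
  return (writtenTB / (ratedPBW * 1000)) * 100;
}

var writtenTB = dataUnitsToTB(682594316);               // counter from the output above
var usedPercent = enduranceUsedPercent(writtenTB, 2.19); // rated PBW quoted above
console.log(writtenTB.toFixed(1) + " TB written, " +
            usedPercent.toFixed(1) + "% of rated endurance");
```

By this estimate the drive has consumed a few points more of its rated endurance than the 12% shown as Percentage Used, while Available Spare still reads 100%.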

Additional Smart Log for NVME device:nvme0 namespace-id:ffffffff
key                               normalized raw
program_fail_count              : 100%       0
erase_fail_count                : 100%       0
wear_leveling                   :  88%       min: 552, max: 1466, avg: 1011
end_to_end_error_detection_count: 100%       0
crc_error_count                 : 100%       0
timed_workload_media_wear       : 100%       12.140%
timed_workload_host_reads       : 100%       4%
timed_workload_timer            : 100%       1962358 min
thermal_throttle_status         : 100%       0%, cnt: 0
retry_buffer_overflow_count     : 100%       0
pll_lock_loss_count             : 100%       0
nand_bytes_written              : 100%       sectors: 16490505
host_bytes_written              : 100%       sectors: 10415598

Additionally, looking at the smart-log-add values reported by nvme-cli, the wear leveling value corresponds to 100% - $percentage_used, which is what you would expect.

nerijus commented 4 years ago

Same here. SMART became FAILED when Percentage Used: became 100%.

# smartctl -i /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NX0M418376
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512.110.190.592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512.110.190.592 [512 GB]
Namespace 1 Utilization:            511.946.563.584 [511 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8491b7f23d
Local Time is:                      Mon Jun 15 07:44:20 2020 UTC

# smartctl -H /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

#  smartctl -A /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        49 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    100%
Data Units Read:                    67.818.793 [34,7 TB]
Data Units Written:                 250.587.406 [128 TB]
Host Read Commands:                 757.189.114
Host Write Commands:                5.249.091.070
Controller Busy Time:               4.443.556.843
Power Cycles:                       8
Power On Hours:                     9.432
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      5
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               49 Celsius
Temperature Sensor 2:               69 Celsius

On another server after about a year of usage:

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    31%
Data Units Read:                    58.706.843 [30,0 TB]
Data Units Written:                 214.149.992 [109 TB]
Host Read Commands:                 982.074.335
Host Write Commands:                4.163.990.216
Controller Busy Time:               62.481
Power Cycles:                       16
Power On Hours:                     13.121
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      24
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               61 Celsius

As you can see, Available Spare is 100% in both cases, but Percentage Used is not.
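
To compare the two fields side by side, the health log can be parsed for both. This is a minimal sketch, not the template's code: the names `readPercent` and `parseNvmeHealth` and the returned object shape are made up for illustration, and the sketch relies only on the field labels visible in the smartctl output above.

```javascript
// Extract a percentage field such as "Percentage Used:   31%" from smartctl -A text.
// The labels used here contain no regex metacharacters, so they are inserted as-is.
function readPercent(text, label) {
  var re = new RegExp("^" + label + "\\s+(\\d+)%", "m");
  var m = re.exec(text);
  return m === null ? null : Number(m[1]);
}

// Hypothetical helper: collect the three wear-related NVMe fields.
function parseNvmeHealth(text) {
  return {
    availableSpare: readPercent(text, "Available Spare:"),
    spareThreshold: readPercent(text, "Available Spare Threshold:"),
    percentageUsed: readPercent(text, "Percentage Used:")
  };
}

var sample = [
  "Available Spare:                    100%",
  "Available Spare Threshold:          10%",
  "Percentage Used:                    31%"
].join("\n");

var health = parseNvmeHealth(sample);
console.log(health);
```

Keeping the fields separate lets a trigger be defined on each of them independently, which is what the original report asks for.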

nerijus commented 4 years ago

The following changes make wearout detection work.

Preprocessing steps:

1: Regular expression:

((?:(?:177 Wear_Leveling_Count|202 Percent_Lifetime_Used|202 Percent_Lifetime_Remain|202 Unknown_SSD_Attribute|230 Media_Wearout_Indicator|233 Media_Wearout_Indicator|231 SSD_Life_Left) +0x[0-9a-z]+|Percentage Used:)) +([0-9]+)
\1|\2

2: JavaScript:

return (value.split("|")[0] == "Percentage Used:" ? 100-value.split("|")[1] : value.split("|")[1]);
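
The two preprocessing steps above can be tried outside Zabbix with a sketch like the following. It reuses the regular expression from step 1 and the conversion from step 2, but the wrapper function `wearoutRemaining` is hypothetical. Clamping at 0 is an added assumption, not part of the original snippet, so that a drive reporting more than 100% used does not produce a negative value.

```javascript
// ATA attribute names taken from the regular expression in step 1.
var ATA_ATTRS = [
  "177 Wear_Leveling_Count",
  "202 Percent_Lifetime_Used",
  "202 Percent_Lifetime_Remain",
  "202 Unknown_SSD_Attribute",
  "230 Media_Wearout_Indicator",
  "233 Media_Wearout_Indicator",
  "231 SSD_Life_Left"
];

// Group 1 is the attribute label, group 2 is the numeric value that follows it.
var WEAR_RE = new RegExp(
  "((?:(?:" + ATA_ATTRS.join("|") + ") +0x[0-9a-z]+|Percentage Used:)) +([0-9]+)"
);

// Hypothetical wrapper: returns remaining life in percent, or null if nothing matched.
function wearoutRemaining(smartctlOutput) {
  var m = WEAR_RE.exec(smartctlOutput);
  if (m === null) return null;
  var value = Number(m[2]);
  if (m[1] === "Percentage Used:") {
    // NVMe reports life used, so invert it; clamp because the value can exceed 100.
    return Math.max(0, 100 - value);
  }
  return value; // ATA attributes are passed through unchanged, as in step 2
}

console.log(wearoutRemaining("Percentage Used:                    12%"));
console.log(wearoutRemaining("Percentage Used:                    104%"));
```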

nerijus commented 4 years ago

Power on hours is also incorrect. For an old drive (about a year of usage) Zabbix displays "Power on hours: 9h", but smartctl -A /dev/nvme0 reports:

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    104%
Data Units Read:                    69.752.883 [35,7 TB]
Data Units Written:                 255.405.248 [130 TB]
Host Read Commands:                 768.708.938
Host Write Commands:                5.324.089.476
Controller Busy Time:               4.452.651.549
Power Cycles:                       9
Power On Hours:                     9.602
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      6
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               58 Celsius

It should be 9602 hours, because 9602/24=400 days.

For a week-old drive it shows "Power on hours: 6d 19h", and smartctl -A /dev/nvme1 reports:

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    968.434 [495 GB]
Data Units Written:                 5.568.552 [2,85 TB]
Host Read Commands:                 10.090.298
Host Write Commands:                72.233.886
Controller Busy Time:               1.291
Power Cycles:                       2
Power On Hours:                     163
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               82 Celsius

It should be 163 hours - 163/24=6.79 days, which seems OK.
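
One plausible explanation for the 9h reading, sketched below, is a capture group that accepts digits only and therefore stops at the thousands separator. The pattern `DIGITS_ONLY_RE` is an assumption made for illustration and is not copied from the template.

```javascript
// Hypothetical digits-only pattern (an assumption, not the template's actual regex).
var DIGITS_ONLY_RE = /Power On Hours:\s+(\d+)/;

// Returns the captured number of hours, or null if the line does not match.
function naivePowerOnHours(line) {
  var m = DIGITS_ONLY_RE.exec(line);
  return m === null ? null : Number(m[1]);
}

// "9.602" is cut off at the separator, while "163" has none and survives intact.
var oldDrive = naivePowerOnHours("Power On Hours:                     9.602");
var newDrive = naivePowerOnHours("Power On Hours:                     163");
console.log(oldDrive, newDrive);
```

Under that assumption the old drive yields 9 and the new drive yields 163, which would match the 9h and 6d 19h values displayed by Zabbix.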

makarov-e-v commented 2 years ago

I think that is a wrong assumption. For the template's SSD wearout item, the lower the value, the worse the drive's condition. Percentage Used works the other way round: a value of 0 indicates a brand new drive. Available Spare in this template is more like a Reallocated Sector count. I ended up writing a new item and triggers.

makarov-e-v commented 2 years ago

For NVMe drives smartctl returns power-on hours with a comma, so I changed the regexp that processes the smartctl output. It now looks like this:

(?:Power_On_Hours.+ |Accumulated power on time, hours:minutes |Power On Hours:.+ )(\d*[,.]?\d*)

Then I added a preprocessing step that replaces the comma with nothing. That did the trick.
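
The approach described in this comment can be sketched as follows. The regular expression is the one quoted in the comment; the function name `powerOnHours` is hypothetical. Stripping periods as well as commas is an added assumption, because the smartctl output earlier in this thread uses both as thousands separators.

```javascript
// Regular expression quoted in the comment; group 1 may contain one separator.
var POWER_ON_RE = /(?:Power_On_Hours.+ |Accumulated power on time, hours:minutes |Power On Hours:.+ )(\d*[,.]?\d*)/;

// Hypothetical helper: returns power-on hours as a number, or null if not found.
function powerOnHours(text) {
  var m = POWER_ON_RE.exec(text);
  if (m === null || m[1] === "") return null;
  // Second step: drop the thousands separator (comma or period) before converting.
  return Number(m[1].replace(/[,.]/g, ""));
}

console.log(powerOnHours("Power On Hours:                     9.602"));
console.log(powerOnHours("Power On Hours:                     32,708"));
console.log(powerOnHours("Power On Hours:                     163"));
```

Note that the capture group allows only a single separator, so this sketch covers values up to 999,999 hours.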

nerijus commented 2 years ago

Could you please create a PR or show the diffs you made here?

makarov-e-v commented 2 years ago

I don't think I can create a PR, but I can attach the edited template file: zbx_export_templates.tar.gz. Hope it helps.