nobody43 / zabbix-smartmontools

Disk SMART monitoring for Linux, FreeBSD and Windows. LLD, trapper.
The Unlicense

Quick question about Number of Non-Medium errors has changed #38

Closed killmasta93 closed 4 years ago

killmasta93 commented 4 years ago

Describe the problem: Hi, today I got an alert about non-medium errors. The odd thing is that the alert fired for all the disks.

sdm: Number of Non-Medium errors has changed within past 5 days on prometheus3hagroup 1h 59m 39s No 1 action

To Reproduce

root@prometheus3:~# smartctl -a /dev/sdm
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-12-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              EH0600JEDHE
Revision:             HPD4
Compliance:           SPC-4
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        15052 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c50097264bcf
Serial number:        S7M131FY0000K640607R
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Apr  1 13:00:46 2020 -05
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        60 C

Manufactured in week 17 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  49
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1092
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2011114913
  Blocks received from initiator = 1173895515
  Blocks read from cache and sent to initiator = 3628945623
  Number of read and write commands whose size <= segment size = 2300688227
  Number of read and write commands whose size > segment size = 1007936

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 25380.35
  number of minutes until next internal SMART test = 13

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     440834.342           0
write:         0        0         0         0          0      75481.238           0

Non-medium error count:      196

No self-tests have been logged

nobody43 commented 4 years ago

Is it a new installation?

Triggers that contain delta(5d)>0 and last()>0 fire on any change within the past 5 days unless the last value is zero. For example, when a disk is replaced with one that reports zero, the trigger will not fire, but if the new value is lower or higher than the old one and nonzero, it will. Therefore, replacing a faulty drive with another faulty one will still trigger a problem that stays for 5 days (the default).
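
The trigger condition described above can be sketched in a few lines of Python. This is only an illustration, not the template's actual code: it assumes that Zabbix's delta() is the maximum minus the minimum of the values in the evaluation window, and the function name trigger_fires and the sample values are hypothetical.

```python
def trigger_fires(values_in_window):
    """Approximate `delta(5d)>0 and last()>0` for one item.

    values_in_window: samples collected during the past 5 days, oldest first.
    Assumption: delta() is max minus min over the window.
    """
    if not values_in_window:
        return False  # no data, nothing to evaluate
    delta = max(values_in_window) - min(values_in_window)
    last = values_in_window[-1]
    return delta > 0 and last > 0


if __name__ == "__main__":
    # Counter grew by one: the trigger fires.
    print(trigger_fires([178, 178, 179]))
    # Disk replaced with a clean one (last value is zero): no alert.
    print(trigger_fires([178, 0]))
    # Disk replaced with another disk that already has errors: alert.
    print(trigger_fires([178, 42]))
```

Because the window is 5 days long, a single change keeps the condition true until the older value drops out of the window, which matches the problem staying open for 5 days.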

killmasta93 commented 4 years ago

Thanks for the reply. What happened was that the server rebooted today, but no disks were changed.

nobody43 commented 4 years ago

Please show me the history of one of the disks (Latest data, Non-medium error count).

killmasta93 commented 4 years ago

Thanks for the reply, here is the picture:

image

nobody43 commented 4 years ago

You can see the increase right here, but to be sure click on Graph and select values.

killmasta93 commented 4 years ago

Thanks for the quick reply, I'm attaching the picture. So does that mean all the disks are about to die?

image

nobody43 commented 4 years ago

That can't be told straight away. Considering the count was already 178 and only jumped by one, the recent change is not fatal. Just keep an eye on the disk and make backups. Given that the error is not limited to a single disk, the cause could be the power supply, a data cable or the controller.
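
For keeping an eye on the counter outside Zabbix, a minimal sketch like the following could pull the value out of saved smartctl -a output. It assumes the SAS-style line "Non-medium error count: N" shown in the report above; the function name non_medium_error_count is hypothetical and not part of this project.

```python
import re
from typing import Optional

# Matches the line as printed by smartctl for SCSI/SAS disks in the report above.
_PATTERN = re.compile(r"^Non-medium error count:\s+(\d+)\s*$", re.MULTILINE)


def non_medium_error_count(smartctl_output: str) -> Optional[int]:
    """Return the non-medium error count, or None if the line is absent."""
    match = _PATTERN.search(smartctl_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "Error counter log:\n"
        "\n"
        "Non-medium error count:      196\n"
        "\n"
        "No self-tests have been logged\n"
    )
    print(non_medium_error_count(sample))
```

Comparing the value across all disks before and after an event such as a reboot can show whether the counters moved together, which would point to a shared component rather than one drive.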

killmasta93 commented 4 years ago

Thanks for the reply, I will keep that in mind.