thomas-krenn / check_lsi_raid

Monitoring plugin to check MegaRAID controllers
GNU General Public License v3.0

Shield Counter #29

Closed bcchrisupp closed 3 years ago

bcchrisupp commented 3 years ago

This isn't an issue with the plugin, but I was hoping you might be able to tell me what the "Shield Counter" for a drive is. We have a drive that is showing a value of 2 but I wasn't sure if we should pull the disk as I don't see any other issues with the disk.

gschoenberger commented 3 years ago

Maybe @tniedermeier could help us?

tniedermeier commented 3 years ago

Hi @bcchrisupp and @gschoenberger,

I checked all available resources. The only piece of information I found is the following (from our check_lsi_raid help section):

[ -Is | --ignore-shield-counter ] Specifies the warning threshold for media errors per disk, the default threshold is 0.

The manual of StorCLI unfortunately doesn't mention "Shield Counter"...

Best regards, Thomas

bcchrisupp commented 3 years ago

Thanks for the speedy reply @tniedermeier and @gschoenberger

What's odd is that for the drive we have that is reporting an increase in its "Shield Counter" value, I'm seeing no increase in the "Media Error Count":

root@bspod21:~# for i in {0..45} ; do /usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0/e0/s$i show all ; done | grep "Detailed\|Error\|Model\ Number\|Shield\ Counter"
...redacted
Drive /c0/e0/s5 - Detailed Information :
Shield Counter = 2
Media Error Count = 0
Other Error Count = 0
Model Number = HUH721010AL4204
...redacted

I'll ignore this attribute for the time being and pay closer attention to media errors and SMART statistics reported by the controller.
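For watching the counter across all slots, here is a minimal sketch that reduces the same `show all` text output to one line per drive with a non-zero "Shield Counter". It assumes the field labels look exactly like the ones in the output above ("Shield Counter", "Media Error Count", "Other Error Count"); `summarize_drive_counters` is a hypothetical helper name, not part of StorCLI or check_lsi_raid. It reads from stdin, so it can be tried on saved output without a controller.

```shell
#!/usr/bin/env bash
# Sketch only: summarize_drive_counters is a hypothetical helper, not a StorCLI feature.
# It reads `storcli ... show all` text output on stdin and prints one line for every
# drive whose "Shield Counter" is greater than 0.
summarize_drive_counters() {
  awk -F' = ' '
    # A drive section starts with "Drive /cX/eY/sZ - Detailed Information :"
    /Drive .* - Detailed Information/ {
      drive = $0
      sub(/^[ \t]*Drive /, "", drive)
      sub(/ - Detailed Information.*/, "", drive)
      next
    }
    # Remember the counters of the current drive (label names assumed from the output above)
    /Shield Counter =/    { shield[drive] = $2; order[++n] = drive }
    /Media Error Count =/ { media[drive] = $2 }
    /Other Error Count =/ { other[drive] = $2 }
    END {
      for (i = 1; i <= n; i++) {
        d = order[i]
        if (shield[d] + 0 > 0)
          printf "%s shield=%d media=%d other=%d\n", d, shield[d], media[d], other[d]
      }
    }
  '
}

# Usage, reusing the loop from the command above (path and slot range are from this system):
#   for i in {0..45} ; do /usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0/e0/s$i show all ; done | summarize_drive_counters
```

Fed the excerpt above, this should print a single line for /c0/e0/s5 with its shield, media, and other error counts side by side, which makes it easier to spot a Shield Counter that rises while the error counts stay at 0.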

gschoenberger commented 3 years ago

Maybe this is helpful, from a Broadcom RAID controller manual. According to this statement, the shield counter presumably counts how often the disk was placed in the shield state, an advanced diagnostic state in which the controller tries to recover a disk:

A new enterprise feature employed by the 12 Gb/s MegaRAID SAS controllers is advanced drive diagnostic technology. In the event of a physical drive failure, the drive is placed in shield state and the MegaRAID controller starts drive diagnostics to determine if the drive is indeed failed or can be restored. This saves customers time, money, and lost compute time associated with transient drive failures and unnecessary drive returns.

bcchrisupp commented 3 years ago

Thanks @gschoenberger for finding that, I'll have to dig deeper next time 🤦