glensc / nagios-plugin-check_raid

Nagios/Icinga/Sensu plugin to check current server's RAID status ⛺

Failed spare drive not reported #100

Open danci1973 opened 9 years ago

danci1973 commented 9 years ago

HP Smart Array P410i with 4 drives: 3 are in RAID5 and one is a spare. The spare drive is in a failed state, but check_raid doesn't report it.

./check_raid.pl -V
check_raid Version 3.2.3

$./check_raid.pl -d
usage: sudo -h | -K | -k | -L | -V
usage: sudo -v [-AknS] [-g groupname|#gid] [-p prompt] [-u user name|#uid]
usage: sudo -l[l] [-AknS] [-g groupname|#gid] [-p prompt] [-U user name] [-u
            user name|#uid] [-g groupname|#gid] [command]
usage: sudo [-AbEHknPS] [-r role] [-t type] [-C fd] [-g groupname|#gid] [-p
            prompt] [-u user name|#uid] [-g groupname|#gid] [VAR=value] [-i|-s]
            [<command>]
usage: sudo -e [-AknS] [-r role] [-t type] [-C fd] [-g groupname|#gid] [-p
            prompt] [-u user name|#uid] file ...
DEBUG EXEC: /proc/mdstat at ./check_raid.pl line 452.
DEBUG EXEC: /usr/bin/lsscsi -g at ./check_raid.pl line 452.
DEBUG EXEC: >&2 /usr/sbin/cciss_vol_status -v at ./check_raid.pl line 448.
DEBUG EXEC: /usr/sbin/cciss_vol_status /dev/sg0 at ./check_raid.pl line 452.
Unparsed[  Failed drives:] at ./check_raid.pl line 3490, <$fh> line 2.
Unparsed[         connector 1I box 1 bay 4                 HP      EF0300FARMU                          6SJ8ND840000N5191GHP     HPD6] at ./check_raid.pl line 3490, <$fh> line 3.
Unparsed[] at ./check_raid.pl line 3490, <$fh> line 4.
Unparsed[    Total of 1 failed physical drives detected on this logical drive.] at ./check_raid.pl line 3490, <$fh> line 5.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,0 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,1 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,2 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,3 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,4 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,5 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,6 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,7 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,8 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,9 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,10 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,11 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,12 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,13 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,14 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/smartctl -H /dev/sg0 -dcciss,15 at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/hpacucli controller all show status at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/hpacucli controller slot=0 logicaldrive all show at ./check_raid.pl line 452.
OK: cciss:[/dev/sda(Smart Array P410i): Volume 0 (RAID 5): OK /dev/sg0#0,/dev/sg0#1,/dev/sg0#2,/dev/sg0#3=OK]; hpacucli:[Smart Array P410i: Array A(OK)[LUN1:OK]]

The controller is detected by two plugins - cciss and hpacucli

cciss_vol_status actually detects a failed physical drive, but doesn't say exactly which one it is:

$ /usr/sbin/cciss_vol_status /dev/sg0
/dev/sda: (Smart Array P410i) RAID 5 Volume 0 status: OK.   At least one spare drive designated.  At least one spare drive has failed.
  Failed drives:
         connector 1I box 1 bay 4                 HP      EF0300FARMU                          6SJ8ND840000N5191GHP     HPD6

    Total of 1 failed physical drives detected on this logical drive.

hpacucli, as check_raid invokes it now, doesn't show the failure at all:

 $/usr/sbin/hpacucli controller slot=0 logicaldrive all show

Smart Array P410i in Slot 0 (Embedded)

    array A

       logicaldrive 1 (558.7 GB, RAID 5, OK)

However, hpacucli can show status of each physical drive, where the failure is visible:

$/usr/sbin/hpacucli controller slot=0 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 300 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 300 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 300 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 300 GB, spare): Failed

I suggest using this command line as an additional check to improve failed drive detection.
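A minimal sketch of what such an additional check could look like, assuming the `pd all show status` output always has the line shape shown above. The function names `parse_pd_status` and `failed_drives` are hypothetical and not part of check_raid, which is written in Perl; Python is used here only for illustration.

```python
import re

# Line shape taken from the `hpacucli controller slot=0 pd all show status`
# output quoted above; other firmware or tool versions may print it differently.
PD_LINE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\((.*)\):\s*(.+?)\s*$")


def parse_pd_status(text):
    """Return one dict per physical drive found in the command output."""
    drives = []
    for line in text.splitlines():
        m = PD_LINE.match(line)
        if not m:
            continue  # blank lines and anything that is not a drive line
        drive_id, details, status = m.groups()
        attrs = [a.strip() for a in details.split(",")]
        drives.append({
            "id": drive_id,
            "spare": "spare" in attrs,
            "status": status,
        })
    return drives


def failed_drives(drives):
    """Drives whose status is anything other than OK (assumption: only OK is healthy)."""
    return [d for d in drives if d["status"] != "OK"]


SAMPLE = """
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 300 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 300 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 300 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 300 GB, spare): Failed
"""

if __name__ == "__main__":
    for d in failed_drives(parse_pd_status(SAMPLE)):
        kind = "spare" if d["spare"] else "data"
        print(f"{d['id']} ({kind}): {d['status']}")
```

On the sample above this yields a single entry, drive 1I:1:4, flagged as a spare with status Failed.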

glensc commented 9 years ago

please share output of these two commands:

DEBUG EXEC: /usr/bin/lsscsi -g at ./check_raid.pl line 452.
DEBUG EXEC: >&2 /usr/sbin/cciss_vol_status -v at ./check_raid.pl line 448.

and what is your kernel version, and do you use cciss or hpsa driver?

lsscsi is supposed to detect the devices, and a fresh enough cciss_vol_status can report individual disks.

see 3.2.0 changelog

glensc commented 9 years ago

also be sure to save all outputs of your system now, as i may need additional information, e.g. command outputs.

danci1973 commented 9 years ago
# /usr/bin/lsscsi -g
[0:0:0:0]    storage HP       P410i            6.60  -         /dev/sg0
[0:0:0:1]    disk    HP       LOGICAL VOLUME   6.60  /dev/sda   /dev/sg1
# >&2 /usr/sbin/cciss_vol_status -v
cciss_vol_status version 1.09

The server is running RHEL 6.1 (no updates).

glensc commented 9 years ago

can you try cciss_vol_status 1.10+ ?

ps: you can write code blocks with triple backticks instead of indenting with 4 spaces. it's actually written in CONTRIBUTING.md

danci1973 commented 9 years ago
# /usr/sbin/cciss_vol_status -v
cciss_vol_status version 1.10

#./check_raid.pl -d
usage: sudo -h | -K | -k | -L | -V
usage: sudo -v [-AknS] [-g groupname|#gid] [-p prompt] [-u user name|#uid]
usage: sudo -l[l] [-AknS] [-g groupname|#gid] [-p prompt] [-U user name] [-u
            user name|#uid] [-g groupname|#gid] [command]
usage: sudo [-AbEHknPS] [-r role] [-t type] [-C fd] [-g groupname|#gid] [-p
            prompt] [-u user name|#uid] [-g groupname|#gid] [VAR=value] [-i|-s]
            [<command>]
usage: sudo -e [-AknS] [-r role] [-t type] [-C fd] [-g groupname|#gid] [-p
            prompt] [-u user name|#uid] file ...
DEBUG EXEC: /proc/mdstat at ./check_raid.pl line 452.
DEBUG EXEC: /usr/bin/lsscsi -g at ./check_raid.pl line 452.
DEBUG EXEC: >&2 /usr/sbin/cciss_vol_status -v at ./check_raid.pl line 448.
DEBUG EXEC: /usr/sbin/cciss_vol_status -V /dev/sg0 at ./check_raid.pl line 452.
Unparsed[  Failed drives:] at ./check_raid.pl line 3490, <$fh> line 7.
Unparsed[         connector 1I box 1 bay 4                 HP      EF0300FARMU                          6SJ8ND840000N5191GHP     HPD6] at ./check_raid.pl line 3490, <$fh> line 8.
Unparsed[] at ./check_raid.pl line 3490, <$fh> line 9.
Unparsed[    Total of 1 failed physical drives detected on this logical drive.] at ./check_raid.pl line 3490, <$fh> line 10.
DEBUG EXEC: /usr/sbin/hpacucli controller all show status at ./check_raid.pl line 452.
DEBUG EXEC: /usr/sbin/hpacucli controller slot=0 logicaldrive all show at ./check_raid.pl line 452.
OK: cciss:[/dev/sda(Smart Array P410i): Volume 0 (RAID 5): OK, Drives(3): 1I-1-1,1I-1-2,1I-1-3=OK]; hpacucli:[Smart Array P410i: Array A(OK)[LUN1:OK]]

# /usr/sbin/cciss_vol_status -V /dev/sg0
Controller: Smart Array P410i
  Board ID: 0x3245103c
  Logical drives: 1
  Running firmware: 6.60
  ROM firmware: 6.60
/dev/sda: (Smart Array P410i) RAID 5 Volume 0 status: OK.   At least one spare drive designated.  At least one spare drive has failed.
  Failed drives:
         connector 1I box 1 bay 4                 HP      EF0300FARMU                          6SJ8ND840000N5191GHP     HPD6

    Total of 1 failed physical drives detected on this logical drive.
  Physical drives: 3
         connector 1I box 1 bay 1                 HP      EF0300FATFD                                      JXY1BLJN     HPDB OK
         connector 1I box 1 bay 2                 HP      EF0300FATFD                                      JXXY6ZUN     HPDB OK
         connector 1I box 1 bay 3                 HP      EF0300FATFD                                      JXY1MARN     HPDB OK
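
The `Failed drives:` block is what check_raid reports as `Unparsed[...]` in the debug output above. A rough sketch of how that block could be picked up, assuming the section always starts with a `Failed drives:` line and ends at the first line that is not a `connector ... box ... bay ...` entry. `parse_failed_drives` is a hypothetical name, and this is Python for illustration only, not the plugin's Perl code. The `1I-1-4` naming follows the `Drives(3): 1I-1-1,...` format check_raid already prints.

```python
import re

# Matches the start of a drive line as printed by cciss_vol_status -V above.
CONNECTOR = re.compile(r"^\s*connector\s+(\S+)\s+box\s+(\d+)\s+bay\s+(\d+)\b")


def parse_failed_drives(text):
    """Collect drives listed under 'Failed drives:' as connector-box-bay strings."""
    failed = []
    in_section = False
    for line in text.splitlines():
        if line.strip() == "Failed drives:":
            in_section = True
            continue
        if in_section:
            m = CONNECTOR.match(line)
            if m:
                failed.append("-".join(m.groups()))
            else:
                # A blank line or any other text ends the section, so the
                # later 'Physical drives:' entries are not picked up.
                in_section = False
    return failed


SAMPLE = """\
/dev/sda: (Smart Array P410i) RAID 5 Volume 0 status: OK.   At least one spare drive designated.  At least one spare drive has failed.
  Failed drives:
         connector 1I box 1 bay 4                 HP      EF0300FARMU      6SJ8ND840000N5191GHP     HPD6

    Total of 1 failed physical drives detected on this logical drive.
  Physical drives: 3
         connector 1I box 1 bay 1                 HP      EF0300FATFD      JXY1BLJN     HPDB OK
         connector 1I box 1 bay 2                 HP      EF0300FATFD      JXXY6ZUN     HPDB OK
         connector 1I box 1 bay 3                 HP      EF0300FATFD      JXY1MARN     HPDB OK
"""

if __name__ == "__main__":
    print(parse_failed_drives(SAMPLE))
```

The column whitespace in the sample is shortened compared to the real output; the pattern only looks at the leading `connector`, `box` and `bay` fields.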
glensc commented 9 years ago

i'm working on cciss_vol_status improvement (using hpacucli for monitoring is not recommended by the driver developers).

so, how do you want to represent this issue's problem?

just include the messages about spare drives in the output? these are the spare drive status messages (two of them):

.   At least one spare drive designated.
    At least one spare drive has failed.

or should the state be changed as well?

danci1973 commented 9 years ago

I think a failed spare drive is just as 'critical' as any other, so it should be reflected in the state.
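
To illustrate the idea, here is a sketch of how the volume status line could be mapped to a plugin state so that a failed spare raises the result to CRITICAL. It assumes the message text `spare drive has failed` stays exactly as printed by cciss_vol_status above, and it simplifies by treating any volume status other than OK as CRITICAL. `nagios_state` is a hypothetical helper, not something check_raid provides; the numeric values are the usual Nagios plugin exit codes.

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Extracts the volume status from "... Volume 0 status: OK.   At least ..."
STATUS = re.compile(r"status:\s*([^.]+)\.")


def nagios_state(status_line):
    """Map a cciss_vol_status volume line to a plugin state (illustrative only)."""
    m = STATUS.search(status_line)
    if not m:
        return UNKNOWN  # line does not look like a volume status line
    # Simplification: anything other than OK is treated as CRITICAL.
    state = OK if m.group(1).strip() == "OK" else CRITICAL
    # The proposal in this comment: a failed spare counts like any failed drive.
    if "spare drive has failed" in status_line:
        state = max(state, CRITICAL)
    return state


LINE = ("/dev/sda: (Smart Array P410i) RAID 5 Volume 0 status: OK.   "
        "At least one spare drive designated.  "
        "At least one spare drive has failed.")

if __name__ == "__main__":
    print(nagios_state(LINE))
```

With this mapping the volume from this report would come out as CRITICAL even though the logical drive itself is still OK.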

xorpaul commented 7 years ago

Any update on when this will make it into a new release?