thomas-krenn / check_lsi_raid

Monitoring plugin to check MegaRAID controllers
GNU General Public License v3.0

wrong storcli version? #20

Closed christianuhlmann closed 6 years ago

christianuhlmann commented 6 years ago

Hi,

I use the following storcli version here:

StorCli SAS Customization Utility Ver 007.0504.0000.0000 Nov 22, 2017
(c) Copyright 2017, AVAGO Technologies, All Rights Reserved.
Exit Code: 0x00

This is on XenServer 7.4 or XCP-ng 7.4.1; the base system is CentOS 7.

storcli runs flawlessly as root as well as under the users nagios and icinga; both have the necessary rights via sudoers.

I do not use NRPE, but the check script keeps reporting new errors:

Error 1:

Error: invalid controller number, controller not found!

Solution: following lines

sub getControllerTime {
        my $storcli = shift;
        my @output = `$storcli show time`;
        return (checkCommandStatus(\@output));
}

replace with

sub getControllerTime {
# return 1;
}

Output from commandline:

storcli /c0 show time
CLI Version = 007.0504.0000.0000 Nov 22, 2017
Operating system = Linux 4.4.0+10
Controller = 0
Status = Success
Description = None

Controller Properties :
=====================

------------------------------
Ctrl_Prop Value
------------------------------
Time      2018/05/17 23:58:24
------------------------------
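
A quick way to confirm from the shell that a command like the one above reports success is to look for the "Status = Success" line in its output. The following is a minimal sketch, not the plugin's checkCommandStatus implementation; the function name storcli_status_ok is hypothetical, and the sample text is copied from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the text on stdin contains
# a line of the form "Status = Success", as in the output above.
storcli_status_ok() {
    grep -Eq '^Status[[:space:]]*=[[:space:]]*Success[[:space:]]*$'
}

# Canned sample taken from the "show time" output in this report.
sample_output='CLI Version = 007.0504.0000.0000 Nov 22, 2017
Operating system = Linux 4.4.0+10
Controller = 0
Status = Success
Description = None'

if printf '%s\n' "$sample_output" | storcli_status_ok; then
    echo "status ok"
else
    echo "status not ok"
fi
```

On a real host the canned text would be replaced by piping the actual command into the function, for example: storcli /c0 show time | storcli_status_ok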

Error 2:

Invalid StorCLI command! (/usr/bin/sudo /opt/MegaRAID/storcli/storcli64 adpallinfo a0 -NoLog)

Code Lines:

sub getControllerInfo{
        my $storcli = shift;
        my $writelogs = shift;
        my $commands_a = shift;
        my $command = '';

        $storcli =~ /^(.*)\/c[0-9]+/;
        if(!defined($writelogs)){
                $command = $1.'adpallinfo a'.$CONTROLLER.' -NoLog';
        }
        else{
                $command = $1.'adpallinfo a'.$CONTROLLER;
        }

        push @{$commands_a}, $command;
        my @output = `$command`;
        if($? >> 8 != 0){
                print "Invalid StorCLI command! ($command)\n";
                exit(STATE_UNKNOWN);

Solution: no idea

Output from commandline:

Adapter #0

==============================================================================
                    Versions
                ================
Product Name    : LSI MegaRAID SAS 9260-8i
Serial No       : SV13800819
FW Package Build: 12.12.0-0111

                    Mfg. Data
                ================
Mfg. Date       : 09/10/11
Rework Date     : 00/00/00
Revision No     : 79B
Battery FRU     : N/A

                Image Versions in Flash:
                ================
FW Version         : 2.130.353-1663
BIOS Version       : 3.24.00_4.12.05.00_0x05160000
Preboot CLI Version: 04.04-020:#%00009
WebBIOS Version    : 6.0-49-e_45-Rel
NVDATA Version     : 2.09.03-0032
Boot Block Version : 2.02.00.00-0000
BOOT Version       : 09.250.01.219

                Pending Images in Flash
                ================
None

                PCI Info
                ================
Controller Id   : 0000
Vendor Id       : 1000
Device Id       : 0079
SubVendorId     : 1014
SubDeviceId     : 03b2

Host Interface  : PCIE

ChipRevision    : B4

Number of Frontend Port: 0
Device Interface  : PCIE

Number of Backend Port: 8
Port  :  Address
0        4433221102000000
1        4433221103000000
2        4433221106000000
3        4433221104000000
4        4433221105000000
5        0000000000000000
6        0000000000000000
7        0000000000000000

                HW Configuration
                ================
SAS Address      : 500605b003aebe70
BBU              : Present
Alarm            : Present
NVRAM            : Present
Serial Debugger  : Present
Memory           : Present
Flash            : Present
Memory Size      : 512MB
TPM              : Absent
On board Expander: Absent
Upgrade Key      : Absent
Temperature sensor for ROC    : Absent
Temperature sensor for controller    : Absent

                Settings
                ================
Current Time                     : 23:56:31 5/17, 2018
Predictive Fail Poll Interval    : 300sec
Interrupt Throttle Active Count  : 16
Interrupt Throttle Completion    : 50us
Rebuild Rate                     : 30%
PR Rate                          : 30%
BGI Rate                         : 30%
Check Consistency Rate           : 30%
Reconstruction Rate              : 30%
Cache Flush Interval             : 4s
Max Drives to Spinup at One Time : 4
Delay Among Spinup Groups        : 2s
Physical Drive Coercion Mode     : Disabled
Cluster Mode                     : Disabled
Alarm                            : Enabled
Auto Rebuild                     : Enabled
Battery Warning                  : Enabled
Ecc Bucket Size                  : 15
Ecc Bucket Leak Rate             : 1440 Minutes
Restore HotSpare on Insertion    : Disabled
Expose Enclosure Devices         : Enabled
Maintain PD Fail History         : Enabled
Host Request Reordering          : Enabled
Auto Detect BackPlane Enabled    : SGPIO/i2c SEP
Load Balance Mode                : Auto
Use FDE Only                     : No
Security Key Assigned            : No
Security Key Failed              : No
Security Key Not Backedup        : No
Default LD PowerSave Policy      : Controller Defined
Maximum number of direct attached drives to spin up in 1 min : 120
Auto Enhanced Import             : No
Any Offline VD Cache Preserved   : No
Allow Boot with Preserved Cache  : No
Disable Online Controller Reset  : No
PFK in NVRAM                     : No
Use disk activity for locate     : No
POST delay                       : 90 seconds

                Capabilities
                ================
RAID Level Supported             : RAID0, RAID1, RAID5, RAID00, RAID10, RAID50, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Supported Drives                 : SAS, SATA
Boot Volume Supported            : NO

Allowed Mixing:

Mix in Enclosure Allowed
Mix of SAS/SATA of HDD type in VD Allowed

                Status
                ================
ECC Bucket Count                 : 0

                Limitations
                ================
Max Arms Per VD          : 32
Max Spans Per VD         : 8
Max Arrays               : 128
Max Number of VDs        : 64
Max Parallel Commands    : 1008
Max SGE Count            : 80
Max Data Transfer Size   : 8192 sectors
Max Strips PerIO         : 42
Max LD per array         : 16
Min Strip Size           : 8 KB
Max Strip Size           : 1.0 MB
Max Configurable CacheCade Size: 0 GB
Current Size of CacheCade      : 0 GB
Current Size of FW Cache       : 350 MB

                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 6
  Disks           : 5
  Critical Disks  : 0
  Failed Disks    : 0

                Supported Adapter Operations
                ================
Rebuild Rate                    : Yes
CC Rate                         : Yes
BGI Rate                        : Yes
Reconstruct Rate                : Yes
Patrol Read Rate                : Yes
Alarm Control                   : Yes
Cluster Support                 : No
BBU                             : Yes
Spanning                        : Yes
Dedicated Hot Spare             : Yes
Revertible Hot Spares           : Yes
Foreign Config Import           : Yes
Self Diagnostic                 : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares               : Yes
Deny SCSI Passthrough           : No
Deny SMP Passthrough            : No
Deny STP Passthrough            : No
Support Security                : No
Snapshot Enabled                : No
Support the OCE without adding drives : Yes
Support PFK                     : No
Support PI                      : No
Support Boot Time PFK Change    : No
Disable Online PFK Change       : No
Support Shield State            : No
Block SSD Write Disk Cache Change: No
Point In Time Progress: No

                Supported VD Operations
                ================
Read Policy          : Yes
Write Policy         : Yes
IO Policy            : Yes
Access Policy        : Yes
Disk Cache Policy    : Yes
Reconstruction       : Yes
Deny Locate          : No
Deny CC              : No
Allow Ctrl Encryption: No
Enable LDBBM         : No
Support Breakmirror  : No
Power Savings        : No

                Supported PD Operations
                ================
Force Online                            : Yes
Force Offline                           : Yes
Force Rebuild                           : Yes
Deny Force Failed                       : No
Deny Force Good/Bad                     : No
Deny Missing Replace                    : No
Deny Clear                              : No
Deny Locate                             : No
Support Temperature                     : Yes
NCQ                                     : No
Disable Copyback                        : No
Enable JBOD                             : No
Enable Copyback on SMART                : No
Enable Copyback to SSD on SMART Error   : Yes
Enable SSD Patrol Read                  : No
PR Correct Unconfigured Areas           : Yes
                Error Counters
                ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0

                High Availability Properties
                ================
Topology Type                 : None
                Cluster Information
                ================
Cluster Permitted     : No
Cluster Active        : No

                Default Settings
                ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
Strip Size                       : 64kB
Flush Time                       : 4 seconds
Write Policy                     : WB
Read Policy                      : Adaptive
Cache When BBU Bad               : Disabled
Cached IO                        : No
SMART Mode                       : Mode 6
Alarm Disable                    : Yes
Coercion Mode                    : None
ZCR Config                       : Unknown
Dirty LED Shows Drive Activity   : No
BIOS Continue on Error           : No
Spin Down Mode                   : None
Allowed Device Type              : SAS/SATA Mix
Allow Mix in Enclosure           : Yes
Allow HDD SAS/SATA Mix in VD     : Yes
Allow SSD SAS/SATA Mix in VD     : No
Allow HDD/SSD Mix in VD          : No
Allow SATA in Cluster            : No
Max Chained Enclosures           : 16
Disable Ctrl-R                   : Yes
Enable Web BIOS                  : Yes
Direct PD Mapping                : No
BIOS Enumerate VDs               : Yes
Restore Hot Spare on Insertion   : No
Expose Enclosure Devices         : Yes
Maintain PD Fail History         : Yes
Disable Puncturing               : No
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled              : Yes
LED Show Drive Activity          : Yes
Cluster Disable                  : Yes
SAS Disable                      : No
Auto Detect BackPlane Enable     : SGPIO/i2c SEP
Use FDE Only                     : No
Enable Led Header                : Yes
Delay during POST                : 0
EnableCrashDump                  : No
Disable Online Controller Reset  : No
EnableLDBBM                      : No
Un-Certified Hard Disk Drives    : Allow
Treat Single span R1E as R10     : No
Max LD per array                 : 16
Power Saving option              : Don't Auto spin down Configured Drives
Max power savings option is  not allowed for LDs. Only T10 power conditions are to be used.
Default spin down time in minutes: 30
Enable JBOD                      : No
TTY Log In Flash                 : No
Auto Enhanced Import             : No
BreakMirror RAID Support         : No
Disable Join Mirror              : No
Enable Shield State              : No
Time taken to detect CME         : 60s

Exit Code: 0x00
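
The plugin code quoted above prints "Invalid StorCLI command!" whenever the command's exit status is non-zero, so it can help to run the exact command as the monitoring user and look at the exit status the shell sees, not only at the printed output. The following is a minimal sketch; the function name report_exit_status is hypothetical, and the commands it runs here are harmless stand-ins (true and false) so that the block works on any machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, discard its output, and print
# the exit status that the shell saw.
report_exit_status() {
    "$@" >/dev/null 2>&1
    local rc=$?
    echo "exit status: $rc"
    return "$rc"
}

# Stand-in commands so the sketch runs anywhere.
report_exit_status true
report_exit_status false || echo "a non-zero status like this triggers the plugin's error"
```

On the affected host the stand-ins would be replaced by the command from the error message, run as the nagios or icinga user, for example: report_exit_status /usr/bin/sudo /opt/MegaRAID/storcli/storcli64 adpallinfo a0 -NoLog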

I stopped analyzing at this point, but something seems to be wrong.

Can someone give me a hint?

Thanks and greetings, Christian

christianuhlmann commented 6 years ago

Added the original output from the storcli commands.

gschoenberger commented 6 years ago

Can you also post the output of the plugin with "-vvv" enabled?

gschoenberger commented 6 years ago

Can you also check whether the storcli commands return 0 and whether the controller number is correct? I have a test script in which my storcli is just a wrapper that prints the output on the command line and lets the plugin parse it:

./check_lsi_raid -p ./storcli-testing -C 0 -vvv
Use of uninitialized value $controllerToCheck{"ROC temperature"} in concatenation (.) or string at ./check_lsi_raid line 1126.
Critical (LD Warn, PD Warn, BBU Warn, CV Crit) [CV_Replacement_required = Critical (Yes)][c0/v1_State = Critical (Dgrd)][c0/v2_State = Critical (BLa)][c0/e252/s1_State = Critical (Dgrd)][c0/e252/s2_State = Critical (Offln)][BBU_State = Warning (Failed)][BBU_Voltage = Warning (Low)][c0/v0_Consist = Warning (No)][c0/v1_Consist = Warning (No)][c0/v1_Init = Warning (25)][c0/e252/s0_BBM_error_count = Warning (15)][c0/e252/s1_Predictive_failure_count = Warning (10)][c0/e252/s1_SMART_flag = Warning][c0/e252/s2_Shield_counter = Warning (5)][c0/e252/s2_Media_error_count = Warning (3)][c0/e252/s2_Init = Warning (33)][c0/e252/s1_Rebuild = Warning (25)][c0/e252/s3_Rebuild = Warning (60)]|BBU_Temperature=28;50;60 CV_Temperature=23;70;85 c0/e252/s0_Drive_Temperature=31;40;45 c0/e252/s1_Drive_Temperature=30;40;45
Used storcli commands:
- /usr/bin/sudo ./storcli-testing /c0 /bbu show status
- /usr/bin/sudo ./storcli-testing /c0 /cv show status
- /usr/bin/sudo ./storcli-testing adpallinfo a0 -NoLog
- /usr/bin/sudo ./storcli-testing /c0/vall show all
- /usr/bin/sudo ./storcli-testing /c0/vall show init
- /usr/bin/sudo ./storcli-testing /c0/eall/sall show all
- /usr/bin/sudo ./storcli-testing /c0/eall/sall show initialization
- /usr/bin/sudo ./storcli-testing /c0/eall/sall show rebuild
Critical sensors:
    - CV_Replacement_required (Yes)
    - c0/v1_State (Dgrd)
    - c0/v2_State (BLa)
    - c0/e252/s1_State (Dgrd)
    - c0/e252/s2_State (Offln)
Warning sensors:
    - BBU_State (Failed)
    - BBU_Voltage (Low)
    - c0/v0_Consist (No)
    - c0/v1_Consist (No)
    - c0/v1_Init (25)
    - c0/e252/s0_BBM_error_count (15)
    - c0/e252/s1_Predictive_failure_count (10)
    - c0/e252/s1_SMART_flag
    - c0/e252/s2_Shield_counter (5)
    - c0/e252/s2_Media_error_count (3)
    - c0/e252/s2_Init (33)
    - c0/e252/s1_Rebuild (25)
    - c0/e252/s3_Rebuild (60)
CTR information:
    - LSI MegaRAID SAS 9260-8i:
        - Serial No=SV13800819
        - FW Package Build=12.12.0-0111
        - Mfg. Date=09/10/11
        - Revision No=79B
        - BIOS Version=3.24.00_4.12.05.00_0x05160000
        - FW Version=2.130.353-1663
        - ROC temperature=

I have inserted your output from above, and apart from the ROC temperature the output seems fine and the plugin can parse it. The output from "show time" also seems valid.

Therefore I assume there is some trouble with the setup...
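
A test wrapper of the kind described above can be very small. The following is a sketch under assumptions, not the actual storcli-testing script: the function name fake_storcli, the FIXTURE_DIR variable, and the file naming scheme are all hypothetical. It prints a previously saved output file chosen by its arguments and returns 0 whenever a matching file exists, so the plugin only exercises its parsing.

```shell
#!/usr/bin/env bash
# Directory holding saved storcli outputs (hypothetical layout).
FIXTURE_DIR=$(mktemp -d)

# Hypothetical stand-in for storcli: map the arguments to a file name
# and print that file instead of talking to a controller.
fake_storcli() {
    # Example mapping: "/c0 show time" -> "_c0_show_time.txt"
    local name
    name=$(printf '%s' "$*" | tr -c 'A-Za-z0-9' '_')
    cat "$FIXTURE_DIR/$name.txt"
}

# Save one canned output, then replay it.
printf 'Controller = 0\nStatus = Success\n' > "$FIXTURE_DIR/_c0_show_time.txt"
fake_storcli /c0 show time
```

Saved as an executable script and passed to the plugin with -p, a stub like this separates parsing problems in the plugin from permission or path problems on the monitored host.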

christianuhlmann commented 6 years ago

Hi, thanks for your analysis. Unfortunately I do not know what is wrong with the setup. Do you have a hint for me?

Here is the output with -vvv:

'/usr/lib64/nagios/plugins/check_lsi_raid' '-C' '0' '-p' '/usr/sbin/storcli' -vvv

Use of uninitialized value $controllerToCheck{"ROC temperature"} in concatenation (.) or string at /usr/lib64/nagios/plugins/check_lsi_raid line 1127.
Critical (LD Warn, BBU Crit) [BBU_Firmware_temperature = Critical (High)][c0/v0_Consist = Warning (No)][c0/v1_Consist = Warning (No)]|BBU_Firmware_temperature=High BBU_Temperature=46;50;60 c0/e252/s4_Drive_Temperature=34;40;45 c0/e252/s5_Drive_Temperature=36;40;45 c0/e252/s6_Drive_Temperature=34;40;45
Used storcli commands:
- /usr/sbin/storcli /c0 /bbu show status
- /usr/sbin/storcli adpallinfo a0 -NoLog
- /usr/sbin/storcli /c0/vall show all
- /usr/sbin/storcli /c0/vall show init
- /usr/sbin/storcli /c0/eall/sall show all
- /usr/sbin/storcli /c0/eall/sall show initialization
- /usr/sbin/storcli /c0/eall/sall show rebuild
Critical sensors:
        - BBU_Firmware_temperature (High)
Warning sensors:
        - c0/v0_Consist (No)
        - c0/v1_Consist (No)
CTR information:
        - LSI MegaRAID SAS 9260-8i:
                - Serial No=SV13800819
                - FW Package Build=12.12.0-0111
                - Mfg. Date=09/10/11
                - Revision No=79B
                - BIOS Version=3.24.00_4.12.05.00_0x05160000
                - FW Version=2.130.353-1663
                - ROC temperature=
LD information:
        - c0/v0:
                - Access=RW
                - Cac=-
                - Cache=RWTD
                - Consist=No
                - DG/VD=0/0
                - Size=232.375
                - State=Optl
                - TYPE=RAID1
                - ld=c0/v0
                - sCC=ON
        - c0/v1:
                - Access=RW
                - Cac=-
                - Cache=RWTD
                - Consist=No
                - DG/VD=1/1
                - Size=5.457
                - State=Optl
                - TYPE=RAID5
                - ld=c0/v1
                - sCC=ON
PD information:
        - c0/e252/s0:
                - DG=0
                - DID=14
                - Drive Temperature=N/A
                - EID:Slt=252:0
                - Intf=SATA
                - Med=SSD
                - Media Error Count=0
                - Model=Samsung SSD
                - Other Error Count=0
                - PI=N
                - Predictive Failure Count=0
                - S.M.A.R.T alert flagged by drive=No
                - SED=Y
                - SeSz=512B
                - Shield Counter=0
                - Size=232.375GB
                - Sp=850
                - State=Onln
                - pd=c0/e252/s0
        - c0/e252/s1:
                - DG=0
                - DID=15
                - Drive Temperature=N/A
                - EID:Slt=252:1
                - Intf=SATA
                - Med=SSD
                - Media Error Count=0
                - Model=Samsung SSD
                - Other Error Count=0
                - PI=N
                - Predictive Failure Count=0
                - S.M.A.R.T alert flagged by drive=No
                - SED=Y
                - SeSz=512B
                - Shield Counter=0
                - Size=232.375GB
                - Sp=850
                - State=Onln
                - pd=c0/e252/s1
        - c0/e252/s4:
                - DG=1
                - DID=16
                - Drive Temperature=34C (93.20 F)
                - EID:Slt=252:4
                - Intf=SATA
                - Med=HDD
                - Media Error Count=0
                - Model=WDC WD30EFRX-68EUZN0
                - Other Error Count=0
                - PI=N
                - Predictive Failure Count=0
                - S.M.A.R.T alert flagged by drive=No
                - SED=N
                - SeSz=512B
                - Shield Counter=0
                - Size=2.728TB
                - Sp=U
                - State=Onln
                - pd=c0/e252/s4
        - c0/e252/s5:
                - DG=1
                - DID=17
                - Drive Temperature=36C (96.80 F)
                - EID:Slt=252:5
                - Intf=SATA
                - Med=HDD
                - Media Error Count=0
                - Model=WDC WD30EFRX-68EUZN0
                - Other Error Count=0
                - PI=N
                - Predictive Failure Count=0
                - S.M.A.R.T alert flagged by drive=No
                - SED=N
                - SeSz=512B
                - Shield Counter=0
                - Size=2.728TB
                - Sp=U
                - State=Onln
                - pd=c0/e252/s5
        - c0/e252/s6:
                - DG=1
                - DID=18
                - Drive Temperature=34C (93.20 F)
                - EID:Slt=252:6
                - Intf=SATA
                - Med=HDD
                - Media Error Count=0
                - Model=WDC WD30EFRX-68EUZN0
                - Other Error Count=0
                - PI=N
                - Predictive Failure Count=0
                - S.M.A.R.T alert flagged by drive=No
                - SED=N
                - SeSz=512B
                - Shield Counter=0
                - Size=2.728TB
                - Sp=U
                - State=Onln
                - pd=c0/e252/s6
BBU information:
                - BBU_Firmware_temperature=High
                - BBU_Status=Critical
                - BBU_Temperature=46

christianuhlmann commented 6 years ago

Hi again,

Very strange: today I checked again and now it works without any changes. It must have been a temporary problem.