Attention! I will deprecate this plugin. It is more than ten years old and does not use the same folder/code structure like my other plugins, which makes them far mor easier to maintain. Soon i will implement check_hpasm's functions in the existing check_hp_health (which knows HP 3PAR now). In order to run before/after-tests, i'd like to ask you for your assistance. Please run snmpwalk -ObentU .... ip-of-ilo 1.3.6.1.2.1 > /tmp/check_hp_health_ip-of-ilo.snmpwalk snmpwalk -ObentU .... ip-of-ilo 1.3.6.1.4.1 >> /tmp/check_hp_health_ip-of-ilo.snmpwalk and mail me the resulting file to gerhard.lausser@consol.de No matter if the devices are running just fine or currently have a problem (of course ilos showing some faulted components are very useful)
This plugin checks the hardware health of HP Proliant servers with the hpasm software installed. It uses the hpasmcli command to acquire the condition of the system's critical components like cpus, power supplies, temperatures, fans and memory modules. Newer versions also use SNMP.
For instructions on installing this plugin for use with Nagios, see below. In addition, generic instructions for the GNU toolchain can be found in the INSTALL file.
For major changes between releases, read the CHANGES file.
For information on detailed changes that have been made, read the Changelog file.
This plugins is self documenting. All plugins that comply with the basic guidelines for development will provide detailed help when invoked with the '-h' or '--help' options.
You can check for the latest plugin at: http://www.consol.de/opensource/nagios/check-hpasm
Send mail to gerhard.lausser@consol.de for assistance.
Please include the OS type and version that you are using.
Also, run the plugin with the '-v' option and provide the resulting
version information. Of course, there may be additional diagnostic information
required as well. Use good judgment.
1) Run the configure script to initialize variables and create a Makefile, etc.
./configure --prefix=BASEDIRECTORY --with-nagios-user=SOMEUSER --with-nagios-group=SOMEGROUP --with-perl=PATH_TO_PERL --with-noinst-level=LEVEL --with-degrees=UNIT --with-perfdata --with-hpacucli
a) Replace BASEDIRECTORY with the path of the directory under which Nagios is installed (default is '/usr/local/nagios') b) Replace SOMEUSER with the name of a user on your system that will be assigned permissions to the installed plugins (default is 'nagios') c) Replace SOMEGRP with the name of a group on your system that will be assigned permissions to the installed plugins (default is 'nagios') d) Replace PATH_TO_PERL with the path where a perl binary can be found. Besides the system wide perl you might have installed a private perl just for the nagios plugins (default is the perl in your path). e) Replace LEVEL with one of ok, warning, critical or unknown. If the required hpasm-rpm is not installed, the check_hpasm plugin will exit with the level specified. If you chose ok, the message will say "ok - .... hpasm is not installed". This is different from the "ok - hardware working fine" if hpasm was found. The default is to treat a missing hpasm package as ok. f) Replace UNIT with one of celsius or fahrenheit. The hpasmcli "show temp" prints temperatures both in units of celsius and fahrenheit. With the --with-degrees option you can decide which units will be shown in an alarm message. The default is "celsius". g) You can tell check_hpasm to output performance data by default if you call configure with the --enable-perfdata option. h) You can tell check_hpasm to check the raid status with the hpacucli command if you call configure with the --enable-hpacucli option. You need the hpacucli rpm.
2) "Compile" the plugin with the following command:
make
This will produce a "check_hpasm" script. You will also find
a "check_hpasm.pl" which you better ignore. It is the base for
the compilation filled with placeholders. These will be replaced during
the make process.
3) Install the compiled plugin script with the following command:
make install
The installation procedure will attempt to place the plugin in a 'libexec/' subdirectory in the base directory you specified with the --prefix argument to the configure script.
4) Verify that your configuration files for Nagios contains the correct paths to the new plugin.
5) Add this line to /etc/sudoers: nagios ALL=NOPASSWD: /sbin/hpasmcli or ths, if you also installed the hpacu package nagios ALL=NOPASSWD: /sbin/hpasmcli, /usr/sbin/hpacucli
-v, --verbose Increased verbosity will print how check_hpasm communicates with the hpasm daemon and which values were acquired.
-t, --timeout The number of seconds after which the plugin will abort.
-b, --blacklist If some components of your system are missing (mostly the secondary power supply bay is empty) and you tolerate this, then blacklist the missing/failed component to avoid false alarms. The value for this option is a slash-separated list of components to ignore. Example: -b p:1,2/f:2/t:3,4/c:1/d:0-1,0-2 means: ignore power supplies #1 and #2, fan #2, temperature #3 and #4, cpu #1 and dimms #1 and #2 in cartridge #0.
-c, --customthresh Override the machine-default temperature thresholds. Example: -c 1:60/4:80/5:50 Sets limit for temperature 1 to 60 degrees, temperature 4 to 80 degrees and temperature 5 to 50 degrees. You get the consecutive numbers by calling check_hpasm -v ... checking temperatures 1 processor_zone temperature is 46 (62 max) 2 cpu#1 temperature is 43 (73 max) 3 i/o_zone temperature is 54 (68 max) 4 cpu#2 temperature is 46 (73 max) 5 power_supply_bay temperature is 38 (55 max)
-p, --perfdata Add performance data to the output even if you did not compile check_hpasm with --with-perfdata in step 1.
Older hardware does not always show valuable information when queried for the health of memory modules. Maybe it's because older modules do not support error checking at all.
Some (older) systems do not support the cpqHeResMemModuleEntry table. Either there is no oid with 1.3.6.1.4.1.232.6.2.14.11.1 at all or there is a single oid like
Example: iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 524288 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
^-- module number
^-- cartridge number (0 = system board)
^-- size
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
I compared 300 systems and found out that with
1.3.6.1.4.1.232.6.2.14.11.1.
cpqSiMemECCStatus provides no usable information. All my test systems showed 0 which is an undocumented value.
function get_size(cpqHeResMemModuleEntry) will return 1.
grepping for cpqSiMemBoardSize shows 4 modules iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144 iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
grepping for cpqHeResMemEntry shows one module with zero values iso.3.6.1.4.1.232.6.2.14.11.1.1.0.0 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.2.0.0 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.3.0.0 = "" iso.3.6.1.4.1.232.6.2.14.11.1.4.0.0 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.5.0.0 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.6.0.0 = Hex-STRING: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cpqSiMemBoardIndex 1.3.6.1.4.1.232.2.2.4.5.1.1 cpqSiMemModuleIndex 1.3.6.1.4.1.232.2.2.4.5.1.2
cpqHeResMemBoardIndex 1.3.6.1.4.1.232.6.2.14.11.1.1 cpqHeResMemModuleIndex 1.3.6.1.4.1.232.6.2.14.11.1.2
cpqSiMemBoardIndex SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.1 = INTEGER: 0 SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.2 = INTEGER: 0 SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.3 = INTEGER: 0 SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.4 = INTEGER: 0 SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.5 = INTEGER: 0 SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.6 = INTEGER: 0
cpqHeResMemBoardIndex SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.1 = INTEGER: 0 SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.2 = INTEGER: 0 SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.3 = INTEGER: 0 SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.4 = INTEGER: 0 SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.5 = INTEGER: 0 SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.6 = INTEGER: 0
It is not possible to use the SNMP-table-indices to identify the corresponding he-entry. Matching is done with nested loops.
cpqSiMemBoardIndex iso.3.6.1.4.1.232.2.2.4.5.1.1.1.1 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.2 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.3 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.4 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.5 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.6 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.7 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.1.8 = INTEGER: 1 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.1 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.2 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.3 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.4 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.5 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.6 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.7 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.2.8 = INTEGER: 2 iso.3.6.1.4.1.232.2.2.4.5.1.1.3.1 = INTEGER: 3
cpqHeResMemBoardIndex iso.3.6.1.4.1.232.6.2.14.11.1.1.0.1 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.2 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.3 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.4 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.5 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.7 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.0.8 = INTEGER: 0 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.1 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.2 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.3 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.4 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.5 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.6 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.7 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.1.8 = INTEGER: 1 iso.3.6.1.4.1.232.6.2.14.11.1.1.2.1 = INTEGER: 2
I saw one old server which had only half of the possible fans installed.
Fan# 1 2 3 4 5 6
cpqHeFltTolFanPresent yes no yes no yes no cpqHeFltTolFanRedundant no no no no no no cpqHeFltTolFanRedundantPartner 2 1 4 3 6 5 cpqHeFltTolFanCondition ok other ok other ok other cpqHeFltTolFanLocation cpu cpu cpu cpu io io
Normally this would result in ... fan #1 (cpu) is not redundant fan #2 (cpu) is not redundant fan #3 (cpu) is not redundant fan #4 (cpu) is not redundant fan #5 (ioboard) is not redundant fan #6 (ioboard) is not redundant WARNING - fan #1 (cpu) is not redundant, fan #2 (cpu) is not redundant, fan #3 (cpu) is not redundant, fan #4 (cpu) is not redundant, fan #5 (ioboard) is not redundant, fan #6 (ioboard) is not redundant
However it was the server's owner decision not to install fan pairs but only one fan per location, so for him this is a false alert.
By using --ignore-fan-redundancy check_hpasm only looks at the cpqHeFltTolFanCondition and ignores dependencies between two fans, so the result is:
fan 1 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 2 fan 3 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 4 fan 5 speed is normal, pctmax is 50%, location is ioboard, redundance is no, partner is 6 OK - System: 'proliant ml370 g3', ...
local - where check_hpasm runs remote - where a proliant can be reached proliant - where the snmp agent runs
remote: ssh -R6667:localhost:6667 local socat tcp4-listen:6667,reuseaddr,fork UDP:proliant:161
local: socat udp4-listen:161,reuseaddr,fork tcp:localhost:6667 check_hpasm --hostname 127.0.0.1
hpasmcli=$(which hpasmcli) hpacucli=$(which hpacucli) for i in server powersupply fans temp dimm do $hpasmcli -s "show $i" | while read line do printf "%s %s\n" $i "$line" done done if [ -x "$hpacucli" ]; then for i in config status do $hpacucli ctrl all show $i | while read line do printf "%s %s\n" $i "$line" done done fi
If you think check_hpasm is not working correctly, please run the above script and send me the output. It's also helpful to see the output of snmpwalk snmpwalk .... 1.3.6.1.4.1.232
-- Gerhard Lausser gerhard.lausser@consol.de