lausser / check_nwc_health

nwc = network component. This plugin checks many aspects of routers, switches, WLAN controllers, firewalls, etc.
http://labs.consol.de/nagios/check_nwc_health
GNU General Public License v2.0

Timeouts when checking interface health in SNMPv3 context #281

Closed by Napsty 2 years ago

Napsty commented 3 years ago

Until today I was using an outdated version of the plugin (7.2.0.2) and now wanted to upgrade to a newer one. With the old version, the interface health check on a virtual Check Point VSX (using an SNMPv3 context) works:

$ ./check_nwc_health.20210610 --version
check_nwc_health.20210610 $Revision: 7.2.0.2 $ [http://labs.consol.de/nagios/check_nwc_health]

$ ./check_nwc_health.20210610 --hostname VSX --protocol 3 --username nagios --authpassword secret --authprotocol md5 --mode list-interfaces --contextname vsid1                
000001 lo
000002 br1
000021 wrpj512
000023 wrpj256
000025 wrpj128
000035 wrpj386
000037 wrpj193
000055 bond1.610
000087 wrpj448
OK - have fun

$ ./check_nwc_health.20210610 --hostname VSX --protocol 3 --username nagios --authpassword secret --authprotocol md5 --mode interface-health --contextname vsid1 --name wrpj386
OK - wrpj386 is up/up, interface wrpj386 usage is in:0.00% (842.58bit/s) out:0.00% (78204.84bit/s), interface wrpj386 errors in:0.00/s out:0.00/s , interface wrpj386 discards in:0.00/s out:0.00/s , interface wrpj386 broadcast in:0.00% out:0.00%  | 'wrpj386_usage_in'=0%;80;90;0;100 'wrpj386_usage_out'=0%;80;90;0;100 'wrpj386_traffic_in'=842.58;0;0;0;0 'wrpj386_traffic_out'=78204.84;0;0;0;0 'wrpj386_errors_in'=0;1;10;; 'wrpj386_errors_out'=0;1;10;; 'wrpj386_discards_in'=0;1;10;; 'wrpj386_discards_out'=0;1;10;; 'wrpj386_broadcast_in'=0%;10;20;0;100 'wrpj386_broadcast_out'=0%;10;20;0;100

But with the newer version (tested with 8.3.2.2) this results in a timeout:

$ ./check_nwc_health --version
check_nwc_health $Revision: 8.3.2.2 $ [http://labs.consol.de/nagios/check_nwc_health]

$ ./check_nwc_health --hostname VSX --protocol 3 --username nagios --authpassword secret --authprotocol md5 --mode list-interfaces --contextname vsid1                        
000001 lo
000002 br1
000021 wrpj512
000023 wrpj256
000025 wrpj128
000035 wrpj386
000037 wrpj193
000055 bond1.610
000087 wrpj448
OK - have fun

$ ./check_nwc_health --hostname VSX --protocol 3 --username nagios --authpassword secret --authprotocol md5 --mode interface-health --contextname vsid1 --name wrpj386                       
UNKNOWN - check_nwc_health timed out after 15 seconds

The timeout also happens with a newer 7.x version (tested with 7.10.3).

Napsty commented 3 years ago

The difference shows up in very verbose mode. Both the old and the new version list the interfaces successfully; after that, the traces diverge as follows.

On old version:

Thu Jun 10 10:24:29 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:24:29 2021: get_snmp_table_objects IFMIB ifTable+ifXTable
Thu Jun 10 10:24:29 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:24:29 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:24:29 2021: get_entries $VAR1 = {
  '-endindex' => '35',
  '-startindex' => '35',
  '-columns' => [
    '1.3.6.1.2.1.2.2.1.2',
    '1.3.6.1.2.1.2.2.1.5',
    '1.3.6.1.2.1.2.2.1.7',
    '1.3.6.1.2.1.2.2.1.8',
    '1.3.6.1.2.1.2.2.1.10',
    '1.3.6.1.2.1.2.2.1.11',
    '1.3.6.1.2.1.2.2.1.13',
    '1.3.6.1.2.1.2.2.1.14',
    '1.3.6.1.2.1.2.2.1.16',
    '1.3.6.1.2.1.2.2.1.17',
    '1.3.6.1.2.1.2.2.1.19',
    '1.3.6.1.2.1.2.2.1.20'
  ]
};

Thu Jun 10 10:24:29 2021: get_entries_get_bulk $VAR1 = {
  '-columns' => [
    '1.3.6.1.2.1.2.2.1.2',
    '1.3.6.1.2.1.2.2.1.5',
    '1.3.6.1.2.1.2.2.1.7',
    '1.3.6.1.2.1.2.2.1.8',
    '1.3.6.1.2.1.2.2.1.10',
    '1.3.6.1.2.1.2.2.1.11',
    '1.3.6.1.2.1.2.2.1.13',
    '1.3.6.1.2.1.2.2.1.14',
    '1.3.6.1.2.1.2.2.1.16',
    '1.3.6.1.2.1.2.2.1.17',
    '1.3.6.1.2.1.2.2.1.19',
    '1.3.6.1.2.1.2.2.1.20'
  ],
  '-startindex' => '35',
  '-endindex' => '35'
};

Thu Jun 10 10:24:29 2021: get_entries $VAR1 = {
  '-columns' => [
    '1.3.6.1.2.1.31.1.1.1.1',
    '1.3.6.1.2.1.31.1.1.1.2',
    '1.3.6.1.2.1.31.1.1.1.3',
    '1.3.6.1.2.1.31.1.1.1.4',
    '1.3.6.1.2.1.31.1.1.1.5',
    '1.3.6.1.2.1.31.1.1.1.6',
    '1.3.6.1.2.1.31.1.1.1.7',
    '1.3.6.1.2.1.31.1.1.1.8',
    '1.3.6.1.2.1.31.1.1.1.9',
    '1.3.6.1.2.1.31.1.1.1.10',
    '1.3.6.1.2.1.31.1.1.1.11',
    '1.3.6.1.2.1.31.1.1.1.12',
    '1.3.6.1.2.1.31.1.1.1.13',
    '1.3.6.1.2.1.31.1.1.1.15',
    '1.3.6.1.2.1.31.1.1.1.18'
  ],
  '-startindex' => '35',
  '-endindex' => '35'
};

Thu Jun 10 10:24:29 2021: get_entries_get_bulk $VAR1 = {
  '-endindex' => '35',
  '-columns' => [
    '1.3.6.1.2.1.31.1.1.1.1',
    '1.3.6.1.2.1.31.1.1.1.2',
    '1.3.6.1.2.1.31.1.1.1.3',
    '1.3.6.1.2.1.31.1.1.1.4',
    '1.3.6.1.2.1.31.1.1.1.5',
    '1.3.6.1.2.1.31.1.1.1.6',
    '1.3.6.1.2.1.31.1.1.1.7',
    '1.3.6.1.2.1.31.1.1.1.8',
    '1.3.6.1.2.1.31.1.1.1.9',
    '1.3.6.1.2.1.31.1.1.1.10',
    '1.3.6.1.2.1.31.1.1.1.11',
    '1.3.6.1.2.1.31.1.1.1.12',
    '1.3.6.1.2.1.31.1.1.1.13',
    '1.3.6.1.2.1.31.1.1.1.15',
    '1.3.6.1.2.1.31.1.1.1.18'
  ],
  '-startindex' => '35'
};

Thu Jun 10 10:24:29 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:24:29 2021: get_snmp_table_objects single returns 1 entries
Thu Jun 10 10:24:29 2021: load $VAR1 = {
  'localtime' => 'Thu Jun 10 10:24:16 2021',
  'ifHCInOctets' => '171305774756',
  'timestamp' => 1623313456,
  'ifHCOutOctets' => '15887428839995'
};
[...]
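
Comparing a trace like the one above with the trace of another plugin version line by line is noisy, because the timestamps differ and Data::Dumper prints the hash keys in varying order. A minimal sketch of an order-insensitive comparison follows; the helper names `normalize` and `only_in` are hypothetical and not part of check_nwc_health, and the timestamp pattern assumes the format seen in these traces.

```python
import re

# Timestamp prefix as printed in the traces, e.g. "Thu Jun 10 10:24:29 2021: "
TS_PREFIX = re.compile(r"^\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}: ")


def normalize(trace):
    """Strip timestamps, indentation and trailing commas; drop empty lines."""
    lines = set()
    for raw in trace.splitlines():
        line = TS_PREFIX.sub("", raw).strip().rstrip(",")
        if line:
            lines.add(line)
    return lines


def only_in(a, b):
    """Sorted list of lines that occur in trace a but nowhere in trace b."""
    return sorted(normalize(a) - normalize(b))


# Shortened excerpts of the get_entries_get_bulk dumps of both versions
old = """Thu Jun 10 10:24:29 2021: get_entries_get_bulk $VAR1 = {
  '-startindex' => '35',
  '-endindex' => '35'
};"""
new = """Thu Jun 10 10:25:22 2021: get_entries_get_bulk $VAR1 = {
  '-endindex' => '35',
  '-maxrepetitions' => 10,
  '-startindex' => '35'
};"""

print(only_in(new, old))  # lines the new version adds
print(only_in(old, new))  # lines the new version no longer prints
```

Because the comparison works on sets, it ignores how often a line repeats and in which order it appears. That is the point here, but it also means it cannot show where in the sequence the two runs diverge.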

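With the newer version, the plugin hangs after the bulk request for the ifTable columns (1.3.6.1.2.1.2.2.1.x). The ifXTable columns under 1.3.6.1.2.1.31.1.1.1 are never requested before the timeout hits. Unlike in the old version, the bulk request also carries '-maxrepetitions' => 10: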

Thu Jun 10 10:25:22 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:25:22 2021: get_snmp_table_objects IFMIB ifTable+ifXTable
Thu Jun 10 10:25:22 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:25:22 2021: get_snmp_table_objects augment IFMIB ifTable with ifXTable
Thu Jun 10 10:25:22 2021: i know package Monitoring::GLPlugin::SNMP::MibsAndOids::IFMIB
Thu Jun 10 10:25:22 2021: get_entries $VAR1 = {
  '-endindex' => '35',
  '-startindex' => '35',
  '-columns' => [
    '1.3.6.1.2.1.2.2.1.2',
    '1.3.6.1.2.1.2.2.1.5',
    '1.3.6.1.2.1.2.2.1.7',
    '1.3.6.1.2.1.2.2.1.8',
    '1.3.6.1.2.1.2.2.1.10',
    '1.3.6.1.2.1.2.2.1.11',
    '1.3.6.1.2.1.2.2.1.13',
    '1.3.6.1.2.1.2.2.1.14',
    '1.3.6.1.2.1.2.2.1.16',
    '1.3.6.1.2.1.2.2.1.17',
    '1.3.6.1.2.1.2.2.1.19',
    '1.3.6.1.2.1.2.2.1.20'
  ]
};

Thu Jun 10 10:25:22 2021: get_entries_get_bulk $VAR1 = {
  '-endindex' => '35',
  '-maxrepetitions' => 10,
  '-columns' => [
    '1.3.6.1.2.1.2.2.1.2',
    '1.3.6.1.2.1.2.2.1.5',
    '1.3.6.1.2.1.2.2.1.7',
    '1.3.6.1.2.1.2.2.1.8',
    '1.3.6.1.2.1.2.2.1.10',
    '1.3.6.1.2.1.2.2.1.11',
    '1.3.6.1.2.1.2.2.1.13',
    '1.3.6.1.2.1.2.2.1.14',
    '1.3.6.1.2.1.2.2.1.16',
    '1.3.6.1.2.1.2.2.1.17',
    '1.3.6.1.2.1.2.2.1.19',
    '1.3.6.1.2.1.2.2.1.20'
  ],
  '-startindex' => '35'
};

Thu Jun 10 10:25:36 2021: AUTOLOAD Classes::Generic::nagios_exit

UNKNOWN - check_nwc_health timed out after 15 seconds
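
In the trace above, the stall is visible as the 14-second gap between the last request at 10:25:22 and the exit at 10:25:36. For longer traces, a small helper can find the largest pause between two timestamped lines. This is only a sketch: `longest_gap` is a hypothetical name, and it assumes English day and month names in the timestamps, as in the output shown here.

```python
import re
from datetime import datetime

# Timestamped trace line, e.g. "Thu Jun 10 10:25:22 2021: some message"
TS = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): (.*)$")


def longest_gap(trace):
    """Return (seconds, message_before, message_after) for the largest
    pause between two consecutive timestamped lines, or None if the
    trace contains fewer than two timestamped lines."""
    stamped = []
    for line in trace.splitlines():
        m = TS.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            stamped.append((when, m.group(2)))

    best = None
    for (t0, msg0), (t1, msg1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


# Shortened excerpt of the trace of the newer version
trace = """Thu Jun 10 10:25:22 2021: get_entries $VAR1 = {
  '-startindex' => '35'
};
Thu Jun 10 10:25:22 2021: get_entries_get_bulk $VAR1 = {
  '-maxrepetitions' => 10,
  '-startindex' => '35'
};
Thu Jun 10 10:25:36 2021: AUTOLOAD Classes::Generic::nagios_exit"""

print(longest_gap(trace))
```

The message before the gap names the last step that was started, here the bulk request, which is where the plugin was waiting for an answer from the device.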

Napsty commented 2 years ago

In the meantime we upgraded the VSX from R80.30 to R81.10, and the timeouts are gone. I tested with different versions of check_nwc_health:

v 7.10.3:

$ ./check_nwc_health --hostname CheckPointFW --protocol 3 --username snmpuser --authpassword secret --authprotocol sha --mode interface-health --contextname vsid2 --name "bond1.513"
OK - bond1.513 is up/up, interface bond1.513 usage is in:0.00% (1187.69bit/s) out:0.00% (0.00bit/s), interface bond1.513 errors in:0.00/s out:0.00/s , interface bond1.513 discards in:0.00/s out:0.00/s , interface bond1.513 broadcast in:0.00% out:0.00%  | 'bond1.513_usage_in'=0.00%;80;90;0;100 'bond1.513_usage_out'=0%;80;90;0;100 'bond1.513_traffic_in'=1187.69;32000000000;36000000000;0;40000000000 'bond1.513_traffic_out'=0;32000000000;36000000000;0;40000000000 'bond1.513_errors_in'=0;1;10;; 'bond1.513_errors_out'=0;1;10;; 'bond1.513_discards_in'=0;1;10;; 'bond1.513_discards_out'=0;1;10;; 'bond1.513_broadcast_in'=0%;10;20;0;100 'bond1.513_broadcast_out'=0%;10;20;0;100

v 8.3.2.2:

$ ./check_nwc_health --hostname CheckPointFW --protocol 3 --username snmpuser --authpassword secret --authprotocol sha --mode interface-health --contextname vsid2 --name "bond1.513"
OK - bond1.513 is up/up, interface bond1.513 usage is in:0.00% (2176.86bit/s) out:0.00% (0.00bit/s), interface bond1.513 errors in:0.00% out:0.00% , interface bond1.513 discards in:0.00% out:0.00% , interface bond1.513 broadcast in:0.00% out:0.00% (% of traffic) in:0.00% out:0.00% (% of bandwidth) | 'bond1.513_usage_in'=0.00%;80;90;0;100 'bond1.513_usage_out'=0%;80;90;0;100 'bond1.513_traffic_in'=2176.86;32000000000;36000000000;0;40000000000 'bond1.513_traffic_out'=0;32000000000;36000000000;0;40000000000 'bond1.513_errors_in'=0%;1;10;0;100 'bond1.513_errors_out'=0%;1;10;0;100 'bond1.513_discards_in'=0%;5;10;0;100 'bond1.513_discards_out'=0%;5;10;0;100 'bond1.513_broadcast_in'=0%;10;20;0;100 'bond1.513_broadcast_out'=0%;10;20;0;100 'bond1.513_broadcast_usage_in'=0%;10;20;0;100 'bond1.513_broadcast_usage_out'=0%;10;20;0;100

v 10.0.0.2:

$ ./check_nwc_health --hostname CheckPointFW --protocol 3 --username snmpuser --authpassword secret --authprotocol sha --mode interface-health --contextname vsid2 --name "bond1.513"
OK - bond1.513 is up/up, interface bond1.513 usage is in:0.00% (1061.33bit/s) out:0.00% (0.00bit/s), interface bond1.513 errors in:0.00% out:0.00% , interface bond1.513 discards in:0.00% out:0.00% , interface bond1.513 broadcast in:0.00% out:0.00% (% of traffic) in:0.00% out:0.00% (% of bandwidth) | 'bond1.513_usage_in'=0.00%;80;90;0;100 'bond1.513_usage_out'=0%;80;90;0;100 'bond1.513_traffic_in'=1061.33;32000000000;36000000000;0;40000000000 'bond1.513_traffic_out'=0;32000000000;36000000000;0;40000000000 'bond1.513_errors_in'=0%;1;10;0;100 'bond1.513_errors_out'=0%;1;10;0;100 'bond1.513_discards_in'=0%;5;10;0;100 'bond1.513_discards_out'=0%;5;10;0;100 'bond1.513_broadcast_in'=0%;10;20;0;100 'bond1.513_broadcast_out'=0%;10;20;0;100 'bond1.513_broadcast_usage_in'=0%;10;20;0;100 'bond1.513_broadcast_usage_out'=0%;10;20;0;100

So all good! The timeouts were not caused by the plugin.
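
The three outputs above differ in more than the traffic values. 7.10.3 reports errors and discards as rates per second, with unitless perfdata such as `0;1;10;;`. 8.3.2.2 and 10.0.0.2 report them as percentages (`0%;1;10;0;100`), use 5;10 as discard thresholds and add the broadcast_usage metrics. To spot such changes when switching plugin versions, the perfdata part can be parsed and compared. The sketch below assumes the usual `'label'=value[unit];warn;crit;min;max` layout with single-quoted labels, as in the outputs above; `parse_perfdata` and `changed_metrics` are hypothetical helper names.

```python
import re

# 'label'=value[unit] followed by up to four ;-separated fields
PERF = re.compile(r"'([^']+)'=(-?[\d.]+)([^;\s]*)((?:;[^;\s]*)*)")


def parse_perfdata(output):
    """Map each perfdata label to its value, unit and threshold fields."""
    _, _, perf = output.partition("|")  # perfdata follows the first pipe
    metrics = {}
    for label, value, unit, rest in PERF.findall(perf):
        fields = rest.split(";")[1:]         # rest starts with ";" if present
        fields += [""] * (4 - len(fields))   # pad missing trailing fields
        warn, crit, low, high = fields[:4]
        metrics[label] = {
            "value": float(value), "unit": unit,
            "warn": warn, "crit": crit, "min": low, "max": high,
        }
    return metrics


def changed_metrics(old_output, new_output):
    """Labels present in both outputs whose unit or thresholds differ."""
    a, b = parse_perfdata(old_output), parse_perfdata(new_output)
    keys = ("unit", "warn", "crit")
    return sorted(
        label for label in a.keys() & b.keys()
        if any(a[label][k] != b[label][k] for k in keys)
    )


# Shortened excerpts of the 7.10.3 and 10.0.0.2 outputs
old_out = ("OK - bond1.513 is up/up | 'bond1.513_usage_out'=0%;80;90;0;100 "
           "'bond1.513_errors_in'=0;1;10;; 'bond1.513_discards_in'=0;1;10;;")
new_out = ("OK - bond1.513 is up/up | 'bond1.513_usage_out'=0%;80;90;0;100 "
           "'bond1.513_errors_in'=0%;1;10;0;100 "
           "'bond1.513_discards_in'=0%;5;10;0;100")

print(changed_metrics(old_out, new_out))
```

Metrics that exist only in one of the two outputs, such as the broadcast_usage ones, are not reported by this sketch. Comparing the key sets of the two parsed dictionaries would reveal them.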