lausser / check_nwc_health

nwc = network component. This plugin checks many aspects of routers, switches, WLAN controllers, firewalls, and so on.
http://labs.consol.de/nagios/check_nwc_health
GNU General Public License v2.0

Reporting VPN is down even when it is up - ASA5510 #198

Closed: lpossamai closed this issue 5 years ago

lpossamai commented 5 years ago

Hi.

I am checking a site-to-site VPN using your script. The VPN is up (I can see it from the Cisco ASA software), but check_nwc_health is reporting it as down, as shown below.

check_nwc_health:

root@icinga:/etc/icinga2/zones.d/master# /usr/lib/nagios/plugins/check_nwc_health '--community' 'public' '--hostname' '192.168.99.254' '--mode' 'vpn-status' --name 203.xx.xx.18
CRITICAL - other phase1 failure 26.xx.xx.14->203.xx.xx.18 0h 28m 43s ago, tunnel 26.xx.xx.14 (26.xx.xx.14)->203.xx.xx.18 (203.xx.xx.18) is active

The error "other phase1 failure" is unknown to me. The VPN is up and I can reach the hosts on the other side.

What could be happening?

Cheers!

lpossamai commented 5 years ago

UPDATE:

Just a few seconds after I created this issue, the status changed to OK. Even though it is working fine now, I'd like to know why that happened, please.

Cheers!

lpossamai commented 5 years ago

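Hi. Another case. (Screenshot taken 2019-01-16 at 11:40:55 AM: https://user-images.githubusercontent.com/17607576/51214805-d404b800-1983-11e9-85c8-9c43092c8cab.png)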

As you can see from the screenshot above, the VPN is up.

However, the script shows it as down:

root@icinga:/etc/icinga2/zones.d/master# /usr/lib/nagios/plugins/check_nwc_health '--community' 'public' '--hostname' '192.168.99.254' '--mode' 'vpn-status' --name 203.xx.xxx.211
CRITICAL - tunnel to 203.xx.xxx.211 does not exist

lausser commented 5 years ago

Run an snmpwalk and check the output. check_nwc_health only has the information that it gets from the device via SNMP.

(Adding -vvvvvvvvvv to the check_nwc_health command line will show you the OIDs.)
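To compare what the device itself reports with what the plugin prints, one option is to walk the CISCO-IPSEC-FLOW-MONITOR-MIB subtree (OID 1.3.6.1.4.1.9.9.171) directly. The following is only a minimal sketch: it assumes Net-SNMP's `snmpwalk` is installed and that the device answers SNMP v2c, and the helper names `build_snmpwalk_cmd` and `walk` are hypothetical, not part of check_nwc_health.

```python
import subprocess

# Root OID of CISCO-IPSEC-FLOW-MONITOR-MIB (ciscoIpSecFlowMonitorMIB).
IPSEC_FLOW_MONITOR_OID = "1.3.6.1.4.1.9.9.171"


def build_snmpwalk_cmd(host, community="public", oid=IPSEC_FLOW_MONITOR_OID):
    """Return the argv list for a Net-SNMP snmpwalk call.

    Assumption: the device speaks SNMP v2c. "-On" prints numeric OIDs so
    they can be compared with the OIDs in the plugin's verbose output.
    """
    return ["snmpwalk", "-v2c", "-c", community, "-On", host, oid]


def walk(host, community="public"):
    """Run snmpwalk and return its output lines (needs Net-SNMP installed)."""
    result = subprocess.run(
        build_snmpwalk_cmd(host, community),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `walk("192.168.99.254")` would then return the raw rows that the plugin has to work with, including any entries that have not yet been cleared from the device's tables.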

0xliam commented 3 years ago

Sorry to bump an old issue, but we are running into similar behaviour and I think this is likely what @lpossamai was running into.

When a VPN fails and is re-established, the check does not recover until some time after the tunnel is back up, because it looks at entries in the cikeFailTable from CISCO-IPSEC-FLOW-MONITOR-MIB.

The entries in the table are still present even after the tunnel is active, and until they are cleared the check exits as CRITICAL.

CRITICAL - peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 21m 25s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 12m 20s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 11m 19s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 10m 19s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 9m 19s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 20m 24s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 19m 24s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 18m 23s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 17m 23s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 16m 22s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 15m 21s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 14m 21s ago
peerLost phase1 failure 1.2.3.4->5.6.7.8 0h 13m 20s ago
tunnel 1.2.3.4 (redacted)->5.6.7.8 is active
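Output like the above can be summarised mechanically to see how old the reported failures are and whether the tunnel is reported as active at the same time. This is a minimal sketch that assumes the message format is exactly the one shown in this thread (`<reason> phase1 failure <local>-><remote> <h>h <m>m <s>s ago` and `tunnel ... is active`); the function name `summarize` is hypothetical, and other age formats (for example with days) are not handled.

```python
import re

# Matches one failure entry as printed in this thread.
FAILURE_RE = re.compile(
    r"(\w+) phase1 failure (\S+)->(\S+) (\d+)h (\d+)m (\d+)s ago"
)
# Matches the "tunnel A->B is active" message on a single line.
ACTIVE_RE = re.compile(r"tunnel .*->.* is active")


def summarize(output):
    """Extract failure entries (with age in seconds) and the active flag."""
    failures = []
    for match in FAILURE_RE.finditer(output):
        reason, local, remote, hours, minutes, seconds = match.groups()
        age = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        failures.append(
            {"reason": reason, "local": local, "remote": remote, "age_seconds": age}
        )
    return {
        "tunnel_active": bool(ACTIVE_RE.search(output)),
        "failures": failures,
    }
```

Applied to the output above, `summarize` would report `tunnel_active` as true alongside a non-empty list of failures, which is exactly the contradictory state described here.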

It looks like those entries were cleared out after roughly 45 minutes for us.

This is on a Cisco CISCO1941/K9.

R02#show version 
Cisco IOS Software, C1900 Software (C1900-UNIVERSALK9-M), Version 15.5(3)M3, RELEASE SOFTWARE (fc2)

I'm not aware of a way to clear the table from the CLI, but it would make more sense for the plugin to check only the current status and ignore any recorded failures if the tunnel is active when the check runs.
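The difference between the behaviour described in this thread and the suggested one can be illustrated as follows. This is a hypothetical sketch, not the plugin's actual code (check_nwc_health is written in Perl): the names `IkeFailure`, `status_as_observed` and `status_as_proposed` are invented for illustration, and the failure and tunnel data are assumed to have already been read from the device.

```python
from dataclasses import dataclass

# Nagios plugin exit codes.
OK, CRITICAL = 0, 2


@dataclass
class IkeFailure:
    reason: str       # e.g. "peerLost"
    remote: str       # remote peer address
    age_seconds: int  # how long ago the failure was recorded


def status_as_observed(failures, active_peers, peer):
    """Behaviour as described in this thread: any recorded failure for the
    peer yields CRITICAL, even if the tunnel is currently active."""
    if peer not in active_peers:
        return CRITICAL
    if any(f.remote == peer for f in failures):
        return CRITICAL
    return OK


def status_as_proposed(failures, active_peers, peer):
    """Suggested behaviour: the current tunnel state decides the result;
    recorded failures are ignored while the tunnel is active."""
    return OK if peer in active_peers else CRITICAL


if __name__ == "__main__":
    stale = [IkeFailure("peerLost", "5.6.7.8", 1285)]
    print(status_as_observed(stale, {"5.6.7.8"}, "5.6.7.8"))
    print(status_as_proposed(stale, {"5.6.7.8"}, "5.6.7.8"))
```

A middle ground would be to keep the recorded failures in the plugin's output text while letting the current tunnel state determine the exit code, so that the history is not lost.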