naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Submitting Passive Checks by IP #363

Open eschoeller opened 3 years ago

eschoeller commented 3 years ago

(I originally posted this issue to naemon-users)

I am currently working on migrating a system from Nagios Core 3.4.1 + Merlin to Naemon Core 1.2.4 + Merlin. We use the system to handle SNMP traps, and this is done through SNMPTT (version 1.4), and hence we submit the results as passive check results.

Some of our devices that send traps live behind a Linux machine masquerading traffic over an OpenVPN link. When these traps traverse the masquerading host / VPN link snmptrapd interprets the message slightly differently than if it traversed the network normally.

Here's a trap that traverses over the NAT/VPN path:

May 27 16:34:20 nagios-host snmptrapd[2780]: 2021-05-27 16:34:20 X.X.X.X(via UDP: [Y.Y.Y.Y]:49070->[Z.Z.Z.Z]) TRAP, SNMP v1, community public#012#011PowerNet-MIB::apc Enterprise Specific Trap (PowerNet-MIB::apcTestTrap) Uptime: 141 days, 0:21:54.90#012#011DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1218371490) 141 days, 0:21:54.90#011SNMPv2-MIB::snmpTrapOID.0 = OID: PowerNet-MIB::apcTestTrap

And here is one that doesn't traverse the NAT/VPN path:

May 27 16:35:18 nagios-host snmptrapd[2780]: 2021-05-27 16:35:18 <UNKNOWN> [UDP: [A.A.A.A]:60128->[Z.Z.Z.Z]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1354830458) 156 days, 19:25:04.58#011SNMPv2-MIB::snmpTrapOID.0 = OID: Sentry4-MIB::st4PhasePowerFactorEvent#011Sentry4-MIB::st4SystemLocation.0 = STRING: SPSC N190 F27 (FACMAN)#011Sentry4-MIB::st4PhaseID.2.1.2 = STRING: BA2#011Sentry4-MIB::st4PhaseLabel.2.1.2 = STRING: BA:L2-L3#011Sentry4-MIB::st4PhasePowerFactor.2.1.2 = INTEGER: 19 hundredths#011Sentry4-MIB::st4PhasePowerFactorStatus.2.1.2 = INTEGER: lowAlarm(14)#011Sentry4-MIB::st4EventStatusText.0 = STRING: Low Alarm#011Sentry4-MIB::st4EventStatusCondition.0 = INTEGER: error(1)

The key difference is here:

X.X.X.X(via UDP: [Y.Y.Y.Y]:49070->[Z.Z.Z.Z])

versus:

<UNKNOWN> [UDP: [A.A.A.A]:60128->[Z.Z.Z.Z]]

For reasons that remain a mystery to me, snmptrapd identifies NAT'd packets differently from those that are not NAT'd. When the NAT'd traps hit SNMPTT, the Y.Y.Y.Y address (the "host name") was used instead of the device's actual address, X.X.X.X. We switched to the "$aA" variable in SNMPTT, which is the "Trap agent IP address", instead of the default "$ar", which is simply "IP address".

We would then submit a passive check result with the actual "trap agent IP address" which would map back to the correct device in our Nagios config. Now it appears that Naemon will not accept IP addresses in a passive check submission, rather only a valid host_name. Trying to submit using an IP address yields:

(Failed validation of service as type service (argument 0)) 

And understandably so: IP addresses aren't required to be unique in either a Nagios or a Naemon config, but host_name values certainly are. So a conflict could occur when submitting a passive check result to an IP address that is defined more than once. Would the first match get the check result? Would both? Would neither get the result and an error be generated? If such functionality were implemented, I think there would need to be a specific configuration directive in naemon.cfg to enable "Passive Checks by IP", with a warning describing how conflicts are handled.
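Whether such conflicts exist in a given configuration can be checked up front. The following is only a sketch: it assumes a list of `name;address` lines, one per configured host (how that list is produced is left open), and the function name `find_shared_addresses` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "name;address" lines on stdin and print every
# address that is assigned to more than one host, followed by the host names.
find_shared_addresses() {
    awk -F ';' '
        NF >= 2 {
            count[$2]++
            names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
        }
        END {
            for (addr in count)
                if (count[addr] > 1) print addr ": " names[addr]
        }
    ' | sort
}
```

An empty result means every address maps to exactly one host, so "first match" and "only match" are the same thing.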

To further complicate my own issues, none of our devices are in DNS, and they can't be. We also cannot enable DNS lookups in SNMPTT, for performance and reliability reasons. After I populated all of our devices into a local /etc/hosts file and switched SNMPTT to use the "$A" variable (the trap agent host name), the passive checks started working, but not for any of the traps traversing the NAT/OpenVPN link. Those remained as IP addresses, despite having entries in the /etc/hosts file. So there is potentially an SNMPTT bug mixed in there. And overall I'd prefer not to manage a static /etc/hosts file, either.

Sorry for the long, drawn-out explanation of this issue. I hope it's helpful to frame my position on the matter.

nook24 commented 3 years ago

This may be a hacky approach, but the decision is up to you ^^

Your SNMPTT is probably executing a script (or something similar) to pass the traps to Naemon as passive checks. Naemon (and Nagios as well) has a file called objects.cache, which contains the complete object configuration currently in use. With a simple script you can use it to get a complete list of all host names and IP addresses from the current Naemon configuration. You can store this list in a text file, a database, Redis or whatever.

You can then add a lookup method to the script that gets executed by SNMPTT to find the right IP address. I hacked a little parser script together which shows what I mean: https://github.com/it-novum/naemon-objects-cache-parser
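A rough shell sketch of the same idea (not the linked parser) could look like this. It assumes objects.cache consists of `define host {` blocks with one `key value` pair per line and a closing `}`; the function names are hypothetical, and the path to objects.cache is passed in because it depends on the installation.

```shell
#!/bin/sh
# Print "address<TAB>host_name" for every host block in an objects.cache file.
# $1: path to objects.cache (depends on the installation).
build_ip_map() {
    awk '
        $1 == "define" && ($2 == "host" || $2 == "host{") {
            in_host = 1; name = ""; addr = ""; next
        }
        in_host && $1 == "host_name" {
            sub(/^[ \t]*host_name[ \t]+/, ""); name = $0; next
        }
        in_host && $1 == "address" {
            sub(/^[ \t]*address[ \t]+/, ""); addr = $0; next
        }
        in_host && $1 == "}" {
            if (name != "" && addr != "") print addr "\t" name
            in_host = 0
        }
    ' "$1"
}

# Print the first host name stored for an IP address.
# $1: IP address, $2: map file written by build_ip_map.
lookup_host() {
    awk -F '\t' -v ip="$1" '$1 == ip { print $2; exit }' "$2"
}
```

The map would have to be rebuilt whenever Naemon reloads its configuration, otherwise the lookup works on stale data.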

VladimirBilik commented 3 years ago

I think all this information about Naemon objects is retrievable via Livestatus, which has its own query language. Parsing the cache would only work until somebody decides to change the file format.
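For illustration, a Livestatus query that lists the name and address of every host could be built like this. It is a sketch only: `hosts_query` is a made-up name, and the socket path in the usage comment is the one that appears later in this thread, so it may differ on other systems.

```shell
#!/bin/sh
# Build a Livestatus query that lists every host's name and address.
# The trailing blank line marks the end of the query.
hosts_query() {
    printf 'GET hosts\nColumns: name address\n\n'
}

# Usage (not run here): send the query to the Livestatus socket.
# hosts_query | unixcat /var/cache/naemon/live
```

With the default output format, each result line holds the requested columns separated by semicolons.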

eschoeller commented 3 years ago

Sorry for the late response ... but mixing in some livestatus magic here sounds like a really good idea. I haven't tinkered with livestatus before, but I'm well aware of what it is. Once I get something working I can share it here, in case someone else encounters this issue.

eschoeller commented 3 years ago

This is really kinda ugly and could use a lot of error checking and improvement ... but in case it helps anyone, just a real quick hack to the generic submit_check_result:

# Use Livestatus to reverse-look-up the IP and return the name configured in nagios. Shamelessly take just the first match ...
# The address is passed to printf as an argument so it is never interpreted as part of the format string.
HOST=$(printf 'GET hosts\nColumns: name\nFilter: address = %s\n' "$1" | unixcat /var/cache/naemon/live | head -n 1)

# create the command line to add to the command file
cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$HOST;$2;$3;$4"

I copied it to submit_trap_check_result so the generic one stays intact for other purposes.

submit_check_result itself is pretty darn simple, so :shrug: ;)
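For anyone who wants the error checking mentioned above, a fuller wrapper might look roughly like the sketch below. It is not the script actually in use: the function names are hypothetical, and the default COMMAND_FILE path is an assumption that should be checked against command_file in naemon.cfg.

```shell
#!/bin/sh
# Sketch of a submit_trap_check_result wrapper.
# Arguments: <ip> <service description> <return code> <plugin output>
# COMMAND_FILE is an assumed default; check command_file in naemon.cfg.
LIVESTATUS_SOCKET="${LIVESTATUS_SOCKET:-/var/cache/naemon/live}"
COMMAND_FILE="${COMMAND_FILE:-/var/lib/naemon/naemon.cmd}"

# Resolve an IP address to the first matching configured host name.
lookup_host_by_ip() {
    printf 'GET hosts\nColumns: name\nFilter: address = %s\n' "$1" |
        unixcat "$LIVESTATUS_SOCKET" | head -n 1
}

# Format one external command line for a passive service check result.
format_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Look up the host and append the result to the command file.
# Returns 1 without writing anything if no host has the given address.
submit_trap_check_result() {
    host=$(lookup_host_by_ip "$1")
    if [ -z "$host" ]; then
        echo "no host with address $1" >&2
        return 1
    fi
    format_result "$host" "$2" "$3" "$4" >> "$COMMAND_FILE"
}

# Usage when installed as a script (not run here):
# submit_trap_check_result "$@"
```

Refusing to submit when the lookup comes back empty avoids writing a command with a blank host name, which Naemon would reject anyway.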

sni commented 3 years ago

You could append a Limit: 1 to the query instead of using head. Then Naemon won't have to iterate over all hosts; it will stop as soon as it has the first match.

eschoeller commented 3 years ago

Fantastic idea, thanks!

HOST=$(printf 'GET hosts\nColumns: name\nFilter: address = %s\nLimit: 1\n' "$1" | unixcat /var/cache/naemon/live)