naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Unable to process passive service checks with nrdp #267

Closed bshaw closed 5 years ago

bshaw commented 5 years ago

Hi all-

We want to migrate our monitoring core from Nagios to Naemon. We use NCPA as our remote agent, so passive checks are submitted via NRDP. We do not have the option to convert the agents to nsclient++.

There is a bug with processing passive service checks when using NRDP (passive host checks are fine). The NRDP debug log shows that it is receiving the proper check results; however, Naemon sets the service to critical with the message (Service check timed out after 0.00 seconds).

All of my searching points me to a bug which was fixed in Nagios core 4.0.3, but which has never been addressed in the Naemon core. Here's a link to the diff on the Nagios side - based on what I'm seeing, this would correspond to checks_service.c in the Naemon code: https://sourceforge.net/p/nagios/nagioscore/ci/5b8ad80519405cdf004229b04327bfb1f1d5fd9c/

Are there any known work-arounds for this issue? Would it be possible to get the fix into a future release?

Thanks!

nook24 commented 5 years ago

I'm not familiar with NCPA/NRDP. How does this transfer the check results to Naemon? Does it make use of naemon.cmd or does it use check_result_path from naemon.cfg?

bshaw commented 5 years ago

Thanks for the quick reply!

NRDP: https://github.com/NagiosEnterprises/nrdp

It uses the check_result_path. Interestingly, it is working fine with passive host checks, only failing for services. This tells me my permissions are already good, but including some info here for completeness, anyway:

root@naemontst:~# ls -ld /var/cache/naemon/checkresults/
drwxrwsr-x 2 naemon naemon 4096 Nov  1 07:03 /var/cache/naemon/checkresults/

root@naemontst:~# ls -la /var/lib/naemon/naemon.cmd
prw-rw---- 1 naemon naemon 0 Oct 29 15:31 /var/lib/naemon/naemon.cmd

root@naemontst:~# grep naemon /etc/group
naemon:x:116:naemon,www-data

nook24 commented 5 years ago

First of all, I am not associated with Nagios or Naemon. However, from my experience, the only thing I can say is: do not use check_result_path. This worked quite well with Nagios 3, back in the old days... With Nagios 4 I ran into strange Bus error behaviors and with Naemon I got Segmentation fault issues.

It looks like NRDP can also use naemon.cmd, so you should try this. I thought check_result_path is deprecated and will be removed? So maybe a Naemon dev can give us more information about this.
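For readers unfamiliar with the naemon.cmd route mentioned above, here is a minimal Python sketch of submitting a passive service result through the external command pipe. It assumes the standard PROCESS_SERVICE_CHECK_RESULT external command and the pipe path shown earlier in this thread (/var/lib/naemon/naemon.cmd). The function names are hypothetical and not part of NRDP or Naemon.

```python
import time


def format_passive_service_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the external command file.

    Hypothetical helper for illustration only.
    """
    timestamp = int(time.time() if now is None else now)
    # The command must stay on a single line, so real newlines in the
    # plugin output are replaced by the two characters backslash + n.
    safe_output = output.replace("\n", "\\n")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{safe_output}\n"
    )


def submit(line, command_file="/var/lib/naemon/naemon.cmd"):
    """Write one command line to the named pipe.

    Opening the pipe blocks until the core has it open for reading,
    and the calling user needs write permission on the pipe.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A call such as `submit(format_passive_service_result("servername", "CPU Usage", 0, "OK: fine"))` would queue one result. Note that this path gives no confirmation that the core processed the result.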

For passive services I use a configuration like this:

# naemon.cfg
check_service_freshness=1
service_freshness_check_interval=60

# commands.cfg
define command{
    command_name                       check_freshness
    command_line                       $USER1$/check_dummy 3 "Service freshness expired"
}

# service.cfg
define service{
    use                                 CHECK_DHCP_TEMPLATE
    host_name                           example host
    name                                CHECK_DHCP
    display_name                        CHECK_DHCP
    service_description                 CHECK_DHCP

    ;Check settings:
    check_command                       check_freshness
    check_interval                      300  ;Expect a passive check result every 5 minutes
    active_checks_enabled               0
    passive_checks_enabled              1
    check_freshness                     1 
    freshness_threshold                 600  ;Execute check_freshness if 10 minutes without a passive check result passed.

    ;Everything else:
    servicegroups                       sgroup1,sgroup2
}

bshaw commented 5 years ago

I just skimmed through their code and it looks like there's no option to use the external command file for passive check results. They have it set up so checks go to the check_result_path and any commands go to the command pipe.

We, luckily, only have a small handful of passive checks, so it will be much simpler to convert those to active and just move on. I would switch us to NSCA, but the effort in re-writing all of our checks would be a massive burden.

I'm interested in hearing from the Naemon folks as to whether they plan to remove the check_result_path bits. If it's not going away anytime soon, maybe they could consider patching up this passive service problem for anyone who currently lives with NRDP (I'm certainly not the only one, right!?)

sni commented 5 years ago

I thought check_result_path is deprecated and will be removed? So maybe a Naemon dev can give us more information about this.

No, that's not the case. There is no reason to deprecate the check_result_path. It is used by many addons. If there are segfaults, please raise an issue, ideally with the check_result itself and a gdb backtrace if possible.

sni commented 5 years ago

Btw, could you stop naemon for a few seconds to catch a file from nrdp in your check_results path? It would be interesting to see what those files look like.

nook24 commented 5 years ago

No, that's not the case. There is no reason to deprecate the check_result_path.

Years ago we talked about this issue, maybe in the IRC channel, because I can't find anything about this in my mails or GitHub issues. At that time, someone from op5 came to me and told me I should use naemon.qh.

However, I found that check_result_path was removed with Naemon 1.0.4: "Stop logging if check_result_path (deprecated) is not available even if it’s set". It is also gone from the example configuration.

With Naemon 1.0.7 it is back again: "Undeprecate check_result_path".

Maybe this is the reason why one of my Naemon 1.0.7 systems struggles with dying workers and Bad file descriptor errors in strace? My fix was a downgrade to 1.0.6.

My experience: handling passive check results is a pain with Nagios (and only a bit better with Naemon). The options:

naemon.cmd

check_result_path

naemon.qh (best solution so far)

So at the moment, a friend of mine is working on a broker module that can handle passive check results.


But none of this helps @bshaw ^^

sni commented 5 years ago

Yes, we reverted deprecating check_result_path. I think it's useful. Btw, some more off-topic: we usually use mod-gearman/send_gearman to submit passive check results. Never had issues so far. Again, if using the check_result_path crashes naemon, please open an issue.

bshaw commented 5 years ago

Here you go!

A couple simple services - these are what fail:

### NRDP Check ###
start_time=1541192382.0
# Time: Fri, 02 Nov 2018 20:59:42 +0000
host_name=servername
service_description=CPU Usage
check_type=1
early_timeout=1
exited_ok=1
return_code=0
output=OK: Percent was 0.00 %, 2.00 % | 'percent_0'=0.00%;85;90; 'percent_1'=2.00%;85;90;\n

### NRDP Check ###
start_time=1541192732.0
# Time: Fri, 02 Nov 2018 21:05:32 +0000
host_name=servername
service_description=Memory Usage
check_type=1
early_timeout=1
exited_ok=1
return_code=0
output=OK: Used memory was 16.30 % (Available: 3.22 GiB, Total: 3.85 GiB, Free: 2.69 GiB, Used: 0.38 GiB) | 'available'=3.22GiB;3;3; 'total'=3.85GiB;3;3; 'free'=2.69GiB;3;3; 'used'=0.38GiB;3;3;\n

And a host - these work fine:

### NRDP Check ###
start_time=1541192527.0
# Time: Fri, 02 Nov 2018 21:02:07 +0000
host_name=servername
check_type=1
early_timeout=1
exited_ok=1
return_code=0
output=OK: Agent_version was ['2.1.3']\n

nook24 commented 5 years ago

@bshaw your check result file does not look too bad to me.

The files my generator builds look like this:

### Passive Check Result File ###
file_time=1541197701

### Passive-Injection ###
# Time: Fri, 02 Nov 2018 20:59:42 +0000
host_name=Foobar
service_description=Ping
check_type=1
early_timeout=0
exited_ok=1
start_time=1541197701
finish_time=1541197701
return_code=0
output=PING OK - Packet loss = 0%, RTA = 0.12 ms | rta=0.119000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0 

The only difference I see is that early_timeout is 0 in my file (because this is hardcoded in my generator). I guess a passive check cannot time out early, because it is not executed by the system? In NRDP it is hardcoded to 1: https://github.com/NagiosEnterprises/nrdp/blob/3390032562e4f3dbdfd200dae9116ab5210681cd/server/plugins/nagioscorepassivecheck/nagioscorepassivecheck.inc.php#L241 @sni what are your thoughts on this?
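To make the comparison concrete, here is a minimal Python sketch of a generator that emits a check result in the layout of the sample above, with early_timeout fixed at 0. The function name and parameters are hypothetical. The field order simply mirrors the sample file, and whether the core requires exactly this order is an assumption.

```python
import time


def build_check_result(host, output, return_code=0, service=None, when=None):
    """Return the text of a passive check result file (illustrative only).

    Omitting `service` produces a host check result, matching the host
    sample in this thread, which has no service_description line.
    """
    when = int(time.time() if when is None else when)
    lines = [
        "### Passive Check Result File ###",
        f"file_time={when}",
        "",
        "### Passive-Injection ###",
        f"host_name={host}",
    ]
    if service is not None:
        lines.append(f"service_description={service}")
    lines += [
        "check_type=1",      # 1 = passive, as in every sample in this thread
        "early_timeout=0",   # a passive result is not run by the core, so no timeout
        "exited_ok=1",
        f"start_time={when}",
        f"finish_time={when}",
        f"return_code={return_code}",
        f"output={output}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_check_result("Foobar", "PING OK", service="Ping")` yields a service result. Writing the text into check_result_path is left out of the sketch.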


Some more offtopic - I'm sorry.

Again, if using the check_result_path crashes naemon, please open an issue.

I will do this. Will see when I find some time to test this again.

we usually use mod-gearman/send_gearman to submit passive check results.

You are right. I also tested mod_gearman for this kind of job. Unfortunately, like naemon.cmd, mod_gearman is a UDP-like method, so I have no idea if my results reached Naemon and whether they got processed or not. I ended up with a few hundred thousand or a million records in the mod_gearman queue (for example, if Naemon dies for some reason).

So a modern way of passing passive check results would be super awesome. Something not text file / pipe based or one way communication only. Maybe a JSON HTTP API? :)

I guess I'm facing issues like this because my systems have 40k+ passive services. Most of the time everything works well but all this external command stuff doesn't feel super solid...

sni commented 5 years ago

My thoughts are that this should be fixed in NRDP. You even found the bug there already. It does not make sense to set early_timeout to TRUE and then expect the core to fix it.
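Until NRDP is fixed, affected files can at least be spotted. The Python sketch below parses the key=value layout of the samples posted in this thread and flags a passive result (check_type=1) that also claims early_timeout=1. The function names are hypothetical. Note that the host sample carries the same flag, and per this thread only service results were affected, so the check marks the contradictory input rather than predicting how the core reacts.

```python
def parse_check_result(text):
    """Parse one check result block into a dict of field name -> value."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment/banner lines such as "### NRDP Check ###".
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only; perfdata in `output` contains more of them.
        key, sep, value = line.partition("=")
        if sep:
            fields[key] = value
    return fields


def has_contradictory_timeout(fields):
    """True when a passive result is also marked as timed out early."""
    return fields.get("check_type") == "1" and fields.get("early_timeout") == "1"
```

Running `has_contradictory_timeout(parse_check_result(text))` over the contents of a captured file tells you whether it carries the flag combination discussed above.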