mickem / nscp

NSClient++
http://nsclient.org
GNU General Public License v2.0

nsclient times out #691

Open ivansmm opened 4 years ago

ivansmm commented 4 years ago

Hello,

I have random timeout problems when running an external PowerShell script (namely check_SmartArray.ps1) through NSClient++ in NRPE mode. Occasionally (once or twice per day) Nagios reports the following error when trying to communicate with NSClient++:

07-Jun-2020 14:56:53 SERVICE ALERT;RAID status;UNKNOWN;SOFT;1;CHECK_NRPE: Receive header underflow - only 0 bytes received (4 expected).

In nsclient.log I see the following matching entry:

07-Jun-2020 14:56:54: Socket was unexpectedly closed trying to send data (possibly check your timeout settings)

I added some trace messages to the script being run:

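        $log = 'smartarray.log'

        $ctime = Get-Date
        # ${ctime} is delimited so PowerShell does not parse "$ctime:" as a scope/drive qualifier.
        # Add-Content appends a newline itself; "\n" would be written literally.
        Add-Content -Path $log -Value "${ctime}: running [$prg]"

        # Execute Hp program with needed parameters and remove empty lines
        $res = $exec | Where-Object { $_ }
        #Write-Host $res

        $ctime = Get-Date
        Add-Content -Path $log -Value "${ctime}: received [$res]"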

and I see that at the times of the errors reported by Nagios and NSClient++ the script was not started at all. However, on the next iteration, after Nagios has sent the alert notification, the log shows that the script was started 3 times in sequence within a single second. Can you advise me in which direction to dig?

NSClient++ version is 0.5.2.35 2018-01-28, running on Windows 10 Pro. Nagios version is 4.3.4, running on CentOS 7.
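To make such bursts easy to spot, the trace log can be scanned for timestamps that carry more than one "running" entry. This is a minimal shell sketch, assuming the log has been copied to a Unix-like host and that each start line has the form `<timestamp>: running [...]`; `starts_per_second` is a hypothetical helper name.

```shell
#!/bin/sh
# starts_per_second LOGFILE
# Print "<count> <timestamp>" for every timestamp that has more than one
# "running" entry, i.e. the script was started several times in that second.
starts_per_second() {
    # Keep only the start lines and cut them down to the timestamp part.
    sed -n 's/: running \[.*//p' "$1" |
        sort |
        uniq -c |
        awk '$1 > 1 { sub(/^ +/, ""); print }'
}

# Example (hypothetical file location):
# starts_per_second ./smartarray.log
```

An empty result means no second contained more than one start.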

Best regards,

mintsoft commented 4 years ago

@ivansmm there could be a misconfiguration in nsclient.ini. How have you defined the check in there?

ivansmm commented 4 years ago

Here's my nsclient.ini (I had to rename it to .txt; the web interface does not allow attaching .ini files for some reason): nsclient.txt

On the Nagios side I have the following settings:

service_check_timeout=90 (in nagios.cfg)

$USER1$/check_nrpe -H $HOSTADDRESS$ -t 80 -c check_raid (in command definition)

mintsoft commented 4 years ago

OK, that all looks fine to me. I think the underlying script must be hanging on some resource somewhere, then. If you ran it outside NSClient++, I think you would experience the same thing.

Alternatively, if there's a huge spike in NRPE requests at the same time, you might be overwhelming the NRPE server in NSClient++. If you change this to use check_nsc_web and the REST API, does the problem go away?
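One way to tell these two explanations apart is to probe the check from the Nagios host at a fixed interval and record how long each call takes and how it exits. This is a minimal shell sketch; `probe` and `probe_loop` are hypothetical helper names, and the plugin path and host address in the usage comment are assumptions, not values from this thread.

```shell
#!/bin/sh
# probe CMD [ARGS...]
# Run CMD once and print "<timestamp> rc=<exit code> secs=<duration>".
probe() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') rc=$rc secs=$((end - start))"
}

# probe_loop COUNT INTERVAL CMD [ARGS...]
# Repeat the probe COUNT times, waiting INTERVAL seconds between runs.
probe_loop() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        probe "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (path and address are assumptions): one probe every 2 minutes for a day.
# probe_loop 720 120 /usr/lib64/nagios/plugins/check_nrpe \
#     -H 192.0.2.10 -t 80 -c check_raid >> nrpe_probe.log
```

Lines with a non-zero `rc` can then be matched by timestamp against nsclient.log and the script's own trace log.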

ivansmm commented 4 years ago

I have added test printouts to the script itself, and the maximum delay it shows is 19 seconds. Even that happens only when the script is executed after the error has occurred. There cannot be any NRPE request spikes, since this script is normally invoked once every 10 minutes, and the interval decreases to 2 minutes when an error occurs. No other Nagios probes are executed (except the host ICMP probe, which is performed without NRPE). Thanks for the check_nsc_web advice, I will try it later. It seems I need to build it manually, since no packages for CentOS are available.