mickem / nscp

NSClient++
http://nsclient.org
GNU General Public License v2.0
243 stars 94 forks source link

Unsolicited reponse received with check_nsc_web #550

Closed bmccorkle closed 6 years ago

bmccorkle commented 6 years ago

Issue and Steps to Reproduce

Setting up a brand new Icinga2 installation and I was trying to get service checks to work with the new REST API. I started noticing on Icinga2 that a fair amount of the time the service checks are giving a "<Terminated by signal 9 (Killed).>" with the check_nscp_api native check Icinga has. I tried to use the check_nsc_web plugin to troubleshoot and sometimes the check completes, sometimes it gives a "Unsolicited response received on idle HTTP channel"

PLEASE PROVIDE COMMAND HERE ./check_nsc_web-master -k -p passphrase -u https://192.168.1.10:8445 check_cpu

Expected Behavior

OK: CPU load is ok.|'total 5m'=0%;80;90;; 'total 1m'=0%;80;90;; 'total 5s'=0%;80;90;;

Actual Behavior

2018/03/01 15:23:04 Unsolicited response received on idle HTTP channel starting with "\x00k\x00e\x00t\x00s\x00 \x00i\x00s\x00 \x00t\x00h\x00e\x00 \x00t\x00o\x00t\x00a\x00l\x00 \x00n\x00u\x00m\x00b\x00e\x00r\x00 \x00o\x00f\x00 \x00p\x00a\x00c\x00k\x00e\x00t\x00s\x00 \x00r\x00e\x00c\x00e\x00i\x00v\x00e\x00d\x00 \x00a\x00n\x00d\x00 \x00d\x00i\x00s\x00c\x00a\x00r\x00d\x00e\x00d\x00 \x00o\x00n\x00 \x00a\x00 \x00p\x00u\x00b\x00l\x00i\x00c\x00 \x00i\x00n\x00t\x00e\x00r\x00f\x00a\x00c\x00e\x00.\x00 \x00T\x00h\x00i\x00s\x00 \x00c\x00o\x00u\x00n\x00t\x00e\x00r\x00 \x00r\x00e\x00p\x00r\x00e\x00s\x00e\x00n\x00t\x00s\x00 \x00t\x00h\x00e\x00 \x00s\x00u\x00m\x00m\x00a\x00t\x00i\x00o\x00n\x00 \x00o\x00f\x00 \x00a\x00l\x00l\x00eck_uptime, check_version, check_warning, checkalwayscritical, checkalwaysok, checkalwayswarning, checkcounter, checkcpu, checkcritical, checkdrivesize, checkeventlog, checkfiles, checkmem, checkmultiple, checkok, checkprocstate, checkservicestate, check"; err= OK: CPU load is ok.|'total 5m'=0%;80;90;; 'total 1m'=0%;80;90;; 'total 5s'=0%;80;90;;

Details

Additional Details

NSClient++ log:


2018-03-01 15:26:53: debug:c:\source\master\service\NSClient++.cpp:219: On crash: restart: nscp
2018-03-01 15:26:53: debug:c:\source\master\service\NSClient++.cpp:231: Archiving crash dumps in: C:\Program Files\NSClient++/crash-dumps
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: CheckDisk
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: CheckEventLog
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: CheckExternalScripts
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: CheckHelpers
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: CheckSystem
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: NSCAClient
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:150: Found: WEBServer
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\CheckDisk.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\CheckEventLog.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\CheckExternalScripts.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\CheckHelpers.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\CheckSystem.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\NSCAClient.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:306: Loading module C:\Program Files\NSClient++\modules\WEBServer.dll ()
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: CheckDisk
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: CheckEventLog
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: CheckExternalScripts
2018-03-01 15:26:53: debug:c:\source\master\modules\CheckExternalScripts\CheckExternalScripts.cpp:127: No aliases found (adding default)
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: CheckHelpers
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: CheckSystem
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: NSCAClient
2018-03-01 15:26:53: debug:c:\source\master\service\plugin_manager.cpp:227: Loading plugin: WEBServer
2018-03-01 15:26:53: debug:c:\source\master\modules\WEBServer\WEBServer.cpp:161: Using certificate: C:\Program Files\NSClient++/security/certificate.pem
2018-03-01 15:26:53: debug:c:\source\master\modules\WEBServer\WEBServer.cpp:185: Loading webserver on port: 8445
2018-03-01 15:26:53: debug:c:\source\master\service\NSClient++.cpp:317: NSClient++ - 0.5.2.39 2018-02-04 Started!
2018-03-01 15:26:53: debug:c:\source\master\service\NSClient++.cpp:441: Starting: DONE
2018-03-01 15:26:54: debug:c:\source\master\modules\CheckSystem\pdh_thread.cpp:169: Loading counter: disk_queue_length_0 C: = \\Srv\PhysicalDisk(0 C:)\% Disk Time
2018-03-01 15:26:54: debug:c:\source\master\modules\CheckSystem\pdh_thread.cpp:169: Loading counter: disk_queue_length__Total = \\Srv\PhysicalDisk(_Total)\% Disk Time
bmccorkle commented 6 years ago

Okay, I played around with it today and also tried 0.5.2.35 (same problem) but then went to 0.5.1.46 and the problem hasn't exhibited itself so far either on Icinga2 or manually running check_nsc_web. Looks like I'll stick with the 5.1 till 5.2 is fully released.

mintsoft commented 6 years ago

Interesting, I use check_nsc_web on about 100 servers from naemon and don't have this behaviour against 0.5.2.35, is that error being recorded within icinga? Are the check_nsc_web processes reused rather than being killed off and re-executed somehow?

mickem commented 6 years ago

Does anyone know what URL it is hitting? Looks like it is getting protobuf data back (instead of json)

bmccorkle commented 6 years ago

Unfortunately, I don't have a second server to test this on at the moment. This is a new deployment I'm working on setting up. Once I get ready to install nsclient on a second server I can load 5.2.35 again and try to get you some additional logs.

philomory commented 6 years ago

I suspect this is related to https://github.com/golang/go/issues/19895; depending on how you want to look at it, this is either an issue with NSClient++ (if it's sending not-to-HTTP-spec data), or with check_nsc_web (for not gracefully suppressing Go's default output when receiving such data), or both. I've filed an issue with check_nsc_web, https://github.com/m-kraus/check_nsc_web/issues/7.

As a workaround, here's a dead-simple wrapper script appears to deal with the issue while waiting for a real fix:

#!/usr/bin/env bash

# See https://github.com/m-kraus/check_nsc_web/issues/7
# and also https://github.com/mickem/nscp/issues/550
# and also https://github.com/golang/go/issues/19895
/usr/lib64/nagios/plugins/check_nsc_web "$@" |& grep -v "Unsolicited response received on idle HTTP channel"
exit ${PIPESTATUS[0]}

Obviously adjust the path appropriately if you've put check_nsc_web somewhere else.

philomory commented 6 years ago

Incidentally, it's worth noting the that in the example posted above, if you remove the null bytes you get entirely intelligible text: "kets is the total number of packets received and discarded on a public interface. This counter represents the summation of all". I'm not really a C# guy, but that almost sort of looks like a buffer overflow?

m-kraus commented 6 years ago

Currently check_nsc_web queries the legacy api like so ("-v" enables verbose output):

$ ./check_nsc_web -v -u https://HOST:PORT -p PASSWORD -k -t 30 check_cpu
REQUEST:
"GET /query/check_cpu HTTP/1.1\r\nHost: HOST:PORT\r\nUser-Agent: Go-http-client/1.1\r\nPassword: PASSWORD\r\nAccept-Encoding: gzip\r\n\r\n"
RESPONSE:
"HTTP/1.0 200 200\r\nConnection: close\r\nContent-Length: 379\r\n\r\n{\"header\":{\"source_id\":\"\"},\"payload\":[{\"command\":\"check_cpu\",\"lines\":[{\"message\":\"OK: CPU load is ok.\",\"perf\":[{\"alias\":\"total 5m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":6,\"warning\":80}},{\"alias\":\"total 1m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":4,\"warning\":80}},{\"alias\":\"total 5s\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":4,\"warning\":80}}]}],\"result\":\"OK\"}]}"
OK: CPU load is ok.|'total 5m'=6%;80;90;; 'total 1m'=4%;80;90;; 'total 5s'=4%;80;90;;

Or with filters:

$ ./check_nsc_web -v -u https://HOST:PORT -p PASSWORD -k -t 30 check_cpu 'filter=none' 'time=5m' 'time=15m' 'time=60m'
REQUEST:
"GET /query/check_cpu?filter=none&time=5m&time=15m&time=60m HTTP/1.1\r\nHost: HOST:PORT\r\nUser-Agent: Go-http-client/1.1\r\nPassword: PASSWORD\r\nAccept-Encoding: gzip\r\n\r\n"
RESPONSE:
"HTTP/1.0 200 200\r\nConnection: close\r\nContent-Length: 889\r\n\r\n{\"header\":{\"source_id\":\"\"},\"payload\":[{\"command\":\"check_cpu\",\"lines\":[{\"message\":\"OK: CPU load is ok.\",\"perf\":[{\"alias\":\"core 0 5m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":7,\"warning\":80}},{\"alias\":\"core 1 5m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":2,\"warning\":80}},{\"alias\":\"total 5m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":5,\"warning\":80}},{\"alias\":\"core 0 15m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":8,\"warning\":80}},{\"alias\":\"core 1 15m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":3,\"warning\":80}},{\"alias\":\"total 15m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":6,\"warning\":80}},{\"alias\":\"core 0 60m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":8,\"warning\":80}},{\"alias\":\"core 1 60m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":4,\"warning\":80}},{\"alias\":\"total 60m\",\"int_value\":{\"critical\":90,\"unit\":\"%\",\"value\":6,\"warning\":80}}]}],\"result\":\"OK\"}]}"
OK: CPU load is ok.|'core 0 5m'=7%;80;90;; 'core 1 5m'=2%;80;90;; 'total 5m'=5%;80;90;; 'core 0 15m'=8%;80;90;; 'core 1 15m'=3%;80;90;; 'total 15m'=6%;80;90;; 'core 0 60m'=8%;80;90;; 'core 1 60m'=4%;80;90;; 'total 60m'=6%;80;90;;

Currently on all my customer sites only NSClient versions below 0.5.2 are in use. The problem doesn't occur there.

mickem commented 6 years ago

As a reference, this issue is only available on large payloads so most likely thats why some have not seen it. for instance if you run check_cpu it work, but you run without a command it hits the root index.html which is "large".

jvigna commented 6 years ago

Hi Michael,

I used your last stable version (0.5.2.35) and also with check_cpu I get this error, but strangely only if I for some time do not execute the check. If I execute the check say every 5 seconds then I have the expected behaviour without problems. If I leave it for say 2 Minutes then execute the check again I have a longer wait of some seconds and then the right or wrong answer randomly.

./check_nsc_web -k -p "*****" -u "https://myhost:8443" checkcpu 2018/05/07 16:16:35 Unsolicited response received on idle HTTP channel starting with "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x00\x00\x00\x00\x05D\xf6\ad\x03\x00\x80s\x00e\x00l\x00e\x00c\x00t\x00 \x00N\x00e\x00t\x00C\x00o\x00n\x00n\x00e\x00c\x00t\x00i\x00o\x00n\x00I\x00D\x00,\x00 \x00M\x00A\x00C\x00A\x00d\x00d\x00r\x00e\x00s\x00s\x00,\x00 \x00N\x00a\x00m\x00e\x00,\x00 \x00N\x00e\x00t\x00C\x00o\x00n\x00n\x00e\x00c\x00t\x00i\x00o\x00n\x00S\x00t\x00a\x00t\x00u\x00s\x00,\x00 \x00N\x00e\x00t\x00E\x00n\x00a\x00b\x00l\x00e\x00d\x00,\x00 \x00S\x00p\x00e\x00e\x00d\x00 \x00f\x00r\x00o\x00m\x00 \x00W\x00i\x00n\x003\x002\x00\x00\x00\x00\x00v\x02\x00\x00~\x02\x00\x00\x01\b\x00\x00\x00\x8b\x02\x00\x00\x9b\x02\x00\x00\x01\b\x00\x00\x00\xab\x02\x00\x00\xbb\x02\x00\x00\x01\x03\x00\x00\x00\x00\x05A\x10\xc8\x02\x00\x00\x01\x03\x00\x00\x00\x00\x00\x00\x00\xd3\x02\x00\x00\x01\x03\x00\x00\x00\x00\x00\x00\x00\xde\x02\x00\x00\x01\x03\x00\x00\x00\xfc\xff\xff\xff\xec\x02\x00\x00\x01\x03\x00\x00\x00d\x00\x00\x00\x00uint64\x00\x00DisplayName\x00\x00Bytes Sent/sec\x00\x00DisplayName009\x00\x00Bytes Sent/sec\x00\x00CounterType\x00\x00PerfIndex\x00\x00HelpIndex\x00\x00DefaultScale\x00\x00DetailLevel\x00\x00BytesTotalPersec\x00\x15\x00\x00\x00\x02\x00\x10"; err= OK: CPU load is ok.|'total 5m'=4%;80;90;; 'total 1m'=5%;80;90;; 'total 5s'=5%;80;90;;

mickem commented 6 years ago

This was fixed in 0.5.2.35 (as well as some recent 0.5.3 version) https://github.com/mickem/nscp/releases/tag/0.5.2.41

jvigna commented 6 years ago

Yes I tried the new 0.5.2.41 version and can confirm that this fixes the problem! Thanks a lot!