mickem / nscp

NSClient++
http://nsclient.org
GNU General Public License v2.0

cpu load check through REST API frozen + instability after migrating from 0.5.0.62 to 0.5.1.46 #504

Open Tontonitch opened 6 years ago

Tontonitch commented 6 years ago

Issue and Steps to Reproduce

Using NSCP 0.5.0.62 with Icinga2 and the check_nscp_api.exe plugin, I wanted to upgrade the running NSCP version to the latest stable release, 0.5.1.46, to

Unfortunately, the new version introduces:

image

No way to get the cpu usage check working correctly. The other checks, done via Icinga2 launching commands like nscp client…, seem to have continued to work correctly. For example, the memory check:

image

The most important problem is that I tried to revert to version 0.5.0.62, but the issue still occurs! I cannot get my cpu check to work as before, even after a server reboot!

I had that issue on the 2 servers where I did the migration from 0.5.0.62 to 0.5.1.46.

Any idea to fix my situation, at least to get 0.5.0.62 working again?

Expected Behavior

check_cpu should still work via the REST API after migrating from 0.5.0.62 to 0.5.1.46
check_cpu should work again after downgrading to 0.5.0.62

Actual Behavior

check_cpu via the REST API seems to be frozen to the first gathered values after migrating from 0.5.0.62 to 0.5.1.46
check_cpu via the REST API seems to keep the issue after downgrading to 0.5.0.62

Details

Additional Details

NSClient++ log: absolutely nothing in the log

Tontonitch commented 6 years ago

Tested the last version 0.5.2.20, frozen situation still there.

Tontonitch commented 6 years ago

Just to give more details with version 0.5.2.20:

mickem commented 6 years ago

Humm, I honestly did not even know that Icinga had a REST based check for CPU (before they used the client option)... So do you have any idea what it does?

The "legacy API" which I assume it is using (as the new one is not even complete yet) should not have changed at all. I know prior to 0.5.2.20 the authentication was broken, but that has been fixed.

And if something "stopped working" I assume it is related to the configuration. Could you paste relevant bits of the NSClient++ config (and validate that passwords and ports seem correct)

Tontonitch commented 6 years ago

Hello Michael,

>Humm, I honestly did not even know that Icinga had a REST based check for CPU (before they used the client option)... So do you have any idea what it does?

Icinga2 developers have been experimenting with this legacy API since the end of 2016, and it has been officially available since August of this year (with icinga2 2.7.0) through the check_nscp_api plugin, to deal with runtime metrics such as cpu usage and windows event logs.
https://www.icinga.com/2016/09/16/nsclient-0-5-0-rest-api-and-icinga-2-integration/
https://www.icinga.com/2017/07/05/monitoring-windows-clients-with-icinga-2-and-local-nsclient-checks/
https://www.icinga.com/2017/08/02/icinga-2-v2-7-0-released/

Icinga2 2.7.1 and then 2.8.0 produced and completed the related documentation: https://www.icinga.com/docs/icinga2/snapshot/doc/06-distributed-monitoring/#nsclient-with-check_nscp_api

This plugin is already listed in the nscp documentation; I guess it was a contribution to your doc by the Icinga2 core dev team. https://docs.nsclient.org/api/#integrations

>The "legacy API" which I assume it is using (as the new one is not even complete yet) should not have changed at all. I know prior to 0.5.2.20 the authentication was broken, but that has been fixed. And if something "stopped working" I assume it is related to the configuration. Could you paste relevant bits of the NSClient++ config (and validate that passwords and ports seem correct)

Password and port are correct. My configuration did not change. I use it with NSCP 0.5.0.62 on servers where I don't face any issue (other than some known issues like false-positive error messages, fixed in 5.1.x)

I will attach it asap.

But what really worries me is that, on a server where NSCP 0.5.0.62 works correctly, the issue starts to occur as soon as I install 0.5.1.46, and then I cannot roll back: even if I put the old version back, the issue persists. I really don't understand what's going on.

Tontonitch commented 6 years ago

nsclient.ini.txt

mickem commented 6 years ago

It is an official API, so it should work for sure... I was just unaware they used it :)

Are the passwords different?

(there are two in the file)

Tontonitch commented 6 years ago

Passwords are the same

Tontonitch commented 6 years ago

>It is an official API, so it should work for sure... I was just unaware they used it :)

I hope the information provided is useful to you. As I understand from the documentation, checks via the check_nscp_api plugin, i.e. querying the REST API (currently the legacy API), are the direction the Icinga2 dev team is taking, because this approach is more flexible and less limited than the "nscp client" command.

Maybe you will meet some Icinga2 developers during the coming OSMC. There is a pending task about the future NSCP version integrated with Icinga2 (github task https://github.com/Icinga/icinga2/issues/5633).

Tontonitch commented 6 years ago

Hello Michael,

Any news, at least about the frozen stats situation that appeared after the nscp upgrade?

While trying to find a good icinga2 client <-> nscp integration, I still face this frozen stats issue.

An example with check_nrpe requests: as you can see, after an nscp service restart it gives good and changing values, and after some seconds it freezes, always returning the same value.

[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666
I (0.5.1.46 2017-09-24) seem to be doing fine...
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
WARNING: 5s: 84%|'total 5m'=0%;80;90 'total 1m'=32%;80;90 'total 5s'=84%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
[root@monitorsrv1 plugins]# ./check_nrpe -H xxxxxxxxxxxxx -p 5666 -c check_cpu
OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90
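
The freeze visible in the output above can also be detected automatically by comparing the performance data of consecutive check results. The following is only a minimal sketch, not part of NSClient++ or Icinga2: the helper names `parse_perfdata` and `frozen_run_length` are hypothetical, and it assumes the plugin output format shown above (status text, then `|`, then items like `'total 5s'=69%;80;90`).

```python
import re

# Matches one perfdata item such as 'total 5s'=84%;80;90
# (format assumed from the check_nrpe output quoted above).
PERF_ITEM = re.compile(r"'([^']+)'=([0-9.]+)%?;[0-9.]*;[0-9.]*")


def parse_perfdata(line):
    """Return {label: value} from the part of an output line after '|'."""
    _, _, perf = line.partition("|")
    return {m.group(1): float(m.group(2)) for m in PERF_ITEM.finditer(perf)}


def frozen_run_length(lines):
    """Count how many of the most recent outputs carry identical perfdata."""
    samples = [parse_perfdata(line) for line in lines]
    if not samples:
        return 0
    run = 1
    # Walk backwards from the second-to-last sample until a value differs.
    for prev in reversed(samples[:-1]):
        if prev != samples[-1]:
            break
        run += 1
    return run


outputs = [
    "WARNING: 5s: 84%|'total 5m'=0%;80;90 'total 1m'=32%;80;90 'total 5s'=84%;80;90",
    "OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90",
    "OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90",
    "OK: CPU load is ok.|'total 5m'=0%;80;90 'total 1m'=39%;80;90 'total 5s'=69%;80;90",
]
print(frozen_run_length(outputs))  # 3
```

An idle machine can legitimately return the same values a few times in a row, so any alerting threshold on the run length would have to be chosen per environment; a 5-second average that stays at exactly 69% over many polls, as above, is what suggests the collector has stopped updating.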

What could I try to fix that?

BR, Yannick

Tontonitch commented 6 years ago

Hello Michael,

I have a working configuration now:

It seems that after

... the returned stats are ok.

I'm testing to see which change and/or action produced the issue.

Tontonitch commented 6 years ago

By the way, is there a way to keep real-time stats across NSCP restarts, to avoid the following drops in the cpu usage stats for example? image

Edit: opened a separate "issue" as it is not related to this issue (#555)

Tontonitch commented 6 years ago

Hello Michael,

I faced the issue again, even with NSCP 0.5.0.62 (bundled with Icinga2). Restarting the NSCP service fixed the issue, but for how long? What is bad is that, as the cpu check returned wrong values, we were not notified about an important issue.

image

BR, Yannick

Tontonitch commented 6 years ago

For your information, I had 10 servers impacted. Restarting the service fixed the issue for the moment.

mickem commented 6 years ago

Please open a ticket about keeping cpu stats across restarts.

As for the next work issue, I don't have access to the Icinga client myself, but I have fixed a REST issue in the next version, so please see if that resolves it here: https://github.com/mickem/nscp/releases/tag/0.5.3.3