mickem / nscp

NSClient++
http://nsclient.org
GNU General Public License v2.0
244 stars 94 forks source link

CPU Issues after upgrading NSClient++ from 0.4.1.73 to 0.4.4.19 #251

Closed synbit closed 7 years ago

synbit commented 8 years ago

Hello,

The following graph shows CPU usage on four IIS Servers. These servers are currently standby and are not serving any requests. I have not downgraded NSClient on these servers for debugging purposes.

04173_04419_upgrade_cpuusage

The graph below shows CPU usage on four IIS Servers which are actually "live" as they respond to requests. The CPU spikes were causing problems for the applications running on these servers and we had to downgrade to 0.4.1.73 unfortunately.

04173_04419_upgrade_cpuusage_live

I cannot be certain that this is related, but one of our performance counters (Bytes Jitted/sec), showed that something changed after the upgrade.

04173_04419_upgrade_bytesjitted

Just for reference, the check for the above counter is:

check_nt -H iis14 -p 12489 -v COUNTER -l "\\.NET CLR Jit(_Global_)\\IL Bytes Jitted / sec","IL Bytes Jitted/sec","Bytes/s"

I haven't got more data to provide at this stage, but let me know if there is something I can do to help.

Many thanks!

mickem commented 8 years ago

To diagnose I need:

Without it I cant do much, but a lot has happened between 0.4.1 and 0.4.4

Most issues I have heard about CPU spikes have been related to check_eventlog and large logs (which can be mitigated with various options).

// Michael Medin

synbit commented 8 years ago

Thanks for your reply,

I've increased the logging level to DEBUG and restarted the service on the four standby servers (these are the servers from the first "IIS Cluster CPU Usage" graph found on my first comment/report of the issue). Attached are the respective files.

In general we use either check_nt when there is a Windows performance counter, or check_nrpe combined with one of our own scripts. All of these should be defined in the attached configuration files.

If there is anything else I can do to help please let me know.

iis01.zip iis02.zip iis03.zip iis04.zip

synbit commented 8 years ago

Just an update,

DEBUG logging is enabled since yesterday; more than enough for all the checks to run multiple times against the servers. Attached are the log files for the respective servers.

The updated graphs: moredatafromstandbyiisservers

The RRD data are averaged out by definition and the graphs above do not accurately represent the CPU spikes. The following graph contains data which are still averaged out, but the sampling rate is slightly better (x axis is time, y axis is %CPU usage): bettergranularity

Thank you very much for your help! LogIIS01.zip LogIIS02.zip LogIIS03.zip LogIIS04.zip

mickem commented 8 years ago

Seems to be a lot of traffic on these machines... How many checks per second do they have?

mickem commented 8 years ago

For instance:

2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102

Seems you have almost 17 requests per second?

synbit commented 8 years ago

There are 40 checks in total and the minimum check interval is one minute. This is one check per 1.5 seconds. However, I am not sure what is the exact batch size that Nagios is using when scheduling these checks. My calculation above is only an average.

It is also important to note that no new checks have been added after the upgrade though. Let me know if you want me to remove a few checks and see if that makes any difference.

Thanks for your help.

mickem commented 8 years ago

I just noticed an issue where the webserver would fail if the port was in use and start a busy wait. Please check if 0.5.0.42 and see if that behaves better...

synbit commented 8 years ago

Thank you for the update. I'll install 0.5.0.42 on a couple of servers and I'll see if that's better.

synbit commented 8 years ago

Version 0.5.0.45 is installed on one IIS server for several days now. The graphs below show that the spikes are less frequent but still present and as severe as they were before. The rest of the IIS servers are on version 0.4.1.73.

nsclient_v0 5 0 45

This server had no added checks, however I'll be happy to enable debug logging and provide those logs if required.

synbit commented 8 years ago

Hello,

In case this helps with this issue, I installed 0.5.0.62 today in order to see if the issue went away in the latest version. Unfortunately this is still bad given that these boxes when under load will be using ~80% CPU (please refer to the graphs below). It is worth mentioning that the logs are clear (like before) and there are no extra checks added to this server since I tested 0.5.0.45. iis001_nscp_05062_cpu iis001_nscp_05062_cpu_clusterview Is it worth enabling debug logging again?

synbit commented 8 years ago

In order to see if the problem relates to the number of checks performed against a host, I removed all checks from iis001, leaving only the CPU to monitor. Still same behaviour as can be seen below.

iis001_cpu_onecheck

willemdh commented 8 years ago

Can you post your nsclient.ini? If I find some time I'll see if I can reproduce what you are seeing. Just one more question. Are you seeing this behaviour only on iis server or on all servers?

mintsoft commented 8 years ago

@willemdh it's on all IIS Servers when upgraded

mintsoft commented 8 years ago

Please see below for our nsclient.ini:

# If you want to fill this file with all avalible options run the following command:
# nscp settings --generate ini --add-defaults --load-all

; Undocumented section
[/modules]

; CheckDisk - CheckDisk can check various file and disk related things. The current version has commands to check Size of hard drives and directories.
CheckDisk = 1

; Event log Checker. - Check for errors and warnings in the event log. This is only supported through NRPE so if you plan to use only NSClient this wont help you at all.
CheckEventLog = 1

; Check External Scripts - A simple wrapper to run external scripts and batch files.
CheckExternalScripts = 1

; Helper function - Various helper function to extend other checks. This is also only supported through NRPE.
CheckHelpers = 1

; Check NSCP - Checkes the state of the agent
CheckNSCP = 1

; CheckSystem - Various system related checks, such as CPU load, process state, service state memory usage and PDH counters.
CheckSystem = 1

; CheckWMI - CheckWMI can check various file and disk related things. The current version has commands to check Size of hard drives and directories.
CheckWMI = 1

; NRPE server - A simple server that listens for incoming NRPE connection and handles them.
NRPEServer = 1

; NSClient server - A simple server that listens for incoming NSClient (check_nt) connection and handles them. Although NRPE is the preferred method NSClient is fully supported and can be used for simplicity or for compatibility.
NSClientServer = 1

; Undocumented section
[/settings/default]

; ALLOWED HOSTS - A comaseparated list of allowed hosts. You can use netmasks (/ syntax) or * to create ranges.
allowed hosts = xxx.xxx.xxx.xxx

; A list of aliases available. An alias is an internal command that has been "wrapped" (to add arguments). Be careful so you don't create loops (ie check_loop=check_a, check_a=check_loop)
[/settings/external scripts/alias]

; alias_cpu - Alias for alias_cpu. To configure this item add a section called: /settings/external scripts/alias/alias_cpu
alias_cpu = checkCPU warn=80 crit=90 time=5m time=1m time=30s

; alias_cpu_ex - Alias for alias_cpu_ex. To configure this item add a section called: /settings/external scripts/alias/alias_cpu_ex
alias_cpu_ex = checkCPU warn=$ARG1$ crit=$ARG2$ time=5m time=1m time=30s

; alias_disk - Alias for alias_disk. To configure this item add a section called: /settings/external scripts/alias/alias_disk
alias_disk = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll FilterType=FIXED

; alias_disk_loose - Alias for alias_disk_loose. To configure this item add a section called: /settings/external scripts/alias/alias_disk_loose
alias_disk_loose = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll FilterType=FIXED ignore-unreadable

; alias_event_log - Alias for alias_event_log. To configure this item add a section called: /settings/external scripts/alias/alias_event_log
alias_event_log = CheckEventLog file=application file=system MaxWarn=1 MaxCrit=1 "filter=generated gt -2d AND severity NOT IN ('success', 'informational') AND source != 'SideBySide'" truncate=800 unique descriptions "syntax=%severity%: %source%: %message% (%count%)"

; alias_file_age - Alias for alias_file_age. To configure this item add a section called: /settings/external scripts/alias/alias_file_age
alias_file_age = checkFile2 filter=out "file=$ARG1$" filter-written=>1d MaxWarn=1 MaxCrit=1 "syntax=%filename% %write%"

; alias_file_size - Alias for alias_file_size. To configure this item add a section called: /settings/external scripts/alias/alias_file_size
alias_file_size = CheckFiles "filter=size > $ARG2$" "path=$ARG1$" MaxWarn=1 MaxCrit=1 "syntax=%filename% %size%" max-dir-depth=10

; alias_mem - Alias for alias_mem. To configure this item add a section called: /settings/external scripts/alias/alias_mem
alias_mem = checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical type=virtual type=paged type=page

; alias_process - Alias for alias_process. To configure this item add a section called: /settings/external scripts/alias/alias_process
alias_process = checkProcState "$ARG1$=started"

; alias_process_count - Alias for alias_process_count. To configure this item add a section called: /settings/external scripts/alias/alias_process_count
alias_process_count = checkProcState MaxWarnCount=$ARG2$ MaxCritCount=$ARG3$ "$ARG1$=started"

; alias_process_hung - Alias for alias_process_hung. To configure this item add a section called: /settings/external scripts/alias/alias_process_hung
alias_process_hung = checkProcState MaxWarnCount=1 MaxCritCount=1 "$ARG1$=hung"

; alias_process_stopped - Alias for alias_process_stopped. To configure this item add a section called: /settings/external scripts/alias/alias_process_stopped
alias_process_stopped = checkProcState "$ARG1$=stopped"

; alias_sched_all - Alias for alias_sched_all. To configure this item add a section called: /settings/external scripts/alias/alias_sched_all
alias_sched_all = CheckTaskSched "filter=exit_code ne 0" "syntax=%title%: %exit_code%" warn=>0

; alias_sched_long - Alias for alias_sched_long. To configure this item add a section called: /settings/external scripts/alias/alias_sched_long
alias_sched_long = CheckTaskSched "filter=status = 'running' AND most_recent_run_time < -$ARG1$" "syntax=%title% (%most_recent_run_time%)" warn=>0

; alias_sched_task - Alias for alias_sched_task. To configure this item add a section called: /settings/external scripts/alias/alias_sched_task
alias_sched_task = CheckTaskSched "filter=title eq '$ARG1$' AND exit_code ne 0" "syntax=%title% (%most_recent_run_time%)" warn=>0

; alias_service - Alias for alias_service. To configure this item add a section called: /settings/external scripts/alias/alias_service
alias_service = checkServiceState CheckAll

; alias_service_ex - Alias for alias_service_ex. To configure this item add a section called: /settings/external scripts/alias/alias_service_ex
alias_service_ex = checkServiceState CheckAll "exclude=Net Driver HPZ12" "exclude=Pml Driver HPZ12" exclude=stisvc

; alias_up - Alias for alias_up. To configure this item add a section called: /settings/external scripts/alias/alias_up
alias_up = checkUpTime MinWarn=1d MinWarn=1h

; alias_updates - Alias for alias_updates. To configure this item add a section called: /settings/external scripts/alias/alias_updates
alias_updates = check_updates -warning 0 -critical 0

; alias_volumes - Alias for alias_volumes. To configure this item add a section called: /settings/external scripts/alias/alias_volumes
alias_volumes = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll=volumes FilterType=FIXED

; alias_volumes_loose - Alias for alias_volumes_loose. To configure this item add a section called: /settings/external scripts/alias/alias_volumes_loose
alias_volumes_loose = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll=volumes FilterType=FIXED ignore-unreadable

; check_ok - Alias for check_ok. To configure this item add a section called: /settings/external scripts/alias/check_ok
;check_ok = CheckOK Everything is fine!

; default - Alias for default. To configure this item add a section called: /settings/external scripts/alias/default
;default = 

; Section for NRPE (NRPEListener.dll) (check_nrpe) protocol options.
[/settings/NRPE/server]
allowed ciphers = ALL

; Section for NSClient (NSClientServer.dll) (check_nt) protocol options.
[/settings/NSClient/server]

; Section for system checks and system settings
[/settings/check/system/windows]

; Confiure which services has to be in which state
[/settings/check/system/windows/service mapping]

; Configure crash handling properties.
[/settings/crash]

; Section for the EventLog Checker (CheckEventLog.dll).
[/settings/eventlog]

; A set of options to configure the real time checks
[/settings/eventlog/real-time]

; A set of filters to use in real-time mode
[/settings/eventlog/real-time/filters]

; Section for external scripts configuration options (CheckExternalScripts).
[/settings/external scripts]
command timeout = 60

[/settings/external scripts/scripts]

; cw_check_app_pool_mem_allium - Alias for cw_check_app_pool_mem_allium. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_allium
cw_check_app_pool_mem_allium = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs allium

; cw_check_app_pool_mem_controlpanel - Alias for cw_check_app_pool_mem_controlpanel. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_controlpanel
cw_check_app_pool_mem_controlpanel = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs controlpanel

; cw_check_app_pool_mem_datamining - Alias for cw_check_app_pool_mem_datamining. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_datamining
cw_check_app_pool_mem_datamining = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs datamining

; cw_check_app_pool_mem_organami - Alias for cw_check_app_pool_mem_organami. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_organami
cw_check_app_pool_mem_organami = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs organami

; cw_check_app_pool_mem_paymentsearch - Alias for cw_check_app_pool_mem_paymentsearch. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_paymentsearch
cw_check_app_pool_mem_paymentsearch = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs paymentsearch

; cw_check_app_pool_mem_plutus - Alias for cw_check_app_pool_mem_plutus. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_plutus
cw_check_app_pool_mem_plutus = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs plutus

; cw_check_app_pool_mem_propono - Alias for cw_check_app_pool_mem_propono. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_propono
cw_check_app_pool_mem_propono = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs propono

; cw_check_app_pool_mem_vehicledata - Alias for cw_check_app_pool_mem_vehicledata. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_vehicledata
cw_check_app_pool_mem_vehicledata = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs vehicledata

cw_check_app_pool_mem_services=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs services.codeweavers.net
cw_check_app_pool_mem_ycmi2=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs ycmi2
cw_check_app_pool_mem_utilitas=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs utilitas
cw_check_app_pool_mem_orchestrator = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs orchestrator

cw_iis_app_pool_cpu=cmd /c echo scripts/cw_iis_app_pool_cpu_time.ps1; exit($lastexitcode) | powershell.exe -command -

cw_check_timewait_conn=cmd /c echo scripts/cw_check_timewait_conn.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_pending_requests=cmd /c echo scripts/check_iis_pending_request_count.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_webconfig_version=cmd /c echo scripts/cw_check_webconfig_versions.ps1; exit($lastexitcode) | powershell.exe -command -

cw_check_dotNet_jit_bytes_second=cmd /c powershell -command "`c:\Progra~1\NSClient++\scripts\cw_check_il_bytes_jitted.ps1 ; exit ($lastexitcode)"

; default - Alias for default. To configure this item add a section called: /settings/external scripts/scripts/default
;default = 

; A list of wrappped scripts (ie. using the template mechanism)
[/settings/external scripts/wrapped scripts]

; A list of templates for wrapped scripts
[/settings/external scripts/wrappings]

; Section for configuring the log handling.
[/settings/log]

; Configure log file properties.
[/settings/log/file]

; MAXIMUM FILE SIZE - When file size reaches this it will be truncated to 50% if set to 0 (default) truncation will be disabled
max size = 10485760

; Section for configuring the shared session.
[/settings/shared session]

; A list of avalible remote target systems
[/settings/targets]
willemdh commented 8 years ago

Please respond to my question @mintsoft

Are you seeing this behaviour only on iis server or on all servers?

mintsoft commented 8 years ago

@willemdh https://github.com/mickem/nscp/issues/251#issuecomment-252833746 👍

willemdh commented 8 years ago

So on non-iis servers, eg a fileserver or so, you don't have this problem?

synbit commented 8 years ago

@willemdh No, we started the upgrade from our IIS servers. I'll try another 2012 R2 and I'll let you know. Are you suspecting that this is an IIS-only issue?

willemdh commented 8 years ago

I'm not suspecting anything.. But if I want to try reproduce, I would need to know..

synbit commented 8 years ago

That's fine; I'll update you as soon as I've collected enough data.

In the meantime, if you want to, you could try reproducing the problem on an IIS server (still Windows Server 2012 R2). Install 0.4.1.73 first, leave it for a day or so, then upgrade to 0.5.0.62 or something. Then have a look at the graph (assuming that you were plotting the data) and see if you get similar behaviour with CPU spikes etc.

Thanks for your help!

willemdh commented 8 years ago

Fyi, I'm running 0.4.1.105 on most servers. So I'll show the difference between 0.4.1.105 and the latest 0.5.x. (Could take some time)

synbit commented 8 years ago

No problem, collecting data takes time usually. I haven't tried 0.4.0.105 but as long as you don't get CPU spikes already, it's still a valid experiment :)

synbit commented 8 years ago

Hello, here is another update...

So, I installed 0.5.0.62 on a non-IIS server and as seen in the graph below, the problem is still there: ad_cpu_ncsp_0 5 0 62 And for comparison purposes, here is the CPU graph of the same server just before the upgrade: ad_cpu_nscp-0 4 1 73

@willemdh Have you managed to reproduce it?

mickem commented 8 years ago

Is the server processing checks or is nsclient++ idling?

To diagnose this issue we need to understand what is infact causing the CPU issues, from 0.4.x pretty much everything has changed so it is impossible to diagnose...

So please start by having a clean install of NSClient++ without any checks and see if the issue still persists. If tat is the case please do let me know the EMPTY config and some what is installed on the machine so I can see if I can reproduce this.

if the error is not reproducable with a clean NSClient++ we need to figure out what is causing the issue, which requires adding one check at a time...

synbit commented 8 years ago

We have installed Windows Server 2012 R2 on a single core VM. We did not give any roles to the server; we only installed NSClient. The steps/results are shown below.

  1. Fresh install of Windows Server 2012 R2 (NSCP is not installed at this point): singlecoretestserver_nonscp
  2. Install NSCP 0.4.1.073: singlecoretestserver_nscp_0 4 1 073
  3. Upgrade NSCP from 0.4.1.073 to 0.5.0.62: singlecoretestserver_nscp_0 5 0 62

I appreciate that the time window shown on each graph is not long enough, however we observed the server for more than 10 minutes and as far as I can tell this graph is representative of what's going on.

Regarding the logs, there is only one line throughout this test, which I regard irrelevant, but I might be wrong, so here it is:

2016-10-26 10:50:02: error:c:\source\master\modules\NRPEServer\NRPEServer.cpp:123: CA not found: C:\Program Files\NSClient++/security/ca.pem (generating a default CA)

In case this is not visible, I can confirm that NSCP is idling on CPU as @mickem said. The data I have provided above have been generated using the stock configuration file and with no checks added at all.

Please let us know if there is anything else we can do to help.

synbit commented 8 years ago

We have gone through the modules in the default configuration on this test server and we switched them off one at a time. We found that by disabling the CheckSystem module (that is setting CheckSystem = 0) the CPU spikes disappeared.

Any thoughts?

synbit commented 8 years ago

In an attempt to identify the version of NSClient which appears to have different behaviour, we did the following things (on a live server, as we have found that the issue is more obvious when there is load):

  1. Upgrade form 0.4.1.73 to 0.4.2.114
  2. Upgrade from 0.4.2.114 to 0.4.3.143
  3. Downgrade from 0.4.3.143 to 0.4.1.73

The following graphs suggest that the different behaviour was probably introduced in 0.4.2.x. iiscpu And this is a cluster-wide graph which may help identify what is regarded as normal CPU usage: iisclustercpu

synbit commented 8 years ago

Hello again,

In order to rule out the possibility of check_cpu reporting the wrong data (something that we noticed it has happened in previous versions), we added the following CPU check which comes from Windows Performance Counters (our assumption here is that if there is something that accurately represents the CPU usage, that would be the Windows Counters): check_nt -H iis101 -p 12489 -v COUNTER -l "\\Processor(_Total)\\% Processor Time","Utilised","%" The above check does not calculate the average; it returns the CPU usage at the time the check was executed. Both checks run against the server every minute (that is, the check interval is the same). We collected some data for a while and we present the data below.

IIS101 was upgraded from 0.4.1.73 to 0.4.4.23 and the behaviour is noticeable (see CPU spikes; more interestingly see how CPU drops under 5% occasionally. The accuracy here is questionable given that these are live boxes: iis101clustercpuafterupgradingto_0 4 4 23 Same server (IIS101), CPU Load check: iis101cpuloadafterupgradingto_0 4 4 23 Same server (IIS101), Windows Performance Counter (CPU): iis101windowsperfcountercpuafterupgradingto_0 4 4 23 Going back to the cluster view, after downgrading nscp on IIS101 from 0.4.4.23 to 0.4.1.73: iis101clustercpuafterdowngradingto_0 4 1 73 Same server (IIS101), check CPU Load, after downgrading to 0.4.1.73: iis101cpuloadafterdowngradingto_0 4 1 73 Same server (IIS101), Windows Performance Counter (CPU), after downgrading to 0.4.1.73: iis101windowsperfcountercpuafterdowngradingto_0 4 1 73

We are running out of ideas here; what else can we do to prove different behaviour?

synbit commented 8 years ago

Sorry for being so noisy; we think the reason for the CPU spikes in greater versions than 0.4.1.73 is because the unit in the following command:

./check_nt -H 10.2.1.203 -p 12489 -v CPULOAD -l 1,85,95

has changed despite the output which suggests that number 1, after -l is in minutes. This appears to be in seconds. The check above gave us the following output, on a dual-core server:

CPU Load 50% (1 min average) | '1 min avg Load'=50%;85;95;0;100

This was immediately after we run the following PowerShell:

While($True) {}

This suggests that this is not a 1min average. In order to get 1min average, this is the correct command:

./check_nt -H 10.2.1.203 -p 12489 -v CPULOAD -l 60,85,95

And again, the output is misleading as it suggests that the results are averaged out over 60min instead of 60sec:

CPU Load 4% (60 min average) | '60 min avg Load'=4%;85;95;0;100

The equivalent nrpe command (see below) seems to be working fine; presumably the problem is with check_nt and CPULOAD.

./check_nrpe -H 10.2.1.203 -c alias_cpu

Can anyone confirm that please?

mickem commented 7 years ago

Please see if this version resolves the issue: https://github.com/mickem/nscp/releases/tag/0.5.1.22