Closed synbit closed 7 years ago
To diagnose I need:
Without it I cant do much, but a lot has happened between 0.4.1 and 0.4.4
Most issues I have heard about CPU spikes have been related to check_eventlog and large logs (which can be mitigated with various options).
// Michael Medin
Thanks for your reply,
I've increased the logging level to DEBUG and restarted the service on the four standby servers (these are the servers from the first "IIS Cluster CPU Usage" graph found on my first comment/report of the issue). Attached are the respective files.
In general we use either check_nt
when there is a Windows performance counter, or check_nrpe
combined with one of our own scripts. All of these should be defined in the attached configuration files.
If there is anything else I can do to help please let me know.
Just an update,
DEBUG logging is enabled since yesterday; more than enough for all the checks to run multiple times against the servers. Attached are the log files for the respective servers.
The updated graphs:
The RRD data are averaged out by definition and the graphs above do not accurately represent the CPU spikes. The following graph contains data which are still averaged out, but the sampling rate is slightly better (x
axis is time, y
axis is %CPU
usage):
Thank you very much for your help! LogIIS01.zip LogIIS02.zip LogIIS03.zip LogIIS04.zip
Seems to be a lot of traffic on these machines... How many checks per second do they have?
For instance:
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:16: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
2016-02-24 14:55:17: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: 10.1.1.102
Seems you have almost 17 requests per second?
There are 40 checks in total and the minimum check interval is one minute. This is one check per 1.5 seconds. However, I am not sure what is the exact batch size that Nagios is using when scheduling these checks. My calculation above is only an average.
It is also important to note that no new checks have been added after the upgrade though. Let me know if you want me to remove a few checks and see if that makes any difference.
Thanks for your help.
I just noticed an issue where the webserver would fail if the port was in use and start a busy wait. Please check if 0.5.0.42 and see if that behaves better...
Thank you for the update. I'll install 0.5.0.42 on a couple of servers and I'll see if that's better.
Version 0.5.0.45 is installed on one IIS server for several days now. The graphs below show that the spikes are less frequent but still present and as severe as they were before. The rest of the IIS servers are on version 0.4.1.73.
This server had no added checks, however I'll be happy to enable debug logging and provide those logs if required.
Hello,
In case this helps with this issue, I installed 0.5.0.62 today in order to see if the issue went away in the latest version. Unfortunately this is still bad given that these boxes when under load will be using ~80% CPU (please refer to the graphs below). It is worth mentioning that the logs are clear (like before) and there are no extra checks added to this server since I tested 0.5.0.45. Is it worth enabling debug logging again?
In order to see if the problem relates to the number of checks performed against a host, I removed all checks from iis001, leaving only the CPU to monitor. Still same behaviour as can be seen below.
Can you post your nsclient.ini? If I find some time I'll see if I can reproduce what you are seeing. Just one more question. Are you seeing this behaviour only on iis server or on all servers?
@willemdh it's on all IIS Servers when upgraded
Please see below for our nsclient.ini:
# If you want to fill this file with all avalible options run the following command:
# nscp settings --generate ini --add-defaults --load-all
; Undocumented section
[/modules]
; CheckDisk - CheckDisk can check various file and disk related things. The current version has commands to check Size of hard drives and directories.
CheckDisk = 1
; Event log Checker. - Check for errors and warnings in the event log. This is only supported through NRPE so if you plan to use only NSClient this wont help you at all.
CheckEventLog = 1
; Check External Scripts - A simple wrapper to run external scripts and batch files.
CheckExternalScripts = 1
; Helper function - Various helper function to extend other checks. This is also only supported through NRPE.
CheckHelpers = 1
; Check NSCP - Checkes the state of the agent
CheckNSCP = 1
; CheckSystem - Various system related checks, such as CPU load, process state, service state memory usage and PDH counters.
CheckSystem = 1
; CheckWMI - CheckWMI can check various file and disk related things. The current version has commands to check Size of hard drives and directories.
CheckWMI = 1
; NRPE server - A simple server that listens for incoming NRPE connection and handles them.
NRPEServer = 1
; NSClient server - A simple server that listens for incoming NSClient (check_nt) connection and handles them. Although NRPE is the preferred method NSClient is fully supported and can be used for simplicity or for compatibility.
NSClientServer = 1
; Undocumented section
[/settings/default]
; ALLOWED HOSTS - A comaseparated list of allowed hosts. You can use netmasks (/ syntax) or * to create ranges.
allowed hosts = xxx.xxx.xxx.xxx
; A list of aliases available. An alias is an internal command that has been "wrapped" (to add arguments). Be careful so you don't create loops (ie check_loop=check_a, check_a=check_loop)
[/settings/external scripts/alias]
; alias_cpu - Alias for alias_cpu. To configure this item add a section called: /settings/external scripts/alias/alias_cpu
alias_cpu = checkCPU warn=80 crit=90 time=5m time=1m time=30s
; alias_cpu_ex - Alias for alias_cpu_ex. To configure this item add a section called: /settings/external scripts/alias/alias_cpu_ex
alias_cpu_ex = checkCPU warn=$ARG1$ crit=$ARG2$ time=5m time=1m time=30s
; alias_disk - Alias for alias_disk. To configure this item add a section called: /settings/external scripts/alias/alias_disk
alias_disk = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll FilterType=FIXED
; alias_disk_loose - Alias for alias_disk_loose. To configure this item add a section called: /settings/external scripts/alias/alias_disk_loose
alias_disk_loose = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll FilterType=FIXED ignore-unreadable
; alias_event_log - Alias for alias_event_log. To configure this item add a section called: /settings/external scripts/alias/alias_event_log
alias_event_log = CheckEventLog file=application file=system MaxWarn=1 MaxCrit=1 "filter=generated gt -2d AND severity NOT IN ('success', 'informational') AND source != 'SideBySide'" truncate=800 unique descriptions "syntax=%severity%: %source%: %message% (%count%)"
; alias_file_age - Alias for alias_file_age. To configure this item add a section called: /settings/external scripts/alias/alias_file_age
alias_file_age = checkFile2 filter=out "file=$ARG1$" filter-written=>1d MaxWarn=1 MaxCrit=1 "syntax=%filename% %write%"
; alias_file_size - Alias for alias_file_size. To configure this item add a section called: /settings/external scripts/alias/alias_file_size
alias_file_size = CheckFiles "filter=size > $ARG2$" "path=$ARG1$" MaxWarn=1 MaxCrit=1 "syntax=%filename% %size%" max-dir-depth=10
; alias_mem - Alias for alias_mem. To configure this item add a section called: /settings/external scripts/alias/alias_mem
alias_mem = checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical type=virtual type=paged type=page
; alias_process - Alias for alias_process. To configure this item add a section called: /settings/external scripts/alias/alias_process
alias_process = checkProcState "$ARG1$=started"
; alias_process_count - Alias for alias_process_count. To configure this item add a section called: /settings/external scripts/alias/alias_process_count
alias_process_count = checkProcState MaxWarnCount=$ARG2$ MaxCritCount=$ARG3$ "$ARG1$=started"
; alias_process_hung - Alias for alias_process_hung. To configure this item add a section called: /settings/external scripts/alias/alias_process_hung
alias_process_hung = checkProcState MaxWarnCount=1 MaxCritCount=1 "$ARG1$=hung"
; alias_process_stopped - Alias for alias_process_stopped. To configure this item add a section called: /settings/external scripts/alias/alias_process_stopped
alias_process_stopped = checkProcState "$ARG1$=stopped"
; alias_sched_all - Alias for alias_sched_all. To configure this item add a section called: /settings/external scripts/alias/alias_sched_all
alias_sched_all = CheckTaskSched "filter=exit_code ne 0" "syntax=%title%: %exit_code%" warn=>0
; alias_sched_long - Alias for alias_sched_long. To configure this item add a section called: /settings/external scripts/alias/alias_sched_long
alias_sched_long = CheckTaskSched "filter=status = 'running' AND most_recent_run_time < -$ARG1$" "syntax=%title% (%most_recent_run_time%)" warn=>0
; alias_sched_task - Alias for alias_sched_task. To configure this item add a section called: /settings/external scripts/alias/alias_sched_task
alias_sched_task = CheckTaskSched "filter=title eq '$ARG1$' AND exit_code ne 0" "syntax=%title% (%most_recent_run_time%)" warn=>0
; alias_service - Alias for alias_service. To configure this item add a section called: /settings/external scripts/alias/alias_service
alias_service = checkServiceState CheckAll
; alias_service_ex - Alias for alias_service_ex. To configure this item add a section called: /settings/external scripts/alias/alias_service_ex
alias_service_ex = checkServiceState CheckAll "exclude=Net Driver HPZ12" "exclude=Pml Driver HPZ12" exclude=stisvc
; alias_up - Alias for alias_up. To configure this item add a section called: /settings/external scripts/alias/alias_up
alias_up = checkUpTime MinWarn=1d MinWarn=1h
; alias_updates - Alias for alias_updates. To configure this item add a section called: /settings/external scripts/alias/alias_updates
alias_updates = check_updates -warning 0 -critical 0
; alias_volumes - Alias for alias_volumes. To configure this item add a section called: /settings/external scripts/alias/alias_volumes
alias_volumes = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll=volumes FilterType=FIXED
; alias_volumes_loose - Alias for alias_volumes_loose. To configure this item add a section called: /settings/external scripts/alias/alias_volumes_loose
alias_volumes_loose = CheckDriveSize MinWarn=10% MinCrit=5% CheckAll=volumes FilterType=FIXED ignore-unreadable
; check_ok - Alias for check_ok. To configure this item add a section called: /settings/external scripts/alias/check_ok
;check_ok = CheckOK Everything is fine!
; default - Alias for default. To configure this item add a section called: /settings/external scripts/alias/default
;default =
; Section for NRPE (NRPEListener.dll) (check_nrpe) protocol options.
[/settings/NRPE/server]
allowed ciphers = ALL
; Section for NSClient (NSClientServer.dll) (check_nt) protocol options.
[/settings/NSClient/server]
; Section for system checks and system settings
[/settings/check/system/windows]
; Confiure which services has to be in which state
[/settings/check/system/windows/service mapping]
; Configure crash handling properties.
[/settings/crash]
; Section for the EventLog Checker (CheckEventLog.dll).
[/settings/eventlog]
; A set of options to configure the real time checks
[/settings/eventlog/real-time]
; A set of filters to use in real-time mode
[/settings/eventlog/real-time/filters]
; Section for external scripts configuration options (CheckExternalScripts).
[/settings/external scripts]
command timeout = 60
[/settings/external scripts/scripts]
; cw_check_app_pool_mem_allium - Alias for cw_check_app_pool_mem_allium. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_allium
cw_check_app_pool_mem_allium = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs allium
; cw_check_app_pool_mem_controlpanel - Alias for cw_check_app_pool_mem_controlpanel. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_controlpanel
cw_check_app_pool_mem_controlpanel = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs controlpanel
; cw_check_app_pool_mem_datamining - Alias for cw_check_app_pool_mem_datamining. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_datamining
cw_check_app_pool_mem_datamining = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs datamining
; cw_check_app_pool_mem_organami - Alias for cw_check_app_pool_mem_organami. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_organami
cw_check_app_pool_mem_organami = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs organami
; cw_check_app_pool_mem_paymentsearch - Alias for cw_check_app_pool_mem_paymentsearch. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_paymentsearch
cw_check_app_pool_mem_paymentsearch = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs paymentsearch
; cw_check_app_pool_mem_plutus - Alias for cw_check_app_pool_mem_plutus. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_plutus
cw_check_app_pool_mem_plutus = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs plutus
; cw_check_app_pool_mem_propono - Alias for cw_check_app_pool_mem_propono. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_propono
cw_check_app_pool_mem_propono = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs propono
; cw_check_app_pool_mem_vehicledata - Alias for cw_check_app_pool_mem_vehicledata. To configure this item add a section called: /settings/external scripts/scripts/cw_check_app_pool_mem_vehicledata
cw_check_app_pool_mem_vehicledata = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs vehicledata
cw_check_app_pool_mem_services=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs services.codeweavers.net
cw_check_app_pool_mem_ycmi2=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs ycmi2
cw_check_app_pool_mem_utilitas=cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs utilitas
cw_check_app_pool_mem_orchestrator = cscript.exe /NoLogo /T:10 scripts\\check_iis_app_pool_private_bytes.vbs orchestrator
cw_iis_app_pool_cpu=cmd /c echo scripts/cw_iis_app_pool_cpu_time.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_timewait_conn=cmd /c echo scripts/cw_check_timewait_conn.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_pending_requests=cmd /c echo scripts/check_iis_pending_request_count.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_webconfig_version=cmd /c echo scripts/cw_check_webconfig_versions.ps1; exit($lastexitcode) | powershell.exe -command -
cw_check_dotNet_jit_bytes_second=cmd /c powershell -command "`c:\Progra~1\NSClient++\scripts\cw_check_il_bytes_jitted.ps1 ; exit ($lastexitcode)"
; default - Alias for default. To configure this item add a section called: /settings/external scripts/scripts/default
;default =
; A list of wrappped scripts (ie. using the template mechanism)
[/settings/external scripts/wrapped scripts]
; A list of templates for wrapped scripts
[/settings/external scripts/wrappings]
; Section for configuring the log handling.
[/settings/log]
; Configure log file properties.
[/settings/log/file]
; MAXIMUM FILE SIZE - When file size reaches this it will be truncated to 50% if set to 0 (default) truncation will be disabled
max size = 10485760
; Section for configuring the shared session.
[/settings/shared session]
; A list of avalible remote target systems
[/settings/targets]
Please respond to my question @mintsoft
Are you seeing this behaviour only on iis server or on all servers?
So on non-iis servers, eg a fileserver or so, you don't have this problem?
@willemdh No, we started the upgrade from our IIS servers. I'll try another 2012 R2 and I'll let you know. Are you suspecting that this is an IIS-only issue?
I'm not suspecting anything.. But if I want to try reproduce, I would need to know..
That's fine; I'll update you as soon as I've collected enough data.
In the meantime, if you want to, you could try reproducing the problem on an IIS server (still Windows Server 2012 R2). Install 0.4.1.73 first, leave it for a day or so, then upgrade to 0.5.0.62 or something. Then have a look at the graph (assuming that you were plotting the data) and see if you get similar behaviour with CPU spikes etc.
Thanks for your help!
Fyi, I'm running 0.4.1.105 on most servers. So I'll show the difference between 0.4.1.105 and the latest 0.5.x. (Could take some time)
No problem, collecting data takes time usually. I haven't tried 0.4.0.105 but as long as you don't get CPU spikes already, it's still a valid experiment :)
Hello, here is another update...
So, I installed 0.5.0.62 on a non-IIS server and as seen in the graph below, the problem is still there: And for comparison purposes, here is the CPU graph of the same server just before the upgrade:
@willemdh Have you managed to reproduce it?
Is the server processing checks or is nsclient++ idling?
To diagnose this issue we need to understand what is infact causing the CPU issues, from 0.4.x pretty much everything has changed so it is impossible to diagnose...
So please start by having a clean install of NSClient++ without any checks and see if the issue still persists. If tat is the case please do let me know the EMPTY config and some what is installed on the machine so I can see if I can reproduce this.
if the error is not reproducable with a clean NSClient++ we need to figure out what is causing the issue, which requires adding one check at a time...
We have installed Windows Server 2012 R2 on a single core VM. We did not give any roles to the server; we only installed NSClient. The steps/results are shown below.
I appreciate that the time window shown on each graph is not long enough, however we observed the server for more than 10 minutes and as far as I can tell this graph is representative of what's going on.
Regarding the logs, there is only one line throughout this test, which I regard irrelevant, but I might be wrong, so here it is:
2016-10-26 10:50:02: error:c:\source\master\modules\NRPEServer\NRPEServer.cpp:123: CA not found: C:\Program Files\NSClient++/security/ca.pem (generating a default CA)
In case this is not visible, I can confirm that NSCP is idling on CPU as @mickem said. The data I have provided above have been generated using the stock configuration file and with no checks added at all.
Please let us know if there is anything else we can do to help.
We have gone through the modules in the default configuration on this test server and we switched them off one at a time. We found that by disabling the CheckSystem
module (that is setting CheckSystem = 0
) the CPU spikes disappeared.
Any thoughts?
In an attempt to identify the version of NSClient which appears to have different behaviour, we did the following things (on a live server, as we have found that the issue is more obvious when there is load):
0.4.1.73
to 0.4.2.114
0.4.2.114
to 0.4.3.143
0.4.3.143
to 0.4.1.73
The following graphs suggest that the different behaviour was probably introduced in 0.4.2.x
.
And this is a cluster-wide graph which may help identify what is regarded as normal CPU usage:
Hello again,
In order to rule out the possibility of check_cpu
reporting the wrong data (something that we noticed it has happened in previous versions), we added the following CPU check which comes from Windows Performance Counters (our assumption here is that if there is something that accurately represents the CPU usage, that would be the Windows Counters):
check_nt -H iis101 -p 12489 -v COUNTER -l "\\Processor(_Total)\\% Processor Time","Utilised","%"
The above check does not calculate the average; it returns the CPU usage at the time the check was executed. Both checks run against the server every minute (that is, the check interval is the same). We collected some data for a while and we present the data below.
IIS101
was upgraded from 0.4.1.73 to 0.4.4.23 and the behaviour is noticeable (see CPU spikes; more interestingly see how CPU drops under 5% occasionally. The accuracy here is questionable given that these are live boxes:
Same server (IIS101
), CPU Load check:
Same server (IIS101
), Windows Performance Counter (CPU):
Going back to the cluster view, after downgrading nscp
on IIS101
from 0.4.4.23
to 0.4.1.73
:
Same server (IIS101
), check CPU Load, after downgrading to 0.4.1.73
:
Same server (IIS101
), Windows Performance Counter (CPU), after downgrading to 0.4.1.73
:
We are running out of ideas here; what else can we do to prove different behaviour?
Sorry for being so noisy; we think the reason for the CPU spikes in greater versions than 0.4.1.73
is because the unit in the following command:
./check_nt -H 10.2.1.203 -p 12489 -v CPULOAD -l 1,85,95
has changed despite the output which suggests that number 1
, after -l
is in minutes. This appears to be in seconds. The check above gave us the following output, on a dual-core server:
CPU Load 50% (1 min average) | '1 min avg Load'=50%;85;95;0;100
This was immediately after we run the following PowerShell:
While($True) {}
This suggests that this is not a 1min
average. In order to get 1min
average, this is the correct command:
./check_nt -H 10.2.1.203 -p 12489 -v CPULOAD -l 60,85,95
And again, the output is misleading as it suggests that the results are averaged out over 60min
instead of 60sec
:
CPU Load 4% (60 min average) | '60 min avg Load'=4%;85;95;0;100
The equivalent nrpe
command (see below) seems to be working fine; presumably the problem is with check_nt
and CPULOAD
.
./check_nrpe -H 10.2.1.203 -c alias_cpu
Can anyone confirm that please?
Please see if this version resolves the issue: https://github.com/mickem/nscp/releases/tag/0.5.1.22
Hello,
Windows Server 2012 R2 Standard
0.4.4.19
The following graph shows CPU usage on four IIS Servers. These servers are currently standby and are not serving any requests. I have not downgraded NSClient on these servers for debugging purposes.
The graph below shows CPU usage on four IIS Servers which are actually "live" as they respond to requests. The CPU spikes were causing problems for the applications running on these servers and we had to downgrade to 0.4.1.73 unfortunately.
I cannot be certain that this is related, but one of our performance counters (Bytes Jitted/sec), showed that something changed after the upgrade.
Just for reference, the check for the above counter is:
check_nt -H iis14 -p 12489 -v COUNTER -l "\\.NET CLR Jit(_Global_)\\IL Bytes Jitted / sec","IL Bytes Jitted/sec","Bytes/s"
I haven't got more data to provide at this stage, but let me know if there is something I can do to help.
Many thanks!