Nagios XI service checks via rrd return zero values cluster disks

ALSICteam commented 4 years ago

Issue and Steps to Reproduce

Hi,

We are experiencing problems with performance counter checks via rrd config for average values (over 5 minutes / 10 minutes / 15 minutes). For instance, there's a Microsoft failover cluster (2 Windows servers) with the following config:

2 local disks
1 clustered disk that is only mounted on one server, but can fail over to the other server.

When configuring the performance checks via rrd config for average values, everything runs fine on the server with the clustered disk mounted. But on the server where the clustered disk is not mounted, the error below is thrown and all other service checks that involve performance counter averages return zero. I know it's normal that a performance counter throws an error about a disk that isn't found, but why should all the other performance counters also return zero?

Has anyone by any chance experienced the same issue and found a way to resolve it? Thanks in advance.

Nagios XI configuration example
Command = $USER1$/check_nrpe -2 -H $HOSTADDRESS$ -t 90 -c $ARG1$ $ARG2$
$ARG1$ = check_pdh
$ARG2$ = -a 'counter:N: % Write Time=PercentDiskNWriteTimeAvg' 'critical=value>100' 'perf-config=*(suffix:none)' 'time=5m'

Local server configuration in nsclient.ini
[/settings/system/windows/counters/PercentDiskNWriteTimeAvg]
; ---------------------------------------------
counter=\LogicalDisk(N:)\% Disk Write Time
collection strategy=rrd
buffer size=1h
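
To make the intent of the configuration above concrete, here is a minimal Python sketch of averaging over the most recent part of a fixed-size sample buffer, which is roughly what a 1h buffer queried with a 5m window amounts to. It is only an illustration: the `RollingCounter` class and the one-sample-per-second rate are assumptions, not NSClient++'s actual implementation.

```python
from collections import deque


class RollingCounter:
    """Fixed-size sample buffer (hypothetical; not NSClient++ code)."""

    def __init__(self, buffer_seconds, interval_seconds=1):
        # Assumption: one sample per interval; the real sampling rate may differ.
        self.interval = interval_seconds
        self.samples = deque(maxlen=buffer_seconds // interval_seconds)

    def add(self, value):
        # Once the buffer is full, the oldest sample is dropped automatically.
        self.samples.append(value)

    def average(self, window_seconds):
        # Average only the most recent samples that fall inside the window.
        count = max(1, window_seconds // self.interval)
        recent = list(self.samples)[-count:]
        if not recent:
            return None  # no data collected yet
        return sum(recent) / len(recent)


counter = RollingCounter(buffer_seconds=3600)  # like "buffer size=1h"
for _ in range(3600):
    counter.add(0.0)
for _ in range(300):
    counter.add(10.0)
five_minute_average = counter.average(300)  # like "time=5m"
ten_minute_average = counter.average(600)
```

The same buffer can serve the 5, 10 and 15 minute averages, since each query only changes how far back the window reaches.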

Expected Behavior

Process the other performance counters and results.

Actual Behavior

Error message 2020-01-10 09:16:07: error:c:\source\master\modules\CheckSystem\pdh_thread.cpp:247: Failed to query performance counters: PercentDiskNReadTimeAvg Failed to poll counter \LogicalDisk(N:)\% Disk Read Time: c0000bc6: The data is not valid.

Other performance counters and results are returning 0

Details

NSClient++ version: 0.5.0062
OS and Version: Windows Server 2016
Checking from: Nagios XI 5.5.7
Checking with: check_nrpe

Additional Details

NSClient++ log: 2020-01-10 09:16:07: error:c:\source\master\modules\CheckSystem\pdh_thread.cpp:247: Failed to query performance counters: PercentDiskNReadTimeAvg Failed to poll counter \LogicalDisk(N:)\% Disk Read Time: c0000bc6: The data is not valid.

mintsoft commented 4 years ago

@ALSICteam My go-to for any performance counter problem is to try the general counter fixes here first: https://docs.nsclient.org/faq/#13-failed-to-open-performance-counters

I know it's normal that a performance counter throws an error about a disk that isn't found, but why should all the other performance counters return zero also?

Many performance counters return 0 as a code if the performance counter doesn't actually exist; it's a 'feature' of performance counters, I think.

We had a similar problem and ended up using check_nsc_web with a script that detects which hard disks are attached and then uses WMI to output the data. I think these days (under 0.5.2.41, for example) you might be better off using CheckDisk rather than counters (https://docs.nsclient.org/reference/windows/CheckDisk/). That said, I've not actually tried that. YMMV
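
As a rough illustration of the "detect which disks are attached" step, the Python sketch below checks which drive letters are present and emits counter sections only for those. This is not the check_nsc_web/WMI script described above; the function names are hypothetical, and whether regenerating nsclient.ini on each cluster node actually avoids the zero values is an untested assumption.

```python
import os


def drive_is_mounted(letter):
    """Return True if the drive root (e.g. 'N:\\') exists on this machine."""
    return os.path.exists(f"{letter}:\\")


def counter_sections(letters, is_mounted=drive_is_mounted):
    """Build nsclient.ini counter sections only for drives that are present."""
    lines = []
    for letter in letters:
        if not is_mounted(letter):
            continue  # e.g. a clustered disk currently owned by the other node
        # Section layout copied from the configuration in the original report.
        name = f"PercentDisk{letter}WriteTimeAvg"
        lines += [
            f"[/settings/system/windows/counters/{name}]",
            f"counter=\\LogicalDisk({letter}:)\\% Disk Write Time",
            "collection strategy=rrd",
            "buffer size=1h",
            "",
        ]
    return "\n".join(lines)


# Example with a stand-in check: pretend only C: is mounted on this node.
example = counter_sections(["C", "N"], is_mounted=lambda letter: letter == "C")
```

The `is_mounted` parameter is there only so the logic can be tried without a real cluster; on a Windows node the default `drive_is_mounted` would be used, and the generated text would still have to be merged into nsclient.ini by hand or by another script.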

mickem commented 4 years ago

0.5.0 is a rather old version, so I would start by upgrading. But in general, if loading predefined counters fails, the checks will be disabled. You can work around this by not loading the counter in question and instead checking it on demand (i.e. with the command).

Please reopen this ticket if this does not resolve your issue.

ALSICteam commented 4 years ago

Hello,

Disabling it resolves the problem for the other performance monitor checks (zero values), but the performance counter checks in question are needed for failover clusters, so checking on demand is not really an option, seeing as you need to activate them in the nsclient.ini file for RRD usage.

mickem / nscp