nickfloyd / newrelic-perfmon-plugin

The Perfmon Plugin for the New Relic Plugins - https://newrelic.com/plugins
MIT License

Typeperf Results in Incorrect Counter Values #11

Open dmarchelya opened 10 years ago

dmarchelya commented 10 years ago

Under certain conditions, the plugin reports incorrect values for a performance counter, or for other, unrelated performance counters.

For instance, including the following counter typically causes incorrect values to be reported for other counters:

\HTTP Service Request Queues(*)\MaxQueueItemAge

On multiple machines tested, each with multiple IIS sites configured, typeperf returns a different number of counter names than counter values in its results.

Here is an example of the output from this typeperf command:

typeperf "\HTTP Service Request Queues(*)\MaxQueueItemAge"

"(PDH-CSV 4.0)","\\ABCD1234\HTTP Service Request Queues(???3)\MaxQueueItemAge","
\\ABCD1234\HTTP Service Request Queues(???2)\MaxQueueItemAge","\\ABCD1234\HTTP S
ervice Request Queues(???1)\MaxQueueItemAge"
"10/25/2013 11:00:39.984","-1","-1","-1","-1","-1","-1","-1","-1","-1","-1","-1"
,"-1","-1","-1","-1","-1","-1","-1","-1","0.000000","0.000000","0.000000"

typeperf omits the names of the counters whose values are -1, creating a name/value mismatch. The actual values for the three named instances should be 0, not -1.

Running the Perfmon GUI to verify the results shows that there really are additional instance counters for each website, but that they are not reporting any performance counter data.

At a minimum, this causes the HTTP Service Request Queues instance values to be reported as -1 when they are in fact 0. Worse, when multiple counters are collected on the same thread, the other counters on that thread can also receive incorrect values, making the data unreliable.

For instance, start by setting the number of threads for the plugin to 1 in the perfmon_metrics.rb file (more than 1 thread obscures the issue, but it is still present).

typeperf "\HTTP Service Request Queues(*)\MaxQueueItemAge" "\Processor(0)\% Processor Time" -sc 1

"(PDH-CSV 4.0)","\\ABCD1234\HTTP Service Request Queues(???3)\MaxQueueItemAge","
\\ABCD1234\HTTP Service Request Queues(???2)\MaxQueueItemAge","\\ABCD1234\HTTP S
ervice Request Queues(???1)\MaxQueueItemAge","\\ABCD1234\Processor(0)\% Processo
r Time"
"10/25/2013 11:25:29.601","-1","-1","-1","-1","-1","-1","-1","-1","-1","-1","-1"
,"-1","-1","-1","-1","-1","-1","-1","-1","0.000000","0.000000","0.000000","3.664
721"

Because the plugin maps names and values based on index, the values for all of the counters are reported as -1, which is not correct for any of the reported counters.
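
The effect of index-based mapping can be reproduced with the output above. The following is a minimal Ruby sketch, not the plugin's actual code: split_fields is a hypothetical helper, and the sketch assumes that no field contains an embedded double quote. The data row is rebuilt from the sample (19 values of -1, three of 0.000000, then the processor value).

```ruby
# Hypothetical helper: split one typeperf CSV line into its quoted fields.
# Assumes no field contains an embedded double quote.
def split_fields(line)
  line.scan(/"([^"]*)"/).flatten
end

# Header line from the report, unwrapped onto a single line.
header = <<~'CSV'.strip
  "(PDH-CSV 4.0)","\\ABCD1234\HTTP Service Request Queues(???3)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???2)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???1)\MaxQueueItemAge","\\ABCD1234\Processor(0)\% Processor Time"
CSV

# Data row from the report: timestamp, 19 "-1" values, then the real values.
row = (['10/25/2013 11:25:29.601'] + ['-1'] * 19 + ['0.000000'] * 3 + ['3.664721'])
      .map { |v| %("#{v}") }.join(',')

names  = split_fields(header).drop(1) # drop the "(PDH-CSV 4.0)" column
values = split_fields(row).drop(1)    # drop the timestamp column

# Naive pairing by position: the 4 names consume the first 4 values.
paired = names.zip(values).to_h

paired.each { |name, value| puts "#{name} => #{value}" }
```

With 4 names and 23 values, the first four values are all -1, so every name is paired with -1, including the Processor counter whose real value sits in the last column.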

The current workaround is either to avoid specifying all instances (*) for HTTP Service Request Queues, or to remove the MaxQueueItemAge counter entirely. It also means that, for the data to be trusted, every desired performance counter must first be verified with typeperf to return a matching number of names and values.

Worse still, a null or -1 value is intermittently reported for other counters that are typically reliable, making it difficult to tell whether a change in a performance counter is genuine or the result of another counter misreporting.
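
The name/value count check described here could be automated along the following lines. This is only a sketch under assumptions: field_count and counts_match? are hypothetical helpers, the header and data lines are taken from the first sample output in this report, and fields are assumed not to contain embedded double quotes.

```ruby
# Hypothetical helper: count the quoted fields in one typeperf CSV line.
# Assumes no field contains an embedded double quote.
def field_count(line)
  line.scan(/"[^"]*"/).length
end

# True when the header and a data line have the same number of columns.
def counts_match?(header_line, data_line)
  field_count(header_line) == field_count(data_line)
end

# Header line from the first sample, unwrapped onto a single line.
header = <<~'CSV'.strip
  "(PDH-CSV 4.0)","\\ABCD1234\HTTP Service Request Queues(???3)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???2)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???1)\MaxQueueItemAge"
CSV

# Data line from the first sample: timestamp, 19 "-1" values, 3 zeros.
data = (['10/25/2013 11:00:39.984'] + ['-1'] * 19 + ['0.000000'] * 3)
       .map { |v| %("#{v}") }.join(',')

unless counts_match?(header, data)
  warn "Mismatch: #{field_count(header)} names vs #{field_count(data)} values"
end
```

A counter that fails this check would be a candidate for exclusion, or for the column-filtering approach, before its data is trusted.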

Possible Solutions / Workarounds:

soheilpro commented 9 years ago

In my application, I worked around this issue by recording the index of all missing counters (columns with a value of "-1" in the first line) and then removing those columns from all subsequent lines.
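
A rough Ruby sketch of this workaround, using the sample output from this issue, might look as follows. The helper names (split_fields, missing_columns, drop_columns) are hypothetical and are not part of the plugin or of the commenter's application; fields are assumed not to contain embedded double quotes.

```ruby
# Hypothetical helper: split one typeperf CSV line into its quoted fields.
# Assumes no field contains an embedded double quote.
def split_fields(line)
  line.scan(/"([^"]*)"/).flatten
end

# Indexes of the columns whose value in the first data line is "-1".
def missing_columns(first_data_fields)
  first_data_fields.each_index.select { |i| first_data_fields[i] == '-1' }
end

# Remove the recorded columns from one row of fields.
def drop_columns(fields, missing)
  fields.each_with_index.reject { |_, i| missing.include?(i) }.map(&:first)
end

# Header line from the report, unwrapped onto a single line.
header = <<~'CSV'.strip
  "(PDH-CSV 4.0)","\\ABCD1234\HTTP Service Request Queues(???3)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???2)\MaxQueueItemAge","\\ABCD1234\HTTP Service Request Queues(???1)\MaxQueueItemAge","\\ABCD1234\Processor(0)\% Processor Time"
CSV

# First data line from the report, already split into fields.
first = ['10/25/2013 11:25:29.601'] + ['-1'] * 19 + ['0.000000'] * 3 + ['3.664721']

missing = missing_columns(first)        # recorded once, from the first line
cleaned = drop_columns(first, missing)  # applied to this and every later line

# Names and values now line up by position (column 0 is the timestamp).
realigned = split_fields(header).zip(cleaned).to_h
realigned.each { |name, value| puts "#{name} => #{value}" }
```

One limitation of this sketch: a counter whose genuine first sample happens to be -1 would also be dropped, so the approach relies on -1 appearing only in columns that typeperf left unnamed.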