poxet / Influx-Capacitor

Influx-capacitor collects metrics from windows machines using Performance Counters. Data is sent to influxDB to be viewable by grafana.
http://influx-capacitor.com
MIT License
44 stars 13 forks source link

Application stops collecting data when counter is changed #65

Closed daniloalsilva closed 8 years ago

daniloalsilva commented 8 years ago

Hello guys!

I'm facing some issues mentioned on #64 #59 and #47 but i'm not sure that occurs by the same reasons. The servers reporting to InfluxDB constantly stops to send data and I can't figure it out why.

At first instance, I've though that this issue was related with some dynamic counters, like the example bellow (this is the entire configuration but the point is the counter Current Connections):

<?xml version="1.0" encoding="utf-8"?>
<Influx-Capacitor>
  <CounterGroups>
    <CounterGroup Name="web-server" SecondsInterval="60" RefreshInstanceInterval="300">
      <Counter>
        <CategoryName>Active Server Pages</CategoryName>
        <CounterName>Requests Queued</CounterName>
        <InstanceName></InstanceName>
    <FieldName>asp_requests_queued</FieldName>
        <Tag>
          <Name>requests_queue</Name>
          <Value>asp_classic</Value>
        </Tag>
      </Counter>
      <Counter>
        <CategoryName>Active Server Pages</CategoryName>
        <CounterName>Requests/Sec</CounterName>
        <InstanceName></InstanceName>
    <FieldName>asp_requests_sec</FieldName>
        <Tag>
          <Name>requests_per_sec</Name>
          <Value>asp_classic</Value>
        </Tag>
      </Counter>
      <Counter>
        <CategoryName>ASP.NET</CategoryName>
        <CounterName>Requests Current</CounterName>
        <InstanceName></InstanceName>        
    <FieldName>asp_current_requests</FieldName>
      </Counter>
      <Counter>
        <CategoryName>ASP.NET</CategoryName>
        <CounterName>Requests Queued</CounterName>
        <InstanceName></InstanceName>
    <FieldName>aspnet_requests_queued</FieldName>
        <Tag>
          <Name>requests_queue</Name>
          <Value>asp_net</Value>
        </Tag>
      </Counter>
      <Counter>
        <CategoryName>APP_POOL_WAS</CategoryName>
        <CounterName>Current Worker Processes</CounterName>
        <InstanceName>_Total</InstanceName>
    <FieldName>current_worker_processes</FieldName>
        <Tag>
          <Name>web-server-usage</Name>
          <Value>worker-processes-running</Value>
        </Tag>
      </Counter>
      <Counter>
        <CategoryName>W3SVC_W3WP</CategoryName>
        <CounterName>Active Requests</CounterName>
        <InstanceName>_Total</InstanceName>
    <FieldName>active_requests</FieldName>
        <Tag>
          <Name>web-server-usage</Name>
          <Value>active-requests</Value>
        </Tag>
      </Counter>
      <Counter>
        <CategoryName>Web Service</CategoryName>
        <CounterName>Current Connections</CounterName>
        <InstanceName>*</InstanceName>
    <FieldName>current_connections</FieldName>
      </Counter>
    </CounterGroup>
  </CounterGroups>
</Influx-Capacitor>

This configuration brings me counters for every website installed on a server but, because this server got new websites installed and other websites are deleted every time, new counters will always appears or fade away, causing error Engine web-server: Object reference not set to an instance of an object. (#59). Apparently this error was solved after add an value to RefreshInstanceInterval, but the problem of servers stopping send data wasn't.

Right now, I'm stucked with some servers that stopped to send data to the endpoint on a few hours ago but, the last log was wrote about two days ago. The error generated it's available bellow (no logs were generated in same time the Influx-Capacitor stops sent data):

ERROR - Log entry string is too long. A string written to the event log cannot exceed 32766 characters.
System.ArgumentException: Log entry string is too long. A string written to the event log cannot exceed 32766 characters.
   at System.Diagnostics.EventLogInternal.InternalWriteEvent(UInt32 eventID, UInt16 category, EventLogEntryType type, String[] strings, Byte[] rawData, String currentMachineName)
   at System.Diagnostics.EventLogInternal.WriteEntry(String message, EventLogEntryType type, Int32 eventID, Int16 category, Byte[] rawData)
   at System.Diagnostics.EventLog.WriteEntry(String source, String message, EventLogEntryType type, Int32 eventID, Int16 category, Byte[] rawData)
   at Tharga.Toolkit.Console.Command.Base.ServerConsole.WriteLineEx(String value, OutputLevel level)
   at Tharga.Toolkit.Console.Command.Base.SystemConsoleBase.WriteLine(String value, OutputLevel level, Nullable`1 consoleColor)
   at Tharga.InfluxCapacitor.Service.WindowsService.SendBusinessEvent(Object sender, SendCompleteEventArgs e)
   at Tharga.InfluxCapacitor.Collector.Business.SendBusiness.OnSendBusinessEvent(Object sender, SendCompleteEventArgs e)
   at Tharga.Influx_Capacitor.Sender.InfluxDataSender.OnSendBusinessEvent(SendCompleteEventArgs e)
   at Tharga.Influx_Capacitor.Sender.InfluxDataSender.Send()
   at Tharga.InfluxCapacitor.Collector.Business.SendBusiness.Elapsed(Object sender, ElapsedEventArgs e)
ERROR - One or more errors occurred.
. . . 
A LOT OF DATA 
. . .
ERROR - 22/05/2016 06:53:06: An error occurred while sending the request.
WARN - 22/05/2016 06:53:06: Putting 634 points back in the queue.

No other logs are available on Event Viewer or in the log file. The queue is empty. The only other information logged was this one (which doesn't match with the time where server stops sending data):

ERROR - 21/05/2016 06:18:09: InfluxDb API responded with status code=BadRequest, response={"error":"partial write:\nunable to parse 'Influx-Capacitor-Metadata,counter=queueCount,hostname=HOSTNAME,version=1.0.18.0,action=Send,targetServer=http://ENDPOINT,targetDatabase=DATABASE,failMessage=InfluxDb\\ API\\ responded\\ with\\ status\\ code\\=BadRequest\\,\\ response\\={\"error\":\"partial\\ write:\\nunable\\ to\\ parse\\ 'Influx-Capacitor-Metadata\\,counter\\=queueCount\\,hostname\\=HOSTNAME\\,version\\=1.0.18.0\\,action\\=Send\\,targetServer\\=http://ENDPOINT\\,targetDatabase\\=DATABASE\\,failMessage\\=InfluxDb\\\\\\ API\\\\\\ responded\\\\\\ with\\\\\\ status\\\\\\ code\\\\\\=BadRequest\\\\\\,\\\\\\ response\\\\\\={\\\"error\\\":\\\"unable\\\\\\ to\\\\\\ parse\\\\\\ 'Influx-Capacitor-Metadata\\\\\\,counter\\\\\\=queueCount\\\\\\,hostname\\\\\\=HOSTNAME\\\\\\,version\\\\\\=1.0.18.0\\\\\\,action\\\\\\=Send\\\\\\,targetServer\\\\\\=http://ENDPOINT\\\\\\,targetDatabase\\\\\\=DATABASE\\\\\\,failMessage\\\\\\=InfluxDb\\\\\\\\\\\\\\ API\\\\\\\\\\\\\\ responded\\\\\\\\\\\\\\ with\\\\\\\\\\\\\\ status\\\\\\\\\\\\\\ code\\\\\\\\\\\\\\=ServiceUnavailable\\\\\\\\\\\\\\,\\\\\\\\\\\\\\ response\\\\\\\\\\\\\\=\\\\u003chtml\\\\u003e\\\\u003cbody\\\\u003e\\\\u003ch1\\\\u003e503\\\\\\\\\\\\\\ Service\\\\\\\\\\\\\\ Unavailable\\\\u003c/h1\\\\u003e':\\\\\\ missing\\\\\\ fields\\\\nunable\\\\\\ to\\\\\\ parse\\\\\\ 'No\\\\\\\\\\\\\\ server\\\\\\\\\\\\\\ is\\\\\\\\\\\\\\ available\\\\\\\\\\\\\\ to\\\\\\\\\\\\\\ handle\\\\\\\\\\\\\\ this\\\\\\\\\\\\\\ request.':\\\\\\ missing\\\\\\ fields\\\\nunable\\\\\\ to\\\\\\ parse\\\\\\ '\\\\u003c/body\\\\u003e\\\\u003c/html\\\\u003e':\\\\\\ missing\\\\\\ fields\\\"}':\\ missing\\ fields\\nunable\\ to\\ parse\\ 'value\\=1.0\\,queueCountChange\\=-594i\\,sendTimeMs\\=4062.0\\ 1463822245574':\\ invalid\\ field\\ format\"}': missing fields\nunable to parse 'value=31.0,queueCountChange=0i,sendTimeMs=17766.0 1463822279293': invalid field format"}

WARN - 21/05/2016 06:18:09: Dropping 52 since the exception type InfluxDB.Net.Infrastructure.Influx.InfluxDbApiException is not allowed for resend.
ERROR - One or more errors occurred.

This issue is temporarily fixed only when Influx-Capacitor is restarted. I'm using Capacitor 1.0.18.0.

poxet commented 8 years ago

I am not sure on how to solve this. Perhaps the problem is related to the counters that appears and disappeares. In your case when the websites are added and removed.

I will try to have a look at that.

daniloalsilva commented 8 years ago

Hey @poxet, just an update:

I've splitted all counters of 'web-server' CounterGroup for debugging (used to IIS counters) because when Tharga.Toolkit.Console stops working, it stops only for a CounterGroup - other counterGroups like the ones with CPU and memory aren't affected when this problem occurs. I also replace counter Current Connections, collecting only "_Total".

Doing this was possible to trace back when a specific counter stops working by the last commit in Influx DB:

28/06/2016 17:05:00 : Last data available of counter W3SVC_W3WP.Active Requests._Total
28/06/2016 20:25:00 : Last data available of counter APP_POOL_WAS.Current Worker Processes._Total

Here is the Event Viewer related to the problem::

28/06/2016 07:13:32 - IIS-W3SVC-PerfCounters
It has taken too long to refresh the W3SVC counters, the stale counters are being used instead.

28/06/2016 16:51:38 - IIS-W3SVC-PerfCounters
It has taken too long to refresh the W3SVC counters, the stale counters are being used instead.  This is the second message within the past 12:00:00 (hours, minutes, seconds).  No more stale counter messages will be logged for this client session until the time limit expires. 
For additional information specific to this message please visit the Microsoft Online Support site located at: http://go.microsoft.com/fwlink?linkid=538.

28/06/2016 19:30:11 - IIS-W3SVC-PerfCounters
It has taken too long to refresh the W3SVC counters, the stale counters are being used instead.

28/06/2016 23:31:43 - PerfNet
Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.

28/06/2016 23:34:06 - Tharga.Toolkit.Console
Unable to get performance counter .APP_POOL_WAS.Current Worker Processes._Total. Category does not exist.

28/06/2016 23:34:06 - Tharga.Toolkit.Console
Unable to get performance counter .W3SVC_W3WP.Active Requests._Total. Category does not exist.

28/06/2016 23:34:06 - Tharga.Toolkit.Console
Engine web-server: Object reference not set to an instance of an object.

28/06/2016 23:34:08 - Influx-Capacitor
Service started successfully.

And here the logs generated by the application:

WARN - 28/06/2016 23:34:06: Unable to get performance counter .APP_POOL_WAS.Current Worker Processes._Total. Category does not exist.
WARN - 28/06/2016 23:34:06: Unable to get performance counter .W3SVC_W3WP.Active Requests._Total. Category does not exist.
ERROR - 28/06/2016 23:34:06: Engine web-server: Object reference not set to an instance of an object.
WARN - 28/06/2016 23:34:07: Unable to get performance counter .APP_POOL_WAS.Current Worker Processes._Total. Category does not exist.
ERROR - 28/06/2016 23:34:07: Engine web-testing-proc: Object reference not set to an instance of an object.
WARN - 28/06/2016 23:34:07: Unable to get performance counter .W3SVC_W3WP.Active Requests._Total. Category does not exist.
ERROR - 28/06/2016 23:34:07: Engine web-testing-request: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:35:08: Engine web-server: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:35:09: Engine web-testing-proc: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:35:09: Engine web-testing-request: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:36:08: Engine web-server: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:36:09: Engine web-testing-proc: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:36:09: Engine web-testing-request: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:37:08: Engine web-server: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:37:09: Engine web-testing-proc: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:37:09: Engine web-testing-request: Object reference not set to an instance of an object.
ERROR - 28/06/2016 20:38:08: Engine web-server: Object reference not set to an instance of an object.

AND SO ON

If I discover anything else, I'll inform!

daniloalsilva commented 8 years ago

Okay, I think I found the root for this problem! Some counters throws an exception when called and doesn't have any data to return. I've simulated this on a new server who haven't received any web requests for a long time. In this case, instead of reporting "0" to counters, the following exception was throw:

PS C:\Users\Administrator> Get-Counter '\W3SVC_W3WP(_total)\Active Requests'
Get-Counter : No data to return.
At line:1 char:1
+ Get-Counter '\W3SVC_W3WP(_total)\Active Requests'
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidResult: (:) [Get-Counter], Exception
    + FullyQualifiedErrorId : CounterApiError,Microsoft.PowerShell.Commands.GetCounterCommand

PS C:\Users\Administrator> $error | select *

writeErrorStream      : True
PSMessageDetails      :
Exception             : System.Exception: No data to return.
TargetObject          :
CategoryInfo          : InvalidResult: (:) [Get-Counter], Exception
FullyQualifiedErrorId : CounterApiError,Microsoft.PowerShell.Commands.GetCounterCommand
ErrorDetails          :
InvocationInfo        : System.Management.Automation.InvocationInfo
ScriptStackTrace      : at <ScriptBlock>, <No file>: line 1
PipelineIterationInfo : {0, 1}

When I hit a http page for a website in this server, even if the request wasn't live, the counter starts to return "0":

PS C:\Users\Administrator> Get-Counter '\W3SVC_W3WP(_total)\Active Requests'

Timestamp                 CounterSamples
---------                 --------------
6/29/2016 8:58:44 PM      \\pleskcw0093\w3svc_w3wp(_total)\active requests :
                          0

PS C:\Users\Administrator> Get-Counter '\W3SVC_W3WP(*)\Active Requests'

Timestamp                 CounterSamples
---------                 --------------
6/29/2016 9:00:41 PM      \\pleskcw0093\w3svc_w3wp(_total)\active requests :
                          0

                          \\pleskcw0093\w3svc_w3wp(6088_pleskcontrolpanel)\active requests :
                          0

I think we can prevent this problem with a try/catch when counter returns this specific exception.

poxet commented 8 years ago

Great that you found that! Thanks!