poxet / Influx-Capacitor

Influx-Capacitor collects metrics from Windows machines using Performance Counters. Data is sent to InfluxDB so it can be viewed in Grafana.
http://influx-capacitor.com
MIT License

Service stops to send collected data #47

Closed marcelopetersen closed 8 years ago

marcelopetersen commented 8 years ago

Hi,

I'm facing an issue with the agent: sometimes the collector service is up but no data is sent to InfluxDB. I've installed the agent on a lot of servers, and many of them show the same behavior. Restarting the service fixes the issue, but it happens again after an unpredictable amount of time.

Have you ever faced this behavior?

Is there an option to enable tracing or detailed logging to help understand what is happening?

Regards,

poxet commented 8 years ago

This is something I have not encountered.

What you can do is start the console application and check the queue. Type "sender queue".

That shows how many measurement points are still in the queue waiting to be sent. The queue should be flushed at an interval, when data is sent to InfluxDB. If it only ever increases, something is wrong.
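As a rough illustration of that check, here is a minimal Python sketch. The function name is hypothetical, and the numbers are assumed to be read by hand from the "sender queue" output at a few points in time; nothing here is part of the console application itself.

```python
def queue_only_grows(samples):
    """Return True if every sample is larger than the one before it.

    A healthy sender flushes the queue at intervals, so at least one
    sample should drop compared to its predecessor.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Queue lengths noted at successive checks (made-up values).
healthy = queue_only_grows([120, 250, 0, 130])
stuck = queue_only_grows([120, 250, 380, 510])
```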

Logging is not a bad idea. It would help a lot.

Are you using the SafeCollectorEngine or ExactCollectorEngine? (SafeCollectorEngine is default)

marcelopetersen commented 8 years ago

I'm using the default collector engine (SafeCollectorEngine).

I've noticed that sometimes, when the files are replaced by Group Policy (I'm using GPO to distribute all settings/counters files to the servers), the service stops working and I need to restart it.

Yesterday I changed the GPO configuration to "Create" instead of "Replace" files, and today only a small portion of the servers stopped reporting (almost 20 of 150). The night before yesterday, it was almost all of the servers.

With this new approach, the configuration files are created only if they don't already exist, so if I change something in them, the servers will not receive the changes. I'll investigate a little more.

poxet commented 8 years ago

So the issue can have something to do with the configuration files? If so, would issue #29 solve this problem?

marcelopetersen commented 8 years ago

Normally, using Group Policy is a good way to ensure that configuration files are the same across the whole environment (centralizing in Active Directory is easier than using a different location for each application).

Analyzing the servers, I found that some errors were logged in Event Viewer after I restarted the service to fix the issue:

Log Name: Application
Source: Influx-Capacitor
Date: 25/02/2016 12:23:32
Event ID: 0
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Service started successfully.

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 25/02/2016 12:23:30
Event ID: 0
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Unable to get performance counter Active Server Pages.Requests Queued.. Category does not exist.

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 25/02/2016 12:23:32
Event ID: 0
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Engine web-server: Object reference not set to an instance of an object.

"web-server" is a counter group in my counters file and, as you can see above, when a category is not found, an error is raised instead of the counter being ignored. If the service logged to Event Viewer that it is ignoring the counter, that would be very useful for troubleshooting.
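The requested behavior could look roughly like the following. This is a minimal Python sketch, not the project's actual C# code: `resolve_counters`, `available_categories` and the logger name are hypothetical names made up for illustration. A counter whose category is missing is skipped with a warning instead of being registered as an empty entry that fails later.

```python
import logging

logger = logging.getLogger("collector-sketch")  # hypothetical logger name


def resolve_counters(requested, available_categories):
    """Return only the counters whose category exists on this machine.

    `requested` is a list of (category, counter) tuples and
    `available_categories` is a set of category names; both are
    stand-ins for whatever the real collector uses.
    """
    resolved = []
    for category, counter in requested:
        if category not in available_categories:
            # Log and skip instead of registering a null placeholder.
            logger.warning("Ignoring counter %s.%s: category does not exist.",
                           category, counter)
            continue
        resolved.append((category, counter))
    return resolved


requested = [
    ("Active Server Pages", "Requests Queued"),
    ("Memory", "Available MBytes"),
]
usable = resolve_counters(requested, available_categories={"Memory"})
```

With this approach the rest of the counter group keeps reporting, and the warning still shows which counter was left out.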

Other errors that I've found (these messages were sent to InfluxDB after I enabled the metadata property):

- A task was canceled.

Googling for these messages, it seems they are generated when a request times out: http://stackoverflow.com/questions/29179848/httpclient-a-task-was-cancelled

Currently, it is not possible to define the request timeout via the configuration file.

poxet commented 8 years ago

Yes, that should most surely be improved! I will tag this as a bug.

tbolon commented 8 years ago

Perhaps it is related to this line: https://github.com/poxet/Influx-Capacitor/blob/master/Tharga.Influx-Capacitor.Collector/Business/CounterBusiness.cs#L108

I do not understand the goal of registering a performanceCounterInfos when no performanceCounters have been found. Besides, passing null as PerformanceCounter will result in an NRE (on the master branch there are now tests against null).

Perhaps these lines should simply be removed?

marcelopetersen commented 8 years ago

Looking at more servers that stopped sending data, I found more errors in Event Viewer:

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 26/02/2016 05:07:35
Event ID: 0
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Engine ftp-server: Collection was modified after the enumerator was instantiated.

I couldn't identify the cause of the error, because the counters in the "ftp-server" group have no instance queries and never change:

<CounterGroup Name="ftp-server" SecondsInterval="60" RefreshInstanceInterval="0">
  <Counter>
    <CategoryName>Microsoft FTP Service</CategoryName>
    <CounterName>Current Connections</CounterName>
    <InstanceName>_Total</InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Microsoft FTP Service</CategoryName>
    <CounterName>Bytes Received/sec</CounterName>
    <InstanceName>_Total</InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Microsoft FTP Service</CategoryName>
    <CounterName>Bytes Sent/sec</CounterName>
    <InstanceName>_Total</InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Microsoft FTP Service</CategoryName>
    <CounterName>Bytes Total/sec</CounterName>
    <InstanceName>_Total</InstanceName>
  </Counter>
</CounterGroup>

And the same occurs with the "memory" counter group:

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 26/02/2016 02:00:32
Event ID: 0
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Engine memory: Collection was modified after the enumerator was instantiated.

Counter group:

<CounterGroup Name="memory" SecondsInterval="60" RefreshInstanceInterval="0">
  <Counter>
    <CategoryName>Memory</CategoryName>
    <CounterName>% Committed Bytes In Use</CounterName>
    <InstanceName></InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Memory</CategoryName>
    <CounterName>Available MBytes</CounterName>
    <InstanceName></InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Memory</CategoryName>
    <CounterName>Committed Bytes</CounterName>
    <InstanceName></InstanceName>
  </Counter>
  <Counter>
    <CategoryName>Memory</CategoryName>
    <CounterName>Pages/sec</CounterName>
    <InstanceName></InstanceName>
  </Counter>
</CounterGroup>
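As a general note, "Collection was modified after the enumerator was instantiated" is the message of a .NET InvalidOperationException raised when a collection is changed while it is being enumerated. That would point at code touching a shared collection, not necessarily at the counter definitions themselves. Where this happens in the collector is not established in this thread; the sketch below is only a Python analogue with made-up function names (Python raises RuntimeError when a dict changes size during iteration), showing the failure and the usual fix of iterating over a snapshot.

```python
def collect_unsafe(counters):
    """Mutates the dict while iterating over it; raises RuntimeError."""
    for name in counters:
        if counters[name] is None:
            del counters[name]  # changes the dict mid-iteration


def collect_safe(counters):
    """Iterates over a snapshot of the keys, so removal is safe."""
    for name in list(counters):
        if counters[name] is None:
            del counters[name]
    return counters


failed = False
try:
    collect_unsafe({"memory": None, "ftp-server": 1})
except RuntimeError:
    failed = True

cleaned = collect_safe({"memory": None, "ftp-server": 1})
```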

And more errors related to .NET tasks:

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 26/02/2016 01:25:31
Event ID: 0
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Dropping 125 since the exception type System.Threading.Tasks.TaskCanceledException is not allowed for resend.

marcelopetersen commented 8 years ago

Another kind of exception on servers that stopped sending data:

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 26/02/2016 10:06:30
Event ID: 0
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: ComputerName
Description: Dropping 125 since the exception type System.InvalidOperationException is not allowed for resend.
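The "Dropping 125 ..." warnings suggest that the sender keeps failed points only for some exception types and discards them for others. Below is a minimal sketch of that kind of policy, assuming hypothetical names (`RETRYABLE`, `handle_send_failure`) and Python's built-in exceptions standing in for the .NET ones. It is not the project's implementation.

```python
from collections import deque

# Hypothetical allow-list: failures of these types are worth retrying.
RETRYABLE = (ConnectionError, TimeoutError)


def handle_send_failure(queue, points, error):
    """Requeue `points` if `error` is retryable, otherwise drop them.

    Returns the number of points that were dropped.
    """
    if isinstance(error, RETRYABLE):
        # Put the batch back at the front so ordering is preserved.
        queue.extendleft(reversed(points))
        return 0
    return len(points)


queue = deque()
dropped_timeout = handle_send_failure(queue, [1, 2, 3], TimeoutError("timed out"))
dropped_other = handle_send_failure(queue, [4, 5], ValueError("bad state"))
```

Under a policy like this, treating timeouts as retryable would keep the points in the queue instead of dropping them. Whether that is desirable depends on how long the server stays unreachable.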

tbolon commented 8 years ago

FYI, the collector still works in our case with no interruption since we upgraded to the latest version.

poxet commented 8 years ago

Cool! I will close this issue then. (Reopen if it appears again)