poxet / Influx-Capacitor

Influx-Capacitor collects metrics from Windows machines using Performance Counters. Data is sent to InfluxDB to be viewable in Grafana.
http://influx-capacitor.com
MIT License

Stop collecting after midnight #55

Closed tbolon closed 8 years ago

tbolon commented 8 years ago

For some reason, all our servers stopped sending data to InfluxDB after midnight yesterday. I was able to reproduce the problem using the server console:

InfluxDb API responded with status code=BadRequest, response={"error":"write failed: field type conflict: input field \"queueCountChange\" on measurement \"Influx-Capacitor-Metadata\" is type float64, already exists as type integer"}

For now I am looking at the code to see if I can find the culprit.

The installed version was 1.0.15.

tbolon commented 8 years ago

As a workaround, we have dropped the measurement Influx-Capacitor-Metadata, and it has solved the problem.

fzavalloni commented 8 years ago

I have done this but even so it stops reporting for a specific period of time.

poxet commented 8 years ago

It might depend on the component used to send data to influx. If the first value is an integer, then the counter is set to that type. If the following value is a decimal type, then it will crash.

I will have a look at that component; it should send "0.0" for "0" values if the field is a decimal type.

Perhaps the code only checks for float32, and not float64. I'll check that.
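To illustrate the behaviour described above, here is a minimal Python sketch of a client-side guard that remembers the type a field was first written with and flags a later write of a different type. The names `FieldTypeRegistry` and `FieldTypeConflict` are hypothetical; this is neither InfluxDB's implementation nor Influx-Capacitor's code, only a model of the conflict reported in the error message.

```python
class FieldTypeConflict(Exception):
    """Raised when a field is written with a type other than its first-seen type."""


class FieldTypeRegistry:
    """Hypothetical helper: tracks the first-seen type of each (measurement, field)."""

    def __init__(self):
        self._types = {}

    def check(self, measurement, field, value):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"unsupported field value: {value!r}")
        kind = "integer" if isinstance(value, int) else "float64"
        # The first write decides the type; later writes must match it.
        known = self._types.setdefault((measurement, field), kind)
        if known != kind:
            raise FieldTypeConflict(
                f'input field "{field}" on measurement "{measurement}" '
                f"is type {kind}, already exists as type {known}"
            )
        return kind
```

Under this model, converting every value of a decimal counter with `float(value)` before the first write would register the field as float64 from the start, so a later fractional value would no longer conflict.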

poxet commented 8 years ago

What version of InfluxDB are you using? There was a number format change in version 0.9.3. Were you using an earlier version when you got the "type float64, already exists as type integer" issue?

tbolon commented 8 years ago

It is the latest version (0.10).

poxet commented 8 years ago

As I understand it, the new version of InfluxDB should "upgrade" fields from int to float automatically if data is not specifically sent with a trailing i (e.g. 12345i). I am not sure what to make of this issue, or how to solve it.
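For reference, the trailing-i convention can be sketched in Python as follows. This is only an illustration of how a client might render field values in InfluxDB line protocol so that integers carry the `i` suffix and floats always contain a decimal point. The function names are hypothetical, tags and timestamps are left out, and it is not the code used by Influx-Capacitor or InfluxDB.Net.

```python
import math
from decimal import Decimal


def format_field_value(value):
    """Render a numeric field value: ints get a trailing 'i', floats a decimal point."""
    if isinstance(value, bool):
        raise TypeError("booleans are out of scope for this sketch")
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinity cannot be written")
        # Expand scientific notation (e.g. 1e-10) into plain decimal digits.
        text = format(Decimal(repr(value)), "f")
        return text if "." in text else text + ".0"
    raise TypeError(f"unsupported field value: {value!r}")


def _escape(text, chars):
    # Backslash-escape the characters that are special in this position.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_line(measurement, fields):
    """Build one line-protocol line without tags or timestamp."""
    rendered = ",".join(
        f"{_escape(key, ',= ')}={format_field_value(val)}"
        for key, val in fields.items()
    )
    return f"{_escape(measurement, ', ')} {rendered}"
```

With these helpers, `format_line("Influx-Capacitor-Metadata", {"queueCountChange": 0.0})` produces `Influx-Capacitor-Metadata queueCountChange=0.0`, whereas passing the Python integer `0` produces `queueCountChange=0i`.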

tbolon commented 8 years ago

Here we go again:

InfluxDb API responded with status code=BadRequest, response={"error":"write failed: field type conflict: input field \"queueCountChange\" on measurement \"Influx-Capacitor-Metadata\" is type float64, already exists as type integer"}
> Dropping 90 since the exception type InfluxDB.Net.InfluxDbApiException is not allowed for resend. 

We stopped collecting data at midnight GMT exactly (0h50 GMT+1)

I do not understand why this field should suddenly be sent as float64. Last time we dropped the Influx-Capacitor-Metadata measurement to start collecting again. This time I have not touched anything. Restarting the service did not help.

poxet commented 8 years ago

I think the only way forward with this is to collect more logging data to see what happens. So, I will add logging for the exception "Dropping x since the exception type y is not allowed for resend". If we have exact information about what that message contains, then perhaps we can find out what is wrong.

Perhaps it could be a small decimal number like 0.0000000001 that the counter cannot handle, who knows. :)

Perhaps the hint is that a float64 is being sent instead of a float32. (I am not sure which is the common format.) So, logging would be the next step, I think.

poxet commented 8 years ago

An official version, 1.0.17.0, has now been put in production with debugging disabled. To get the correct logging, enable the content of the log4net section in the service config file and make sure that you have write permissions to the log path. To verify that log data is written, try running the service with the setting <level value="ALL" />, but use <level value="WARN" /> for normal usage, since the log will otherwise contain too much data.
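As a rough illustration, an enabled log4net section could look like the fragment below. Only the two <level value="..." /> settings come from this thread; the appender name, file path and layout pattern are assumptions made for the example, and the section shipped with the service may differ.

```xml
<log4net>
  <!-- Assumed appender: name, path and pattern are illustrative only. -->
  <appender name="RollingFile" type="log4net.Appender.RollingFileAppender">
    <!-- The service account needs write permissions to this path. -->
    <file value="C:\Logs\influx-capacitor.log" />
    <appendToFile value="true" />
    <layout type="log4net.Layout.PatternLayout">
      <conversionPattern value="%date [%thread] %-5level %logger - %message%newline" />
    </layout>
  </appender>
  <root>
    <!-- Use ALL while verifying or investigating, WARN for normal usage. -->
    <level value="WARN" />
    <appender-ref ref="RollingFile" />
  </root>
</log4net>
```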

As soon as the issue appears all data points are written to the log, and we can start investigating why this error appears.

tbolon commented 8 years ago

We have installed this version and are waiting for the problem to occur again. We will then enable the ALL level.

poxet commented 8 years ago

I will close this issue. (Reopen if it appears again)