poxet / Influx-Capacitor

Influx-Capacitor collects metrics from Windows machines using Performance Counters. Data is sent to InfluxDB so that it can be viewed in Grafana.
http://influx-capacitor.com
MIT License

Service fails to send data just after startup then never succeeds #24

Closed zeugfr closed 8 years ago

zeugfr commented 8 years ago

I have a strange bug on several setups. It has a big impact because I am trying to roll out Influx-Capacitor across my servers and I get many failures.

It seems to occur more frequently with many counters and high collection rates (in my case: about 800 counters collected every 10 seconds, refreshed every 30 collections, and sent to the DB every 20 seconds). In these situations, errors like "engine my.custom.metrics": Dropping x keys. appear in the event log. After that, no metrics reach InfluxDB, the process takes more and more memory, and it does not respond to stop-service requests. Lowering the collection rate or the DB send rate seems to make this happen somewhat less often (at the beginning my setup collected and sent to the DB every 5 seconds).

The behaviour is also reproducible from the console with the "counter collect" command.

Perhaps it has something to do with the first connection to InfluxDB? Our environment feeds two InfluxDB instances, but I also had the problem with only one.

poxet commented 8 years ago

Perhaps it is the large number of counters that causes the problem. I will have a look at this so that the service does not crash.

If the interval is set to 10 seconds and a collection of counter data cannot start at its scheduled time (because the previous collection took too long), then that read is "dropped" so that the following read can be done on time. That is the idea, at least.

I'll have a look at the problem.

Elufimov commented 8 years ago

I have run into this issue too. I have ~1000 metrics.

poxet commented 8 years ago

I have created a different type of collector, so now there are two to choose from.

The default one is "Safe". It takes its time to read the counters, then it waits for the time set as the interval.

The other one is "Exact". It reads on a fixed interval. If it does not manage to read within the given time frame, it drops the read so that the next one can be read on time.

For a large number of counters, use "Safe" (the default). For a small number of counters, use "Exact" if you want the counters to be read at the same time on every cycle.

This is included in version 1.0.13.
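The timing difference between the two modes can be sketched as a small simulation. This is only an illustration of the description above, not the actual Influx-Capacitor code: the function names `safe_schedule` and `exact_schedule` are hypothetical, read durations are supplied as plain numbers instead of being measured, and "Exact" is assumed to drop any tick that arrives while the previous read is still running.

```python
from typing import List, Tuple


def safe_schedule(durations: List[float], interval: float) -> List[float]:
    """Start times when every read is followed by a full `interval` delay."""
    starts = []
    now = 0.0
    for duration in durations:
        starts.append(now)
        # The read takes `duration` seconds, then the collector waits `interval`.
        now += duration + interval
    return starts


def exact_schedule(
    durations: List[float], interval: float, ticks: int
) -> Tuple[List[float], List[float]]:
    """Reads are attempted on fixed ticks 0, interval, 2 * interval, ...

    Assumption: a tick is dropped when the previous read is still running.
    Returns (ticks where a read started, ticks that were dropped).
    """
    started: List[float] = []
    dropped: List[float] = []
    busy_until = 0.0
    remaining = iter(durations)
    for k in range(ticks):
        tick = k * interval
        if tick < busy_until:
            dropped.append(tick)  # previous read overran this tick
            continue
        duration = next(remaining, None)
        if duration is None:
            break  # no more simulated reads
        started.append(tick)
        busy_until = tick + duration
    return started, dropped


if __name__ == "__main__":
    # Three reads taking 4 s, 12 s and 4 s with a 10 s interval.
    print(safe_schedule([4.0, 12.0, 4.0], 10.0))      # [0.0, 14.0, 36.0]
    print(exact_schedule([4.0, 12.0, 4.0], 10.0, 5))  # ([0.0, 10.0, 30.0], [20.0])
```

In this sketch "Safe" never drops a read but its start times drift, while "Exact" stays on the 10-second grid and drops the tick at 20 s because the 12-second read is still in progress.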

zeugfr commented 8 years ago

Thanks, I'll give it a try and give you feedback on the behaviour. :)

I am not completely sure that this is the root cause, because I can hit the issue when I restart the same Influx-Capacitor several times (same counters, same targets). And when it fails, it does not fail intermittently: it fails at first, then never succeeds, and takes more and more memory (it does not seem to discard unsent counters). Strangely, I managed to get very few errors by collecting every 10 seconds and sending to InfluxDB every ... 5 seconds! (with about 850 counters).

poxet commented 8 years ago

Tell me if you still find problems with this. Then we can reopen the case and continue working on it.