influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Add aggregator to calculate derivative and non_negative_derivative #989

Closed smalenfant closed 3 years ago

smalenfant commented 8 years ago

I have to aggregate reads/writes over 24 disk devices to see throughput. The problem is that InfluxDB doesn't behave well with derivative, and CQs with derivative don't work at all (up to 0.11). Having the per-second rates calculated by Telegraf directly would be great, mostly for read_bytes and write_bytes but not limited to these. The same would apply to the net module. Overall, even if derivative worked, this would make it much simpler to sum rates, since it would require only a single query.

Example of the query used, which shows a spike at the beginning and end:

SELECT derivative(sum("read_bytes"), 10s) AS "Read" FROM "diskio" WHERE "hostname" =~ /$Hostname$/ AND $timeFilter GROUP BY time(1m) fill(null)
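For illustration, here is a minimal sketch of the calculation being requested: turn each device's cumulative counter into a per-second rate, then sum the rates across devices. This is not Telegraf code; the function name `per_second_rates`, the tuple layout, and the sample values are all hypothetical.

```python
from collections import defaultdict


def per_second_rates(samples):
    """Sum per-device rates at each timestamp.

    samples: iterable of (timestamp_seconds, device, counter_value),
    ordered by time. Layout is an assumption made for this sketch.
    """
    last = {}                    # device -> (timestamp, counter_value)
    totals = defaultdict(float)  # timestamp -> summed rate over devices
    for ts, device, value in samples:
        if device in last:
            prev_ts, prev_value = last[device]
            elapsed = ts - prev_ts
            # Skip decreasing values (reset or wraparound) instead of
            # emitting a negative rate.
            if elapsed > 0 and value >= prev_value:
                totals[ts] += (value - prev_value) / elapsed
        last[device] = (ts, value)
    return dict(totals)


# Hypothetical read_bytes samples for two devices, 10 seconds apart.
samples = [
    (0, "sda", 1000), (0, "sdb", 5000),
    (10, "sda", 3000), (10, "sdb", 5500),
    (20, "sda", 3000), (20, "sdb", 9500),
]
rates = per_second_rates(samples)
```

The first sample of each device only seeds the state, so no rate is emitted for it. That avoids a spike at the start of the series.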

wgrcunha commented 8 years ago

Same here

👍

phemmer commented 8 years ago

You want derivative(last("read_bytes")) (last, not sum). sum is going to add all the results in a time-group together. If you have one time-group that looks like [1,2] and the next one is [4,6,9], summation is going to give you values of 3 & 19, and it'll look like you have a read_bytes difference of 16, when really the bytes read between the two time-groups is 7 (which is 9-2).
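The arithmetic in the [1,2] and [4,6,9] example can be checked with a few lines of Python. This only reproduces the numbers from the comment; it does not model how InfluxDB evaluates the query.

```python
# Two consecutive time-groups of a cumulative counter, as in the comment.
groups = [[1, 2], [4, 6, 9]]

# Aggregating with sum() adds every sample in a group together.
sums = [sum(g) for g in groups]
diff_of_sums = sums[1] - sums[0]

# Aggregating with last() keeps only the final sample of each group.
lasts = [g[-1] for g in groups]
diff_of_lasts = lasts[1] - lasts[0]

print(sums, diff_of_sums)    # inflated difference
print(lasts, diff_of_lasts)  # actual growth of the counter
```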

shanielh commented 8 years ago

I have the same problem with the net plugin: all the fields there are cumulative, and I'm using InfluxDB 0.9.

I can submit a PR if you guys want.

MarkMartinec commented 7 years ago

I would suggest generalizing the title of this PR, as dealing with counters is a common issue, not specific to diskio.

Currently InfluxDB (1.1) is still unable to deal with counter wraparounds; the non_negative_derivative transformation looks like an ugly hack to deal with this. It is also unable to provide something like 'a 5 minute maximum of a one-second counter rate'.

Delegating deriving of a counter rate to telegraf (e.g. as its aggregator plugin) would solve both InfluxDB issues. It also seems that telegraf is in a better position to recognize and handle counter wraparounds and provide a proper derivative.
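As a rough sketch of what wraparound-aware rate calculation could look like, the function below assumes the counter's width is known (32 bits here) and that it wrapped at most once between two samples. Both are assumptions; from the raw numbers alone a wraparound cannot be told apart from a counter reset. The name `counter_rate` and its parameters are hypothetical and not part of Telegraf or InfluxDB.

```python
def counter_rate(prev, curr, elapsed_seconds, max_value=2**32):
    """Per-second rate between two samples of a wrapping counter.

    Assumes the counter counts modulo max_value and wrapped at most
    once between the two samples (an assumption, not a guarantee).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr - prev
    if delta < 0:
        # Counter went backwards: treat it as a single wraparound.
        delta += max_value
    return delta / elapsed_seconds


# Normal case: the counter grew by 500 in 10 seconds.
normal = counter_rate(1_000, 1_500, 10)

# Wraparound case: 296 counts up to the 32-bit limit, then 704 more.
wrapped = counter_rate(4_294_967_000, 704, 10)
```

By contrast, a non-negative derivative simply discards the interval in which the counter decreased, so the counts on either side of the wrap are lost.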

phemmer commented 7 years ago

It is also unable to provide something like 'a 5 minute maximum of a one-second counter rate'.

Subqueries should be able to handle this: https://github.com/influxdata/influxdb/pull/7646

It also seems that telegraf is in a better position to recognize and handle counter wraparounds and provide a proper derivative.

How do you figure? Telegraf is just reading a number. It doesn't know when the value is going to reset. Whatever logic telegraf uses, influxdb can use too.

I would also argue that this problem is not specific to telegraf. Thus addressing it in influxdb solves it everywhere.

vishiy commented 5 years ago

Is there a way to convert cumulative/total metrics to a 'rate' through aggregators in telegraf? I am looking at net metrics like bytes_recv and bytes_sent, and charting them looks like a mountain slope :) I am not using the InfluxDB output, so I cannot use derivative.

danielnelson commented 5 years ago

@vishiy This pull request would implement it #4435

phemmer commented 5 years ago

Depending on whatever format you're outputting the data in, you may also use kapacitor. It can do all sorts of data processing, and then forward it on. But I don't think it supports all the destinations & formats telegraf does.

vishiy commented 5 years ago

@vishiy This pull request would implement it #4435

@danielnelson - thank you. Do you know when this will be merged and available?

danielnelson commented 5 years ago

I'm not sure, I haven't had time to look at it closely yet.

aurimasplu commented 5 years ago

I also agree with @MarkMartinec that this could be generalized as a common counter issue. It would solve a lot of problems if telegraf were able to aggregate derivatives (rates or deltas) and write already calculated values to the output. In similar issues, Influx developers answered that writing raw counter values to the database gives more flexibility. At first glance, yes; in practice, no, because that flexibility brings a lot of limitations, especially in Grafana and Elasticsearch (in my case). For example, Elasticsearch cannot sort on a derivative aggregation. Grafana loading time is also impacted, because in the background it has to load 3-4 times more data than when just loading avg.

megastef commented 5 years ago

Is there a plan for a "processing" plugin that aggregates derivatives (rates or deltas)? Would you accept a PR for such a processor?

I developed several monitoring agents in nodejs. I used deltas for all growing metrics, but could not find anything in telegraf processor plugins.

Example in my nginx agent: https://github.com/sematext/sematext-agent-nginx/blob/master/lib/aggregator.js#L31-L37

Always growing values are really a bit of a problem, depending on your data store and query/visualisation options. Please don't assume everybody is using InfluxDb.

I would really like to see a processor for aggregate derivatives (rates or deltas), and I would also contribute code, but unfortunately my Go skills are "read-only".

danielnelson commented 5 years ago

Yes, there is a pull request open to do deltas https://github.com/influxdata/telegraf/pull/4435, and I am planning to take a look at it as soon as I can.

Please don't assume everybody is using InfluxDb.

Can I ask what store/vis you are using that doesn't support counters?

megastef commented 5 years ago

I use Sematext Cloud - currently, charts would render counters with the always growing absolute value. Therefore Sematext agents calculate delta or rates. This might change one day, nevertheless, I think the possibility to calculate deltas and rates is really a handy feature for any monitoring agent. On the client-side, it needs fewer calculations at query-time or render-time and users don't even need to think of the problem.

reimda commented 3 years ago

#4435 and #3762 provide difference and derivative functionality that should cover the use cases described here. Please try them out. If they don't meet your needs, please open new issues with the details and refer to this issue. Thanks!