influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

aggregators.derivative should support non-negative #10668

Open nward opened 2 years ago

nward commented 2 years ago

Feature Request

Proposal:

When calculating a derivative with the derivative aggregator plugin, the calculated value can sometimes be less than zero, which in many cases (such as with counters) is not desired.

The proposal is to add an option to the aggregator plugin which assumes that the previous value was zero if the current value is less than the previous value.
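
A minimal sketch of the proposed rule (the standalone function and its name are illustrative only, not existing Telegraf code):

```go
package main

import "fmt"

// nonNegativeDerivative sketches the proposed option: if the current value is
// less than the previous one (for example after a counter reset), assume the
// previous value was zero instead of producing a negative rate.
func nonNegativeDerivative(prev, cur, elapsedSeconds float64) float64 {
	if cur < prev {
		// Counter reset assumed: treat the previous value as zero.
		prev = 0
	}
	return (cur - prev) / elapsedSeconds
}

func main() {
	fmt.Println(nonNegativeDerivative(100, 120, 10)) // normal increase: 2
	fmt.Println(nonNegativeDerivative(120, 10, 10))  // reset: (10 - 0) / 10 = 1
}
```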

Current behavior:

aggregators.derivative can return values which are less than zero, which is not desirable in many cases.

Desired behavior:

aggregators.derivative returns only non-negative values when configured to do so.

Use case:

SNMP polling can return a value lower than the previous one if a device restarts, an interface is deleted and re-created, or (with some vendors) if counters are cleared. We need the derivative to stay non-negative in these cases, as a negative rate is not valid.

KarstenSchnitter commented 2 years ago

I think there are two aspects to this:

The first is the calculation and pushing of the derivative at this line: https://github.com/influxdata/telegraf/blob/3c600cde72fd4b6ef3898936815f2aec88b8a066/plugins/aggregators/derivative/derivative.go#L193 At this point the plugin could decide not to emit negative values.

The second point concerns the roll-over behaviour that handles empty or single-value aggregation intervals. I think it would be a good idea to replace the stored first element of an aggregate when a smaller metric value is encountered in https://github.com/influxdata/telegraf/blob/3c600cde72fd4b6ef3898936815f2aec88b8a066/plugins/aggregators/derivative/derivative.go#L99-L112 That would ensure non-negative values in the later aggregation. Because of these semantics, this feature would be less a non-negative derivative than support for monotonically increasing time series.
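
A rough sketch of that second idea (the event and aggregate types here are simplified stand-ins for the plugin's internal state, not the actual derivative aggregator code):

```go
package main

import "fmt"

// event and aggregate are simplified stand-ins for the derivative
// aggregator's internal state; this is only a sketch of the proposed
// roll-over handling, not the plugin's actual code.
type event struct {
	value float64
	time  float64 // seconds
}

type aggregate struct {
	first event
	last  event
}

// add records a new measurement. If the new value is smaller than the stored
// first value (e.g. after a counter reset), the first element is replaced so
// that a derivative computed from (first, last) stays non-negative.
func (a *aggregate) add(e event) {
	if e.value < a.first.value {
		a.first = e
	}
	a.last = e
}

func main() {
	a := aggregate{first: event{value: 500}, last: event{value: 500}}
	a.add(event{value: 650, time: 10})
	a.add(event{value: 0, time: 20}) // counter reset: becomes the new first element
	a.add(event{value: 40, time: 30})
	// (40 - 0) / (30 - 20) = 4 instead of a negative rate
	fmt.Println((a.last.value - a.first.value) / (a.last.time - a.first.time))
}
```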

Hipska commented 2 years ago

I think the request for a non-negative derivative is a valid one. But I don't do these calculations with Telegraf; I always store the raw counter values directly into the DB. The DBs can then handle these situations much better.

nward commented 2 years ago

Hi @KarstenSchnitter, I don't think simply not emitting negatives is the right approach. Better to assume the previous value was zero. I'm going to have to have more of a think about your second point, as it's a bit late and has been a long day :-)

@Hipska yeah, I am pushing this into the poller because we almost never want the raw value, so calculating the derivative many times as dashboards reload is pretty inefficient. Performance ends up being really quite poor if we calculate derivatives for, say, 30 series with 24h of 1m data; if we only have to fetch the 30 already-aggregated series it's significantly faster. It did occur to me that other pollers understand values as counter vs. gauge and can calculate non-negative derivatives on counters automatically. Maybe that's a wider question to consider...

Hipska commented 2 years ago

BTW can't what you want be done by the aggregators.basicstats plugin?

KarstenSchnitter commented 2 years ago

@nward my second suggestion would not just omit values. Changing the aggregate bounds would effectively shorten the interval but provide a value. Say we get the following series:

example value=100 0
example value=120 10
example value=0 20
example value=10 30

It would then have the following aggregation bounds after each step:

| time | value | aggregation.first | aggregation.last | derivative |
| ---- | ----- | ----------------- | ---------------- | ---------- |
| 0    | 100   | (0, 100)          | (0, 100)         | none       |
| 10   | 120   | (0, 100)          | (10, 120)        | 2          |
| 20   | 0     | (20, 0)           | (20, 0)          | none       |
| 30   | 10    | (20, 0)           | (30, 10)         | 1          |

In cases where no derivative can be provided, the single value would roll over to the next period. This helps to provide a derivative in the next period.

The situation at time 20 is similar to receiving just one value in the aggregation period. Currently the plugin would not emit a derivative in the first such period either. With a counter reset, this behaviour might be encountered when the reset event is not followed by another event in that period.

@Hipska this example should show that these plugins behave differently when the metrics interval and the aggregation period are equal or very close to each other. My second proposal would improve on the reset situation compared to the basicstats aggregator.

nward commented 2 years ago

@KarstenSchnitter sorry for the delay getting back to you.

I think your suggestion makes sense - but only because the Flux derivative() function appears to do the same.

I note that the Flux documentation says:

... if a value is less than the previous value, it is assumed the previous value should have been a zero.

Yet, the code does not actually seem to match this: https://github.com/influxdata/flux/blob/003954e0933e0dd05ce400858169e22a28053670/stdlib/universe/derivative.gen.go#L76-L79

        if d.nonNegative && pv > cv {
            // The previous value is greater than the current
            // value and non-negative was set.
            b.AppendNull()

Am I reading that right? That is of course a Flux query, rather than Telegraf..

The logic in my original proposal - i.e. to assume the previous value was 0 - was based on this documentation.

I will have a stab at implementing this to match the logic in Flux.
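
To make the contrast concrete, here is a rough sketch of what the quoted Flux lines appear to do when nonNegative is set, as opposed to the assume-previous-was-zero rule from the original proposal (the function name is made up for illustration; this is neither the Flux nor the Telegraf implementation):

```go
package main

import "fmt"

// derivativeDropOnDecrease sketches the behaviour of the quoted Flux snippet:
// when the previous value is greater than the current one and non-negative is
// requested, no value is emitted at all (Flux appends a null), rather than
// assuming the previous value was zero as the documentation describes.
func derivativeDropOnDecrease(prev, cur, elapsedSeconds float64) (rate float64, ok bool) {
	if prev > cur {
		return 0, false // point dropped (null in Flux)
	}
	return (cur - prev) / elapsedSeconds, true
}

func main() {
	// Counter reset from 120 down to 10 over 10 seconds.
	if rate, ok := derivativeDropOnDecrease(120, 10, 10); ok {
		fmt.Println(rate)
	} else {
		fmt.Println("no value emitted") // the assume-zero rule would give 1 here
	}
}
```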

matthijskooijman commented 2 years ago

The discussion so far talks about correcting the first and last value in the interval to enforce non-negativity, but shouldn't a more general solution enforce non-negativity between individual measurements? This is not relevant when the input interval matches the aggregation interval, but when multiple measurements are made in each aggregation interval, it could make a difference.

E.g. assume the input values (ignoring timestamps) for a given aggregation period (assuming max_roll_over=0 for simplicity) are [100, 110, 0, 10]. Currently, this, I think, produces a derivative of (10 - 100) / timestamp_diff. The proposals above, I think, suggest that the smallest value should be used as the first value for calculations, producing a derivative of (10 - 0) / timestamp_diff. If non-negativity is applied to each pair of consecutive measurements (omitting negative steps), this would produce a derivative of ((110 - 100) + (10 - 0)) / timestamp_diff, which would be closer to the true value (it would still potentially be too low, since the reset to 0 might have happened at, say, value 120 rather than value 110 as this approach assumes). I also believe that this is essentially what non_negative_diff and non_negative_derivative in Influx do.
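
A self-contained sketch of that per-pair idea (the point type and function name are made up for illustration, not the aggregator's actual code):

```go
package main

import "fmt"

// point is a simplified (time, value) pair used only for this sketch.
type point struct {
	time  float64 // seconds
	value float64
}

// pairwiseNonNegativeDerivative sums only the non-negative differences
// between consecutive measurements in an aggregation interval and divides by
// the total elapsed time. Negative steps (counter resets) are skipped rather
// than subtracted.
func pairwiseNonNegativeDerivative(points []point) (float64, bool) {
	if len(points) < 2 {
		return 0, false
	}
	var sum float64
	for i := 1; i < len(points); i++ {
		if d := points[i].value - points[i-1].value; d > 0 {
			sum += d
		}
	}
	elapsed := points[len(points)-1].time - points[0].time
	return sum / elapsed, true
}

func main() {
	// The example above: [100, 110, 0, 10] at 10 s spacing.
	pts := []point{{0, 100}, {10, 110}, {20, 0}, {30, 10}}
	if rate, ok := pairwiseNonNegativeDerivative(pts); ok {
		fmt.Println(rate) // ((110-100) + (10-0)) / 30 ≈ 0.67
	}
}
```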

Writing this, I realize that the simpler approach suggested by others above is probably also fair (and probably a lot easier to implement). It might underestimate the true value more heavily (though even my suggested approach still does this to some extent), but this only happens on a (usually somewhat rare) counter reset. More importantly, both approaches should at least not produce (potentially very large) negative values, which are really incorrect and more likely to mess up later queries and analysis.