A preliminary version of the counter aggregate was released in 0.2.0 and should be available on Forge early next week.
Split remaining work into #391, #392, and #393.
What's the functionality you would like to add
The counter reset problem outlined in #4 and the rate/delta problem outlined in #8 can be solved, at least in part, by choosing an intermediate aggregate that allows us to do part of the work before having access to the full data set.
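(As a quick illustration of the reset problem: if a counter reports the raw values 1, 5, 9, 2, 6, the drop from 9 to 2 is a reset, and the reset-adjusted series is 1, 5, 9, 11, 15, i.e. the pre-reset value 9 gets added back onto everything after the reset.)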
There are 2 parts to the solution: 1) aggregates with properly designed partial forms that can at least be partialized for disjoint time ranges (ones specific to counters and gauges), and 2) a set of functions designed to work on these partial forms to return the rate/delta and other functions outlined in issue #8.
Proposed aggregates:
- `counter_agg(value, time)`
- `gauge_agg(value, time)`
- `saturating_counter_agg(value, time, reset_val)`
- `counter_agg(counter_agg_partial)` (implicitly performing the combine function on the aggregate, for re-aggregation purposes)

Proposed functions to act on these aggregates:
n.b.: `last_val` is adjusted for resets if it is a `counter_agg` (and not if it is a `gauge_agg`).

- `rate(counter_agg(value, time))` -> the simple observed rate, so `(last_val - first_val) / (last_time - first_time)`
- `delta(counter_agg(value, time))` -> the simple observed delta of the value, `last_val - first_val`
- `time_delta(counter_agg(value, time))` -> the simple observed delta of the time, `last_time - first_time` (useful for determining how well sampled a time range is)
- `extrapolated_rate(counter_agg(value, time), range, method)` -> the rate, but with our values extrapolated out to somewhere near the edges of our bucket (the range provided), extrapolating using the method provided
- `extrapolated_delta(counter_agg(value, time), range, method)`
- `instantaneous_rate(counter_agg(value, time), edge)` -> the instantaneous rate, `(val(n) - val(n-1)) / (time(n) - time(n-1))`, at either the leading or trailing edge of the range (default to trailing as that's Prom's behavior, though it seems a bit weird to `time_bucket`, which provides the value at the beginning of the range, and then provide the value at the end?)
- `instantaneous_delta(counter_agg(value, time), edge)` -> the instantaneous delta, analogous to the above

Technical details on partials
The partial we create would be an intermediate form that stores the first and last values and times for the range, plus a `reset_sum` (at each reset, `reset_value = val_before_reset + val_after_reset`, which is also equal to the `adjusted_counter` value at the last reset in the range).
We need the reset sum stored in order to determine if there was a counter reset when combining adjacent partials: if we were to combine this partial with another, we would use `last - reset_sum` as the `raw_last` value to compare to the `first` value of the next partial in order to determine if we had a counter reset at the partial boundary. As long as the ranges are disjoint and don't have gaps, you can then order by the first value of the range and calculate the full counter reset by essentially doing the normal counter reset calculation with the first value of the current partial and the last value of the preceding one, and then summing the `reset_sum`s.
This form is quite efficient, as it represents an entire range in only 9 values. If a range has only a handful of points it's not a huge savings, but when you are reporting more frequently it can be.
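To make that concrete, one plausible shape for this intermediate form, given the values the functions above need (the type and field names here are hypothetical, not a committed design), is a composite type along these lines:

```sql
-- Hypothetical layout of the counter_agg partial form (the 9 values mentioned above);
-- type and field names are illustrative only.
CREATE TYPE counter_agg_partial AS (
    first_time   timestamptz,
    first_val    double precision,
    second_time  timestamptz,      -- second/penultimate points support the instantaneous_* functions
    second_val   double precision,
    penult_time  timestamptz,
    penult_val   double precision,
    last_time    timestamptz,
    last_val     double precision, -- adjusted for any resets within the range
    reset_sum    double precision  -- total offset added due to resets within the range
);
```

Combining with the next partial then comes down to comparing the next partial's `first_val` against this partial's `last_val - reset_sum` (the raw last value), adding a further reset offset if the former is smaller, and summing the `reset_sum`s.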
The key requirement here is that the time ranges (and any other groups) must be disjoint (non-overlapping, because overlapping time ranges would mean we can't get a consistent ordering and counter resets could easily be double counted) and that there shouldn't be gaps (i.e. you can't just take the data from last week and the data from today and expect to get the right results; you need the data in the middle or you may miss counter resets).
These requirements are met by two of our important cases: continuous aggregates and partitionwise aggregates. In each case the aggregations are guaranteed to be on disjoint sections of data (partitions are guaranteed to be disjoint as long as you group by any hash-partitioned keys, and counter resets really only make sense at the most granular level), and continuous aggregate partials have a disjoint time range and in fact cannot cross chunk boundaries. (The lack of gaps is basically defined by not doing weird things with your where clauses, but you could create similar problems if you did that on the raw data, so we'll leave that to the side for now.)
The basic aggregate would buffer the data from a time range and then, when asked to serialize, would order the data, perform the proper calculations, and serialize this compact form, which can then be stored in a continuous aggregate or sent back to the access node as part of a partitionwise aggregate.
How would the function be used
Continuous aggregate:
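For illustration, the continuous aggregate might look roughly like this (a sketch with hypothetical hypertable, column, and view names, assuming TimescaleDB 2.x continuous aggregate syntax and that `counter_agg` can be used inside one as proposed):

```sql
-- Hypothetical: bucket the raw data hourly and store the compact counter_agg form.
CREATE MATERIALIZED VIEW foo_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       counter_agg(foo, time) AS foo_agg
FROM measurements
GROUP BY 1;
```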
You can then perform re-aggregation without waiting for continuous aggs to support re-aggregation by running:
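A sketch of that re-aggregation (again with hypothetical names), rolling the hourly partials up to daily buckets and computing the rate over them:

```sql
-- Hypothetical: re-aggregate the stored hourly partials to daily buckets, then take the rate.
WITH daily AS (
    SELECT time_bucket('1 day', bucket) AS day,
           counter_agg(foo_agg) AS foo_agg   -- the combine/re-aggregation form
    FROM foo_hourly
    GROUP BY 1
)
SELECT day, rate(foo_agg) AS foo_rate
FROM daily
ORDER BY day;
```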
(This should also work without the with clause; you can nest the function calls if you want to in either case, with something like `SELECT rate(counter_agg(foo, time)) as foo_rate, delta(counter_agg(foo, time)) as foo_delta`, and it should only perform the aggregation once.)

Why should this feature be added? Is your feature request related to a problem? A wishlist item? Please describe.
See #4 & #8
Drawbacks
Counter aggs are not fully parallelizable (I'm not sure there is a fully parallelizable solution). (Gauge aggs are)
Separate aggregates from the functions for evaluation might be slightly confusing to users.
Open Questions
Regression Functions
Should we have separate aggregates that allow for regression analysis like with Prom's `deriv` function? i.e.:

Proposed aggregates:
- `counter_regr(value, time)`
- `gauge_regr(value, time)`
- `saturating_counter_regr(value, time, reset_val)`
- `counter_regr(counter_regr_partial)` (implicitly performing the combine function on the aggregate, for re-aggregation purposes)

The regression portion of the partial form would be the statistics PG uses internally for its statistical aggregates: `N` (sample count), `Sx = sum(X)`, `Sxx = sum((X - Sx/N)^2)`, `Sy = sum(Y)`, `Syy = sum((Y - Sy/N)^2)`, `Sxy = sum((X - Sx/N) * (Y - Sy/N))`. In PG these are all floats; however, we might need to modify that a bit as our X values will be some repr of a timestamp type in many cases (ints might be better for some of these when possible).

Note that applying a constant offset to the values (as we do for counter resets) doesn't disturb the centered terms, since they only involve `(Y - Sy/N)`, which is basically each value subtracted from the average of all values, and if all values are shifted by a constant then the average shifts by the same constant, so it cancels out: i.e. if a constant C is added to each Y, then `(Y - Sy/N)` becomes `((Y + C) - (Sy + NC)/N) = (Y + C - Sy/N - NC/N) = (Y + C - Sy/N - C) = (Y - Sy/N)`.

The values stored for the `counter_regr` aggregate are going to be the same as for the normal `counter_agg` (because we need them in order to do the offsets correctly), except we add in the extra ones for regression. I guess we could remove the second and penultimate values, so that the instantaneous bits won't work in that case, but I think that overhead for it to work for some things but not others seems odd...

Proposed functions to act on these aggregates:
I think we can do all the ones here: https://www.postgresql.org/docs/12/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE (that take Y, X inputs that is) from that intermediate form.
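For example (standard formulas, just to show that the intermediate form is sufficient): `regr_slope(Y, X) = Sxy / Sxx`, `regr_intercept(Y, X) = Sy/N - (Sxy/Sxx) * (Sx/N)`, and `corr(Y, X) = Sxy / sqrt(Sxx * Syy)` can all be computed directly from the six stored values.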
A question is whether we want to just include these by default in our normal aggs; it does take extra space (6 DP numbers), but it makes the resulting intermediate form significantly more useful. We could also just define the regr forms to have all the functionality of the other form plus the regression pieces...
Counter partials, and interaction with PG Aggregates
There are several items that need to be addressed as we define the aggregates for counters here as well:
Types of extrapolation
What types of extrapolation should we have? We'll probably have to have one for Prom specifically that maintains its (somewhat odd IMO) extrapolation behavior: https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L59 . What else should we have? Linear regression (with a counter offset)? If so, that affects whether we want to include the regression forms in the standard internal representation. N.B. the reasoning behind their somewhat odd behavior is discussed in detail in this thread: https://github.com/prometheus/prometheus/pull/1161 ; we should understand that and some of the other choices available when deciding on other methods we might want to implement.
Resulting types
I think we'll want to store the finalized result of the aggs as a defined type in PG; the question here is how we store/treat the different data types that can live in both the time column and the value column. Are we going to have a combinatorial explosion? Do we treat counter/gauge types differently (obviously gauges have reset vals that are always zero and counters don't, but perhaps we don't need separate types)? Does this have an implication for dealing with the extrapolation for the Prom reproduction (or do we just do that with a flag in the extrapolation type so that we know we're dealing with a counter)? Etc.
Other Prom functionality
`resets()` and `changes()` are two Prom functions (defined on counters, and on both counters and gauges, respectively). I'm not sure how often they're used, but it would be simple to keep track of them for counters at least. For gauges, implementing the `changes()` function (`resets` will always be zero) would mean that the `gauge_agg` loses its parallelizability and would have the same requirements and memory use as `counter_agg`.
Alternatives
We could define the functions directly on the value and time, so something like `counter_rate(foo, time)`; however, I think this gets more confusing, especially if you get to a situation where you have a saturating counter and an extrapolated rate, both of which take parameters: it becomes less clear which parameters are referring to the rate step and which to the counter. I also think it's a bit less elegant generally (function names get weird, and it makes it seem like things are different between counters and gauges, which Prom does, but I'm still not quite sure why/whether it's really necessary; we can still have reasonable defaults and explain what things do, but it seems a bit odd).
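To make the comparison concrete, the two styles side by side (reusing the hypothetical `measurements` table from the sketches above):

```sql
-- Proposed style: aggregate first, then apply an accessor function to the aggregate.
SELECT time_bucket('1 hour', time) AS bucket,
       rate(counter_agg(foo, time)) AS foo_rate
FROM measurements
GROUP BY 1;

-- Alternative style: a single combined aggregate per operation.
SELECT time_bucket('1 hour', time) AS bucket,
       counter_rate(foo, time) AS foo_rate
FROM measurements
GROUP BY 1;
```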