A preliminary version of the counter aggregate was released in 0.2.0 and should be available on Forge early next week.
Split remaining work into #391, #392, and #393.
What's the functionality you would like to add
The counter reset problem outlined in #4 and the rate/delta problem outlined in #8 can be solved, at least in part, by choosing an intermediate aggregate that allows us to do part of the work before having access to the full data set.
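(As a quick illustration of the reset problem: if a counter reports the raw values 1, 5, 9, 2, 6, the drop from 9 to 2 is a reset, and the reset-adjusted series is 1, 5, 9, 11, 15, i.e. the pre-reset value 9 gets added back onto everything after the reset.)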
There are 2 parts to the solution: 1) aggregates with properly designed partial forms that can at least be partialized for disjoint time ranges (ones specific to counters and gauges), and 2) a set of functions designed to work on these partial forms to return the rate/delta and other functions outlined in issue #8.
Proposed aggregates:
- `counter_agg(value, time)`
- `gauge_agg(value, time)`
- `saturating_counter_agg(value, time, reset_val)`
- `counter_agg(counter_agg_partial)` (implicitly performing the combine function on the aggregate, for re-aggregation purposes)

Proposed functions to act on these aggregates:
n.b.: `last_val` is adjusted for resets if it is a `counter_agg` (and not if it is a `gauge_agg`).

- `rate(counter_agg(value, time))` -> the simple observed rate, so `(last_val - first_val) / (last_time - first_time)`
- `delta(counter_agg(value, time))` -> the simple observed delta of the value, `last_val - first_val`
- `time_delta(counter_agg(value, time))` -> the simple observed delta of the time, `last_time - first_time` (useful for determining how well sampled a time range is)
- `extrapolated_rate(counter_agg(value, time), range, method)` -> the rate, but with our values extrapolated out to somewhere near the edges of our bucket (the range provided), extrapolating using the method provided
- `extrapolated_delta(counter_agg(value, time), range, method)`
- `instantaneous_rate(counter_agg(value, time), edge)` -> the instantaneous rate, `(val(n) - val(n-1)) / (time(n) - time(n-1))`, at either the leading or trailing edge of the range (default to trailing as that's Prom's behavior, though it seems a bit weird to `time_bucket`, which provides the value at the beginning of the range, and then provide the value at the end?)
- `instantaneous_delta(counter_agg(value, time), edge)` -> the instantaneous delta, analogous to the above

Technical details on partials
The partial we create would be an intermediate form that stores the first and last values and times for the range, plus a `reset_sum` (at each reset, `reset_value = val_before_reset + val_after_reset`, which is also equal to the `adjusted_counter` value at the last reset in the range).
We need the reset sum stored in order to determine if there was a counter reset when combining adjacent partials: if we were to combine this partial with another, we would use `last - reset_sum` as the `raw_last` value to compare to the `first` value of the next partial in order to determine if we had a counter reset at the partial boundary. As long as the ranges are disjoint and don't have gaps, you can then order by the first value of the range and calculate the full counter reset by essentially doing the normal counter reset calculation with the first value of the current partial and the last value of the preceding one, and then summing the `reset_sum`s.
This form is quite efficient, as it represents an entire range in only 9 values. If a range has only a handful of points it's not a huge savings, but when you are reporting more frequently it can be.
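To make that concrete, one plausible shape for this intermediate form, given the values the functions above need (the type and field names here are hypothetical, not a committed design), is a composite type along these lines:

```sql
-- Hypothetical layout of the counter_agg partial form (the 9 values mentioned above);
-- type and field names are illustrative only.
CREATE TYPE counter_agg_partial AS (
    first_time   timestamptz,
    first_val    double precision,
    second_time  timestamptz,      -- second/penultimate points support the instantaneous_* functions
    second_val   double precision,
    penult_time  timestamptz,
    penult_val   double precision,
    last_time    timestamptz,
    last_val     double precision, -- adjusted for any resets within the range
    reset_sum    double precision  -- total offset added due to resets within the range
);
```

Combining with the next partial then comes down to comparing the next partial's `first_val` against this partial's `last_val - reset_sum` (the raw last value), adding a further reset offset if the former is smaller, and summing the `reset_sum`s.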
The key requirement here is that the time ranges (and any other groups) must be disjoint (non-overlapping, because overlapping time ranges would mean we can't get a consistent ordering and counter resets could easily be double counted) and that there shouldn't be gaps (i.e. you can't just take the data from last week and the data from today and expect to get the right results; you need the data in the middle or you may miss counter resets).
These requirements are met by two of our important cases: continuous aggregates and partitionwise aggregates. In each case the aggregations are guaranteed to be on disjoint sections of data (partitions are guaranteed to be disjoint as long as you group by any hash-partitioned keys, and counter resets really only make sense at the most granular level), and continuous aggregate partials have a disjoint time range and in fact cannot cross chunk boundaries. (The lack of gaps is basically defined by not doing weird things with your where clauses, but you could create similar problems if you did that on the raw data, so we'll leave that to the side for now.)
The basic aggregate would buffer the data from a time range and then, when asked to serialize, would order the data, perform the proper calculations, and serialize this compact form, which can then be stored in a continuous aggregate or sent back to the access node as part of a partitionwise aggregate.
How would the function be used
Continuous aggregate:
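For illustration, the continuous aggregate might look roughly like this (a sketch with hypothetical hypertable, column, and view names, assuming TimescaleDB 2.x continuous aggregate syntax and that `counter_agg` can be used inside one as proposed):

```sql
-- Hypothetical: bucket the raw data hourly and store the compact counter_agg form.
CREATE MATERIALIZED VIEW foo_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       counter_agg(foo, time) AS foo_agg
FROM measurements
GROUP BY 1;
```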
You can then perform re-aggregation without waiting for continuous aggs to support re-aggregation by running:
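A sketch of that re-aggregation (again with hypothetical names), rolling the hourly partials up to daily buckets and computing the rate over them:

```sql
-- Hypothetical: re-aggregate the stored hourly partials to daily buckets, then take the rate.
WITH daily AS (
    SELECT time_bucket('1 day', bucket) AS day,
           counter_agg(foo_agg) AS foo_agg   -- the combine/re-aggregation form
    FROM foo_hourly
    GROUP BY 1
)
SELECT day, rate(foo_agg) AS foo_rate
FROM daily
ORDER BY day;
```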
(This should also work without the with clause; you can nest the function calls if you want to in either case, with something like `SELECT rate(counter_agg(foo, time)) as foo_rate, delta(counter_agg(foo, time)) as foo_delta`, and it should only perform the aggregation once.)

Why should this feature be added? Is your feature request related to a problem? A wishlist item? Please describe.
See #4 & #8
Drawbacks
Counter aggs are not fully parallelizable (I'm not sure there is a fully parallelizable solution). (Gauge aggs are)
Separate aggregates from the functions for evaluation might be slightly confusing to users.
Open Questions
Regression Functions
Should we have separate aggregates that allow for regression analysis like with Prom's `deriv` function? i.e.:

Proposed aggregates:
- `counter_regr(value, time)`
- `gauge_regr(value, time)`
- `saturating_counter_regr(value, time, reset_val)`
- `counter_regr(counter_regr_partial)` (implicitly performing the combine function on the aggregate, for re-aggregation purposes)

The regression portion of the partial form would be the statistics PG uses internally for its statistical aggregates: `N` (sample count), `Sx = sum(X)`, `Sxx = sum((X - Sx/N)^2)`, `Sy = sum(Y)`, `Syy = sum((Y - Sy/N)^2)`, `Sxy = sum((X - Sx/N) * (Y - Sy/N))`. In PG these are all floats; however, we might need to modify that a bit as our X values will be some repr of a timestamp type in many cases (ints might be better for some of these when possible).

Note that applying a constant offset to the values (as we do for counter resets) doesn't disturb the centered terms, since they only involve `(Y - Sy/N)`, which is basically each value subtracted from the average of all values, and if all values are shifted by a constant then the average shifts by the same constant, so it cancels out: i.e. if a constant C is added to each Y, then `(Y - Sy/N)` becomes `((Y + C) - (Sy + NC)/N) = (Y + C - Sy/N - NC/N) = (Y + C - Sy/N - C) = (Y - Sy/N)`.

The values stored for the `counter_regr` aggregate are going to be the same as for the normal `counter_agg` (because we need them in order to do the offsets correctly), except we add in the extra ones for regression. I guess we could remove the second and penultimate values, so that the instantaneous bits won't work in that case, but I think that overhead for it to work for some things but not others seems odd...

Proposed functions to act on these aggregates:
I think we can do all the ones here: https://www.postgresql.org/docs/12/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE (that take Y, X inputs that is) from that intermediate form.
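For example (standard formulas, just to show that the intermediate form is sufficient): `regr_slope(Y, X) = Sxy / Sxx`, `regr_intercept(Y, X) = Sy/N - (Sxy/Sxx) * (Sx/N)`, and `corr(Y, X) = Sxy / sqrt(Sxx * Syy)` can all be computed directly from the six stored values.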
A question is whether we want to just include these by default in our normal aggs; it does take extra space (6 DP numbers), but it makes the resulting intermediate form significantly more useful. We could also just define the regr forms to have all the functionality of the other form plus the regression pieces...
Counter partials, and interaction with PG Aggregates
There are several items that need to be addressed as we define the aggregates for counters here as well:
Types of extrapolation
What types of extrapolation should we have? We'll probably have to have one for Prom specifically that maintains its (somewhat odd IMO) extrapolation behavior: https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L59 . What else should we have? Linear regression (with a counter offset)? If so, that affects whether we want to include the regression forms in the standard internal representation. N.B. the reasoning behind their somewhat odd behavior is discussed in detail in this thread: https://github.com/prometheus/prometheus/pull/1161 ; we should understand that and some of the other choices available when deciding on other methods we might want to implement.
Resulting types
I think we'll want to store the finalized result of the aggs as a defined type in PG; the question here is how we store/treat the different data types that can live in both the time column and the value column. Are we going to have a combinatorial explosion? Do we treat counter/gauge types differently (obviously gauges have reset vals that are always zero and counters don't, but perhaps we don't need separate types)? Does this have an implication for dealing with the extrapolation for the Prom reproduction (or do we just do that with a flag in the extrapolation type so that we know we're dealing with a counter)? Etc.
Other Prom functionality
`resets()` and `changes()` are two Prom functions (defined on counters, and on both counters and gauges, respectively). I'm not sure how often they're used, but it would be simple to keep track of them for counters at least. For gauges, implementing the `changes()` function (`resets` will always be zero) would mean that the `gauge_agg` loses its parallelizability and would have the same requirements and memory use as `counter_agg`.
Alternatives
We could define the functions directly on the value and time, so something like `counter_rate(foo, time)`; however, I think this gets more confusing, especially if you get to a situation where you have a saturating counter and an extrapolated rate, both of which take parameters: it becomes less clear which parameters are referring to the rate step and which to the counter. I also think it's a bit less elegant generally (function names get weird, and it makes it seem like things are different between counters and gauges, which Prom does, but I'm still not quite sure why/whether it's really necessary; we can still have reasonable defaults and explain what things do, but it seems a bit odd).
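To make the comparison concrete, the two styles side by side (reusing the hypothetical `measurements` table from the sketches above):

```sql
-- Proposed style: aggregate first, then apply an accessor function to the aggregate.
SELECT time_bucket('1 hour', time) AS bucket,
       rate(counter_agg(foo, time)) AS foo_rate
FROM measurements
GROUP BY 1;

-- Alternative style: a single combined aggregate per operation.
SELECT time_bucket('1 hour', time) AS bucket,
       counter_rate(foo, time) AS foo_rate
FROM measurements
GROUP BY 1;
```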