This was already considered in https://github.com/OpenObservability/OpenMetrics/issues/64#issuecomment-338722816, it has fatal precision issues unfortunately.
Thanks for verifying; the discussion in #64 had not been perfectly clear to me. Consider hinting, in the text I've quoted, that the sum of squared deviations was also considered (a reading of the current paragraph might lead one to believe that the "sum of squares" it mentions refers to the squares of the observations).
Before I close this, let me kindly ask you: I know that you have fairly opinionated views on what quantities Prometheus should be tracking. I'm curious to know whether you see PromQL as being a good tool for combining multiple counters. In the case of the variance/stddev, assuming we're tracking:
- M2 as the sum of squared deviations
- n as the number of observations
- mean as the mean of the observations

then, to compute the M2 over the window [t1, t2], we need the following formula. A couple of notations first:

- n1 = n at time t1
- n2 = n at time t2, minus n1
- m1 = the mean at time t1
- m2 = the mean over the window [t1, t2]
- M2_t1t2 = the sum of squared deviations from m2 over the window [t1, t2]

then:

M2_t1t2 = M2_t2 - M2_t1 - n1 * (m1)^2 - n2 * (m2)^2 + (n1 * m1 + n2 * m2)^2 / (n1 + n2)
My question is whether PromQL can express this (assuming you don't care about the precision issues). PromQL has subqueries, so I believe the m2 and n2 calculations can be done that way. But one problem is that I would like my M2_t1t2 calculation to deal with counter resets (in particular, resets of the M2 counter). For that, it seems to me that the whole calculation should be done in a built-in function that internally uses the right formulas for stitching things up at the reset points. Do you have any thoughts about whether PromQL can be used for these calculations as is, or about the opportunity of adding stddev as a reset-correcting built-in?
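For concreteness, here is a minimal Go sketch of the formula above; windowM2 is a hypothetical helper of mine, not part of any existing Prometheus API:

```go
package main

import "fmt"

// windowM2 computes the sum of squared deviations over the window (t1, t2]
// from cumulative snapshots. m2T1 and m2T2 are the cumulative M2 values at
// t1 and t2; n1 is the observation count at t1; n2 is the count of new
// observations in the window; m1 is the mean at t1; m2 is the mean of the
// window's observations.
func windowM2(m2T1, m2T2, n1, n2, m1, m2 float64) float64 {
	total := n1*m1 + n2*m2 // sum of all observations up to t2
	return m2T2 - m2T1 - n1*m1*m1 - n2*m2*m2 + total*total/(n1+n2)
}

func main() {
	// Worked example from later in this thread: observations [1, 1, 10, 10],
	// with the window covering the last two. The last two observations are
	// equal, so the expected window M2 (and stddev) is 0.
	fmt.Println(windowM2(0, 81, 2, 2, 1, 10)) // prints 0
}
```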
Don't track the mean, track the sum of all observations as that should be more accurate numerically and you can use the count to divide.
PromQL can combine counters and you don't need subqueries to do it; precision and other numerical stuff remains an issue. There's a much simpler way here, as M2_t1t2 is rate(sum_of_squared_deviations[duration]).
We had agreed to make stddev doable in OpenMetrics, until the precision issues became apparent.
> Don't track the mean, track the sum of all observations as that should be more accurate numerically and you can use the count to divide.
Indeed; I was just taking shortcuts in my writeup.
> There's a much simpler way here as M2_t1t2 is rate(sum_of_squared_deviations[duration]).
Hmm, this doesn't sound right. Let's check that we're on the same page about what needs to be computed. Let's take an example of four measurements, taken at the 1m, 2m, 3m, and 4m marks: [1, 1, 10, 10].
t1=1m: observation: 1; running sum: 1; running mean: 1; running M2: 0
t2=2m: observation: 1; running sum: 2; running mean: 1; running M2: 0
t3=3m: observation: 10; running sum: 12; running mean: 4; running M2: (1-4)^2 + (1-4)^2 + (10-4)^2 = 9+9+36 = 54
t4=4m: observation: 10; running sum: 22; running mean: 5.5; running M2: (1-5.5)^2 + (1-5.5)^2 + (10-5.5)^2 + (10-5.5)^2 = 4*(4.5)^2 = 81
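(For reference, a minimal Go sketch of how these running values can be maintained in a single pass; this is Welford's method, which comes up elsewhere in this thread:)

```go
package main

import "fmt"

// Welford's streaming update: maintain the count, the running mean, and M2
// (the sum of squared deviations from the running mean).
type runningStats struct {
	n    float64
	mean float64
	m2   float64
}

func (r *runningStats) observe(x float64) {
	r.n++
	delta := x - r.mean
	r.mean += delta / r.n
	r.m2 += delta * (x - r.mean) // note: uses the already-updated mean
}

func main() {
	var r runningStats
	for _, x := range []float64{1, 1, 10, 10} {
		r.observe(x)
		fmt.Printf("n=%g mean=%g M2=%g\n", r.n, r.mean, r.m2)
	}
	// Output matches the table above:
	// n=1 mean=1 M2=0
	// n=2 mean=1 M2=0
	// n=3 mean=4 M2=54
	// n=4 mean=5.5 M2=81
}
```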
Let's say we want the stddev of the last two observations. The result we're looking for is 0.
For that, we need to compute M2_3,4, as the sum of squared deviations over the last two observations. The deviations are from the mean of those two observations (which is 10).
You say "M2_t1t2 is rate(sum_of_squared_deviations[duration])", but I don't see how that works out. We need M2_t1t2 to be 0.
You need a correction factor, because you have to account for the fact that the mean you're computing against is not a quantity that was directly incorporated into the running M2s. The correct formula, I believe, is the one from my last message (I can share the math, if at all useful):
M2_t1t2 = M2_t2 - M2_t1 - n1 * (m1)^2 - n2 * (m2)^2 + (n1 * m1 + n2 * m2)^2 / (n1 + n2)
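(For completeness, here is the algebra behind the correction terms; they reduce to the standard identity for combining sums of squared deviations of two disjoint sets:

$$
-n_1 m_1^2 - n_2 m_2^2 + \frac{(n_1 m_1 + n_2 m_2)^2}{n_1 + n_2} = -\frac{n_1 n_2}{n_1 + n_2}(m_1 - m_2)^2,
$$

so the formula is equivalent to

$$
M2_{[t_1,t_2]} = M2_{t_2} - M2_{t_1} - \frac{n_1 n_2}{n_1 + n_2}(m_1 - m_2)^2.
$$
)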
Indeed, this computes to the expected 0:
For our purposes, t1 is 2m59s and t2 is 4m. M2_t1 = 0, M2_t2 = 81.
- n1 is the number of observations at t1 = 2
- n2 is the number of observations at t2 minus n1 = 2
- m1 is the mean of the first n1 observations = 1
- m2 is the mean of the last n2 observations = 10

M2_t1t2 = 81 - 0 - 2*1^2 - 2*10^2 + (2*1 + 2*10)^2 / 4 = 0
Did I misunderstand what you were saying?
I think I misunderstood what you were saying; that seems about right. The problem you'll face there is that you can't easily access those values; you really want something you can get from rate(). Thus a cumulative sum of squares is best. A sum of squared deviations will run into precision issues a bit less quickly, but not in a way that helps in practice.
Thanks, Brian.
For my edification, if you do want to go with the "sum of squared deviations" approach (as opposed to the "sum of squares of observations" approach), then do you agree that PromQL subqueries can be used to implement our formula?
Alternatively, I'm curious to know if there's any precedent for a PromQL built-in function that uses multiple timeseries, not just one (so, in this case, such a function would look like stddev(n, sum, sum_of_squared_deviations), thus operating on the three counters). I'm asking because I'm thinking about forking the Prometheus codebase (rather, in the form of the Promscale codebase, which I think took a lot of Prometheus code) and getting it to work for CockroachDB's internal needs. In this fork, we'd have the flexibility of both storing better precision than int64s and floats, and also the option to implement new PromQL built-in functions.
I don't think that you need subqueries offhand; however, it'll be very difficult to get it to work correctly in the general case given the mix of counters and gauges. As I said, a cumulative sum of squares will be far simpler to make work correctly, though you'll need 128-bit floats in any case, as the squaring halves the effective bits available.
There's none for three, only for two, and those have to be binary operators rather than functions so that you can do vector matching.
sum_of_squared_deviations is a gauge.
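(To illustrate the cumulative sum-of-squares approach described above: given the window deltas of three cumulative counters (count, sum, and sum of squares), the window stddev falls out of E[x^2] - E[x]^2. A minimal Go sketch, with hypothetical names:)

```go
package main

import (
	"fmt"
	"math"
)

// windowStddev computes the population standard deviation over a window from
// the deltas of three cumulative counters across that window: dCount (number
// of observations), dSum (sum of observations), and dSumSq (sum of squared
// observations). These deltas are the kind of quantity rate()/increase()
// produces for each counter.
func windowStddev(dCount, dSum, dSumSq float64) float64 {
	mean := dSum / dCount
	// This difference of two large, nearly equal terms is where the
	// precision issue discussed in this thread bites.
	return math.Sqrt(dSumSq/dCount - mean*mean)
}

func main() {
	// The last two observations of the running example: [10, 10].
	fmt.Println(windowStddev(2, 20, 200)) // prints 0
}
```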
Thanks!
I want to share an observation about maintaining the standard deviation of a set over time in a Prometheus-friendly way, having recently done some research about it. I don't claim any novel insight; I just want to check whether something is common knowledge in this group.
The standard has this to say about the Standard Deviation:
This stanza talks about a "sum of squares", and complains about such a sum's numerical properties. A numerically better quantity to maintain over time is the sum of squared deviations from the mean, which can be maintained in a streaming fashion through Welford's method. Such a sum was referenced in https://github.com/OpenObservability/OpenMetrics/issues/64, when discussing how the standard should relate to the stddev. That discussion seems to have had two quibbles with this quantity: 1) that such a sum is not "subtractable" across time windows the way a counter is, and 2) that a stddev computed over a window of time is not particularly useful.
I'm not sure I agree with 2), since my organization does maintain such sums over some windows of time and finds them useful. But what I want to discuss more is 1). This was not clear to me until I did the algebra, but, if you maintain three quantities over time:

- n, the number of observations
- the sum (or mean) of the observations
- M2, the sum of squared deviations from the mean

and if you sample these quantities at t1 and then at t2, you can compute the mean over [t1, t2] and also the sum of squared deviations from the [t1, t2] mean. So, the sum of squared deviations is "subtractable", like a counter, with the caveat that you have to use the other two counters when doing the subtraction.

Was this a well-known fact, or does it affect anyone's thinking about whether and to what extent the OpenMetrics standard should say anything about the stddev? Sorry for the noise if this doesn't contribute anything.