prometheus / OpenMetrics

Evolving the Prometheus exposition format into a standard.
https://openmetrics.io
Apache License 2.0

stddev can be computed from counters #245

Closed andreimatei closed 2 years ago

andreimatei commented 2 years ago

I want to share an observation about maintaining the standard deviation of a set over time in a Prometheus-friendly way, having recently done some research about it. I don't claim any novel insight; I just want to check whether something is common knowledge in this group.

The standard has this to say about the Standard Deviation:

Standard deviation also falls into this category. Exposing a sum of squares as a counter would be the correct approach. It was not included in this standard as a Histogram value because 64bit floating point precision is not sufficient for this to work in practice. Due to the squaring only half the 53bit mantissa would be available in terms of precision. As an example a histogram observing 10k events per second would lose precision within 2 hours. Using 64bit integers would be no better due to the loss of the floating decimal point because a nanosecond resolution integer typically tracking events of a second in length would overflow after 19 observations. This design decision can be revisited when 128bit floating point numbers become common.
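As an aside, the arithmetic behind the quoted precision claims can be checked with a few lines of Python (reading "64bit integers" as unsigned is my assumption; it is what makes the 19-observation figure work out):

```python
# float64 has a 53-bit mantissa, so integer increments above 2**53
# are silently absorbed -- this is the "half the mantissa" concern,
# since squaring doubles the magnitude (in bits) of each term.
assert 2.0**53 + 1 == 2.0**53

# With nanosecond-resolution observations of roughly one second each,
# every squared term is (1e9)**2 = 1e18. An unsigned 64-bit sum
# (max 2**64 - 1, about 1.84e19) overflows on the 19th observation.
ns = 10**9
assert 18 * ns**2 <= 2**64 - 1   # 18 observations still fit
assert 19 * ns**2 > 2**64 - 1    # the 19th overflows
```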

The quoted stanza talks about a "sum of squares", and complains about such a sum's numerical properties. A numerically better quantity to maintain over time is the sum of squared deviations from the mean, which can be maintained in a streaming fashion through Welford's method. Such a sum was referenced in https://github.com/OpenObservability/OpenMetrics/issues/64 when discussing how the standard should relate to the stddev. That discussion seems to have had two quibbles with this quantity:

  1. That the sum of squared deviations "is not a counter".
  2. That it will still lose precision too quickly.

I'm not sure I agree with 2), since my organization does maintain such sums over some windows of time and finds them useful. But what I want to discuss more is 1). This was not clear to me until I did the algebra, but if you maintain three quantities over time:

  1. the number of elements in the set, n
  2. the sum (or mean)
  3. the sum of squared deviations from the mean

and if you sample these quantities at t1 and then at t2, you can compute the mean over [t1,t2] and also the sum of squared deviations from the ([t1,t2]) mean. So, the sum of squared deviations is "subtractable", like a counter - with the caveat that you have to use the other two counters when doing the subtraction.
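To make the "subtractable" claim concrete, here is a minimal Python sketch (the names and structure are mine, not from any Prometheus client library): Welford's method maintains the three counters, and two snapshots are combined using the standard variance-merge identity, which is algebraically equivalent to the subtraction described above.

```python
import math

class Welford:
    """Streaming (count, sum, M2) via Welford's method; M2 is the
    running sum of squared deviations from the running mean."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def snapshot(self):
        # The three "counters": count, sum, sum of squared deviations.
        return (self.n, self.n * self.mean, self.m2)

def window_m2(snap1, snap2):
    """Sum of squared deviations over the window between two snapshots,
    i.e. the 'subtraction' described above."""
    n1, s1, M2_1 = snap1
    n_tot, s_tot, M2_tot = snap2
    n2, s2 = n_tot - n1, s_tot - s1
    mean1, mean2 = s1 / n1, s2 / n2
    # merge identity:
    #   M2_tot = M2_1 + M2_window + n1*n2/(n1+n2) * (mean1 - mean2)^2
    return M2_tot - M2_1 - n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2

w = Welford()
for x in [1, 1]:
    w.observe(x)
snap1 = w.snapshot()    # counters sampled at t1
for x in [10, 10]:
    w.observe(x)
snap2 = w.snapshot()    # counters sampled at t2

m2_win = window_m2(snap1, snap2)    # M2 of the window [10, 10]
stddev = math.sqrt(m2_win / 2)
print(m2_win, stddev)               # -> 0.0 0.0
```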

Was this a well-known fact, or does it affect anyone's thinking about what, if anything, the OpenMetrics standard should say about the stddev? Sorry for the noise if this doesn't contribute anything.

brian-brazil commented 2 years ago

This was already considered in https://github.com/OpenObservability/OpenMetrics/issues/64#issuecomment-338722816, it has fatal precision issues unfortunately.

andreimatei commented 2 years ago

Thanks for verifying; the discussion in #64 had not been perfectly clear to me. Consider hinting, in the text I quoted, at the fact that the sum of squared deviations was considered (reading the current paragraph, one might believe that the "sum of squares" it mentions is the sum of squares of the observations).

Before I close this, let me kindly ask you something. I know that you have fairly opinionated views on what quantities Prometheus should be tracking; I'm curious whether you see PromQL as a good tool for combining multiple counters. In the case of the variance/stddev, assume we're tracking the three counters from above (the count n, the sum, and the running sum of squared deviations M2), and adopt a couple of notations: n1 and m1 are the count and mean of the observations up to t1; n2 and m2 are the count and mean of the observations in (t1, t2]; M2_t is the running sum of squared deviations sampled at time t. Then:

M2_t1t2 = M2_t2 - M2_t1 - n1 * (m1)^2 - n2 * (m2)^2 + (n1 * m1 + n2 * m2)^2 / (n1 + n2)

My question is whether PromQL can express this (assuming you don't care about the precision issues). PromQL has subqueries, so I believe the n2 and m2 calculations can be done that way. But one problem is that I would like my M2_t1t2 calculation to deal with counter resets (in particular, resets of the M2 counter). For that, it seems to me that the whole calculation would have to be done in a built-in function that internally uses the right formulas for stitching things up at the reset points. Do you have any thoughts on whether PromQL can be used for these calculations as is, or on whether it would be worth adding stddev as a reset-correcting built-in?

brian-brazil commented 2 years ago

Don't track the mean, track the sum of all observations as that should be more accurate numerically and you can use the count to divide.

PromQL can combine counters and you don't need subqueries to do it, precision and other numerical stuff remains an issue. There's a much simpler way here as M2_t1t2 is rate(sum_of_squared_deviations[duration]).

We had agreed to make stddev doable in OpenMetrics, until the precision issues became apparent.

andreimatei commented 2 years ago

Don't track the mean, track the sum of all observations as that should be more accurate numerically and you can use the count to divide.

Indeed; I was just taking shortcuts in my writeup.

There's a much simpler way here as M2_t1t2 is rate(sum_of_squared_deviations[duration]).

Hmm, this doesn't sound right. Let's check that we're on the same page about what needs to be computed. Let's take an example of four measurements, taken at the 1m, 2m, 3m, and 4m marks: [1, 1, 10, 10].

t1=1m: observation: 1; running sum: 1; running mean: 1; running M2: 0
t2=2m: observation: 1; running sum: 2; running mean: 1; running M2: 0
t3=3m: observation: 10; running sum: 12; running mean: 4; running M2: (1-4)^2 + (1-4)^2 + (10-4)^2 = 9+9+36 = 54
t4=4m: observation: 10; running sum: 22; running mean: 5.5; running M2: (1-5.5)^2 + (1-5.5)^2 + (10-5.5)^2 + (10-5.5)^2 = 4*(4.5)^2 = 81

Let's say we want the stddev of the last two observations; the result we're looking for is 0. For that, we need to compute M2_t1t2, the sum of squared deviations over the last two observations, where the deviations are from the mean of those two observations (which is 10). You say "M2_t1t2 is rate(sum_of_squared_deviations[duration])", but I don't see how that works out; we need M2_t1t2 to be 0. You need a correction factor, because you have to account for the fact that the mean you're computing against is not a quantity that was directly incorporated into the running M2s. The correct formula, I believe, is the one from my last message (I can share the math, if at all useful):

M2_t1t2 = M2_t2 - M2_t1 - n1 * (m1)^2 - n2 * (m2)^2 + (n1 * m1 + n2 * m2)^2 / (n1 + n2)

Indeed, this computes to the expected 0. For our purposes, t1 is 2m59s and t2 is 4m, so M2_t1 = 0 and M2_t2 = 81.

  n1 is the number of observations at t1 = 2
  n2 is the number of observations at t2 minus n1 = 2
  m1 is the mean over the first n1 observations = 1
  m2 is the mean over the last n2 observations = 10

M2_t1t2 = 81 - 0 - 2*1^2 - 2*10^2 + (2*1 + 2*10)^2 / 4 = 81 - 2 - 200 + 121 = 0
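The same computation, as a few lines of Python for anyone who wants to check the arithmetic (variable names mirror the formula):

```python
n1, m1 = 2, 1        # count and mean of the observations up to t1
n2, m2 = 2, 10       # count and mean of the observations in (t1, t2]
M2_t1, M2_t2 = 0, 81

M2_t1t2 = (M2_t2 - M2_t1 - n1 * m1**2 - n2 * m2**2
           + (n1 * m1 + n2 * m2)**2 / (n1 + n2))
print(M2_t1t2)   # -> 0.0, the stddev of the last two observations is 0
```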

Did I misunderstand what you were saying?

brian-brazil commented 2 years ago

I think I misunderstood what you were saying; that seems about right. The problem you'll face there is that you can't easily access those values - you really want something you can get from rate(). Thus a cumulative sum of squares is best. A sum of squared deviations will run into precision issues a bit less quickly, but not in a way that helps in practice.


andreimatei commented 2 years ago

Thanks, Brian. For my edification: if you do want to go with the "sum of squared deviations" approach (as opposed to the "sum of squares of observations" approach), do you agree that PromQL subqueries can be used to implement our formula? Alternatively, I'm curious whether there's any precedent for a PromQL built-in function that operates on multiple timeseries, not just one (in this case, such a function would look like stddev(n, sum, sum_of_squared_deviations), operating on the three counters). I'm asking because I'm thinking about forking the Prometheus codebase (or rather the Promscale codebase, which I believe took a lot of Prometheus code) and getting it to work for CockroachDB's internal needs. In such a fork, we'd have the flexibility of both storing values at better precision than int64s and floats, and of implementing new PromQL built-in functions.

brian-brazil commented 2 years ago

I don't think that you need subqueries offhand; however, it'll be very difficult to get it to work correctly in the general case, given the mix of counters and gauges. As I said, a cumulative sum of squares will be far simpler to make work correctly, though you'll need 128bit floats in any case as the squaring halves the effective bits available.

There's none for three, only for two which have to be binary operators rather than functions so you can do vector matching.

sum_of_squared_deviations is a gauge.


andreimatei commented 2 years ago

Thanks!