grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/

Feature Request: Exponentially Weighted Incremental Mean and Variance #28

Open alanmazankiewicz opened 3 years ago

alanmazankiewicz commented 3 years ago

Hi, I'd like to propose implementing an exponentially weighted mean and variance calculation as running statistics over streaming data, based on the following paper: Finch (2009), Incremental Calculation of Weighted Mean and Variance, Eq. 124 (mean) and Eq. 143 (variance). It derives incremental formulations for the exponential mean and variance. If you are interested, I'd be happy to implement it and submit a PR.
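
For reference, paraphrasing the recurrences from the paper (with smoothing factor $\alpha \in (0, 1)$ weighting the newest observation $x_n$):

$$\mu_n = \mu_{n-1} + \alpha\,(x_n - \mu_{n-1}), \qquad \sigma_n^2 = (1 - \alpha)\left(\sigma_{n-1}^2 + \alpha\,(x_n - \mu_{n-1})^2\right)$$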

grantjenks commented 3 years ago

I’m open to it.

Can you explain what an exponentially weighted mean and variance would be used for? I’m not familiar with the applications.

Also please describe the API changes before doing the implementation.

alanmazankiewicz commented 3 years ago

Sure,

so first, what is the idea of the exponential moving average (EMA)? When you have a (theoretically unbounded) stream of data, you are often not (only) interested in a mean/variance over the entire stream (since the first value arrived) but over "the recent past" (let's keep the exact meaning abstract for now). For example, if you start running now and track the pace of each run over the next 5 years, you might be interested in knowing your average pace over multiple runs. A mean/variance statistic over the entire 5 years might not be that meaningful, though, since you improved your pace over time. You might be more interested in your "current" pace (e.g. the last 6 months), as it reflects your current performance more accurately. If all data is saved, you could just calculate the mean/variance over all runs of the last 6 months. A "smoother" approach is calculating a weighted average where old values are weighted exponentially lower than more recent ones. Instead of calculating a weight for each value yourself, you just choose one discount factor (often termed alpha).

Coming to the applications, one big application is in finance and risk management (often in trading) (see e.g. https://www.investopedia.com/terms/e/ema.asp, which also describes the concept in more depth). Another one is predictive maintenance, where one monitors sensor measurements describing the state of a machine (e.g. current pressure, temperature, etc.). Since these measurements can vary quite frequently, one rather monitors a moving average over the measurements, which is more stable. Here the concept of EMA comes in handy since we are interested in the "current" state, not the state over the entire lifetime. Also, in predictive maintenance the data stream is of high velocity (a lot of data arriving in a very short period of time). Therefore, it is not very feasible to store all the data, which is where the incremental formulation comes in quite handy.

Regarding the API changes, I would have to look into your code in more depth to judge that. I'll come back to you about that later. First, I wanted to check whether there is general interest in a PR.

alanmazankiewicz commented 3 years ago

About the API changes: The running exponential mean and running exponential variance require a parameter alpha ∈ (0, 1). Therefore, the Statistics(iterable=()) constructor would need to be modified to accept an optional parameter alpha: Statistics(iterable=(), alpha=None). By making this parameter optional, users of the current version, or users who are not interested in exponential statistics, don't encounter any changes in their usage. So, in the default case None, the functionality wouldn't change. Otherwise, if a suitable float value is supplied, on each .push() we also perform the incremental update of the new member variables exponential_eta and exponential_rho as described in the Finch paper. To return the current values of the exponential statistics, methods self.exponential_mean(), self.exponential_variance(), and self.exponential_stddev() would be implemented, quite analogously to the current (non-exponential) implementation. When making a Statistics object from state, I would handle both cases seamlessly: (1) someone supplying the 7 current member variables, (2) someone supplying the 7 + 3 new member variables.
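
To make that concrete, here is a rough sketch of what the update step could look like. This is illustrative only, not a final implementation: it just mirrors the incremental recurrences from the Finch paper using the member names described above.

```python
# Illustrative sketch of the proposed optional-alpha extension (not the final API).
class Statistics:
    def __init__(self, iterable=(), alpha=None):
        self._alpha = alpha
        self._exponential_eta = 0.0   # exponentially weighted mean
        self._exponential_rho = 0.0   # exponentially weighted variance
        # ... the existing (non-exponential) members would be initialised here ...
        for value in iterable:
            self.push(value)

    def push(self, value):
        # ... the existing non-exponential update would run here ...
        if self._alpha is not None:
            diff = value - self._exponential_eta
            incr = self._alpha * diff
            self._exponential_eta += incr
            self._exponential_rho = (1.0 - self._alpha) * (self._exponential_rho + diff * incr)

    def exponential_mean(self):
        return self._exponential_eta

    def exponential_variance(self):
        return self._exponential_rho

    def exponential_stddev(self):
        return self.exponential_variance() ** 0.5
```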

grantjenks commented 3 years ago

Interesting. So there are exponential versions of mean, variance, etc. Since the calculations of those seem mostly independent of the non-exponential versions, I think it should be a different data type, maybe: ExponentialStatistics. Then the methods can use the simpler mean(), variance(), etc. names. Maybe there’s even some benefit from duck typing.

alanmazankiewicz commented 3 years ago

All right, I'll implement it as another class.

grantjenks commented 3 years ago

Thanks. Pull request welcome.

alanmazankiewicz commented 3 years ago

Great. I was wondering: some methods, e.g. from_state(), __eq__(), and __ne__(), would be the same for both classes. Maybe it would be good to create an abstract base class that both classes inherit from and refactor the common functionality?

grantjenks commented 3 years ago

I wouldn’t bother with that initially. It could be a good refactoring but I want to update the way Cython is used first (it may conflict with the refactoring).

alanmazankiewicz commented 3 years ago

All right, fair enough.

grantjenks commented 3 years ago

I merged the code and spent some time with it last night. Two thoughts this morning:

  1. The name Exponential Statistics is a bit confusing. When I Google about it, everything is about exponential distributions. The phrase “exponential statistics” doesn’t seem to be a thing. I wonder if “exponential decay statistics” or simply “decay statistics” would be better.
  2. Seems like the "decay" could vary for every data point. I wonder if there's a way to create a time-series-based exponential decay. I'm imagining an API where the decay is specified in seconds. Then the time the last data point was pushed would be recorded, and when the next data point is pushed, the decay would be calculated as a function of the number of seconds elapsed. Does that make sense? It would allow for keeping a running average and emphasizing data points in the last "n" seconds.

alanmazankiewicz commented 3 years ago

  1. I see, you're right. What about ExponentialMovingStatistics? If you google for that, you find the concept immediately.

  2. I guess I get it. You want to decay the old mean/variance more, the older they are w.r.t. time. So there would be a "base decay rate" that the user sets, and each time one pushes a new value an "effective decay rate" gets dynamically calculated from the elapsed time (microseconds / milliseconds / seconds / minutes / hours) by raising the base rate to the power of that time. The effective decay rate is then used for calculating the new statistics. So if the base rate is 0.9 and 2 seconds elapsed between the current value and the last value, the effective decay rate would be 0.9^2 = 0.81. The old statistics get discounted by 0.81 and the new value gets weighted with 0.19. The next value arrives 5 seconds later, so the effective decay rate is 0.9^5 ... and so on.

I think that's a good idea; it handles the use case where data comes in at a non-constant rate. I'd be happy to implement it.

So I imagine the following API change: ExponentialStatistics(decay=0.9, initial_mean=0.0, initial_variance=0.0, iterable=(), time_based_decay=None). In the case of None, the class works as it does now (without time-based discounting). I think it is important to let the user decide what their base unit of time should be (seconds, minutes, etc.), as this may vary greatly with the use case; in some applications data may arrive every second, in others every millisecond, etc. Theoretically, one could adjust for it by setting the base decay rate appropriately, but I don't think that is very user friendly. I guess a simple solution is to pass in a string ("microseconds" / "milliseconds" / "seconds" / "minutes" / "hours") (on my laptop a push using the fast implementation takes about 2 microseconds, so going below that probably doesn't make much sense?). So in the case of "seconds", a 30-second difference leads to decay^30, while in the case of "minutes" it leads to decay^0.5. One could additionally let the user pass a function that takes two timestamps and outputs the appropriate value, but I am not sure if this is really necessary. What do you think?
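
Roughly, the unit handling I have in mind looks like this (just a placeholder helper to illustrate the idea, not a proposed API):

```python
# Placeholder helper (not part of runstats): convert elapsed wall-clock time
# into an effective decay rate, given the user's chosen base unit of time.
_UNIT_SECONDS = {
    'microseconds': 1e-6,
    'milliseconds': 1e-3,
    'seconds': 1.0,
    'minutes': 60.0,
    'hours': 3600.0,
}

def effective_decay(decay, elapsed_seconds, unit='seconds'):
    """Raise the base decay rate to the elapsed time expressed in `unit`."""
    return decay ** (elapsed_seconds / _UNIT_SECONDS[unit])

# Examples matching the discussion above:
# effective_decay(0.9, 2)              -> 0.9 ** 2   == 0.81
# effective_decay(0.9, 30, 'minutes')  -> 0.9 ** 0.5
```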

alanmazankiewicz commented 3 years ago

By the way, there is also an Exponentially Weighted Moving Covariance that could be implemented: https://stats.stackexchange.com/questions/6874/exponential-weighted-moving-skewness-kurtosis

The post actually deals with exponentially weighted higher moments, which still seem to be an open question, but in the EDIT of the original post there is a formula for the incremental exponentially weighted moving covariance. Its derivation from the standard covariance formula and the incremental exponentially weighted variance formula seems quite straightforward. I think this would also be interesting to implement. It would require a new class since we have to push X and Y at the same time. For calculating the covariance the exponential mean is required, and for calculating a correlation out of it the variances as well, so I could reuse the ExponentialStatistics class here.
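
For reference, the covariance update from that post mirrors the variance recurrence. A rough sketch (class and attribute names are made up for illustration, not a proposed API):

```python
# Illustrative only: exponentially weighted moving covariance of paired values.
class ExponentialCovarianceSketch:
    def __init__(self, decay=0.9):
        self._decay = decay          # weight kept by the old statistics
        self._mean_x = 0.0
        self._mean_y = 0.0
        self._covariance = 0.0

    def push(self, x, y):
        alpha = 1.0 - self._decay    # weight of the new pair
        diff_x = x - self._mean_x    # deviations from the *old* means
        diff_y = y - self._mean_y
        self._mean_x += alpha * diff_x
        self._mean_y += alpha * diff_y
        # Same shape as the variance update, with diff_x * diff_y in place of diff ** 2.
        self._covariance = self._decay * (self._covariance + alpha * diff_x * diff_y)
```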

alanmazankiewicz commented 3 years ago

By the way, I'd suggest changing the following lines in the README:

```python
>>> alpha_stats = ExponentialStatistics()
>>> for num in range(10):
...     alpha_stats.push(num)
>>> beta_stats = ExponentialStatistics(decay=0.1)
>>> for num in range(10):
...     beta_stats.push(num)
>>> exp_stats = alpha_stats + beta_stats
>>> exp_stats.decay = 0.9
>>> exp_stats.mean()
0.0
```

to:

```python
>>> alpha_stats = ExponentialStatistics(iterable=range(10))
>>> beta_stats = ExponentialStatistics(decay=0.1)
>>> for num in range(10):
...     beta_stats.push(num)
>>> exp_stats = alpha_stats * 0.5 + beta_stats * 0.5
>>> exp_stats.decay = 0.9
>>> exp_stats.mean()
0.0
```

This way we also demonstrate the iterable constructor argument and scalar multiplication of statistics.

grantjenks commented 3 years ago

There are a few threads here. I'll try to get them all.

  1. Yes, ExponentialMovingStatistics is easily searched. I like it though it's a bit long. Change welcome.

  2. Yes, there would be a decay rate over a delay duration. So if I set decay=0.01 and delay=3600 then the current statistics decay to 1% of their original values over 1 hour. The dynamic decay rate would be "decay ** ((current time - previous time) / delay)" (see the sketch after this list).

  3. I would simply name the new parameter "delay". And yes, default None would preserve the current behavior. (You'll notice that I changed a few parameter names and made decay into a property.)

  4. I don't think it's worth worrying about units. Just use seconds like the time.time() API does. There are other libraries to help developers with unit conversions.

  5. Adding skewness and kurtosis to the exponential moving statistics is interesting but to be honest I've never used them in the original Statistics object.

  6. I updated the README with most of your suggestions and got the doctests working. I hadn't noticed before that add and multiply just apply to the mean and variance directly. That's quite a bit different than the Statistics object. Kinda feels like a gotcha. Would it make sense to track an effective "length" for the ExponentialStatistics object by simply adding 1 each time a new value is added and applying the decay to the previous value? Seems wrong otherwise to treat mean and variance like regular integers.
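
To make points 2 and 6 concrete, something along these lines (names and defaults are placeholders, not the actual runstats API):

```python
import time

# Rough sketch: delay-normalised dynamic decay (point 2) plus a decayed "length" (point 6).
class DelayDecayedMeanSketch:
    def __init__(self, decay=0.01, delay=3600.0):
        self._decay = decay        # fraction of the old statistics remaining after `delay` seconds
        self._delay = delay        # seconds over which the statistics decay to `decay`
        self._mean = 0.0
        self._length = 0.0         # decayed count of pushed values
        self._previous = None      # timestamp of the previous push

    def push(self, value):
        now = time.time()
        if self._previous is None:
            rate = 0.0             # first value: no history to keep
        else:
            # Point 2: decay ** ((current time - previous time) / delay)
            rate = self._decay ** ((now - self._previous) / self._delay)
        self._previous = now
        self._mean = rate * self._mean + (1.0 - rate) * value
        # Point 6: decay the previous count, then count the new value.
        self._length = rate * self._length + 1.0
```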

alanmazankiewicz commented 3 years ago

For (1 to 4) + 6: all right! For 5: although the post is about skewness and kurtosis, what I wanted to highlight there is the exponential covariance, which is just a side note in the post (in the EDIT part of the question). I would find it interesting to implement an exponential covariance (and thus an exponential correlation) based on the formulation presented in the post. It is a straightforward derivation from the formula in the Finch paper and the definition of the covariance.

grantjenks commented 3 years ago

Fyi, I've just released v2.0.0 which includes the new way I use Cython annotations. There's no longer a fast.pyx. This makes things a bit simpler to maintain in the long run.