mathnet / mathnet-numerics

Math.NET Numerics
http://numerics.mathdotnet.com
MIT License

RunningStatistics Pull / Pop #264

Closed jaredbroad closed 9 years ago

jaredbroad commented 9 years ago

It would be great if running statistics also implemented a pop function, so you could push a window of say 5000 points and pop the old values off.

You'd need to store/queue the points, or to avoid memory issues allow the user to provide the value to pop off the end.
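For illustration, the second option (the caller supplies the value to pop) can be sketched by inverting Welford's update. This is a minimal Python sketch, not the Math.NET API; the class name `RunningStats` and its `push`/`pop` methods are hypothetical.

```python
import math


class RunningStats:
    """Running mean/variance with push and a caller-supplied pop (hypothetical API)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        # Standard Welford update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def pop(self, x):
        # Inverse of the Welford update; x must be a value pushed earlier.
        if self.n == 0:
            raise ValueError("nothing to pop")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        old_mean = self.mean
        self.mean = (self.n * old_mean - x) / (self.n - 1)
        self.m2 -= (x - old_mean) * (x - self.mean)
        self.n -= 1

    @property
    def variance(self):
        # Sample variance; NaN when fewer than two samples are present.
        return self.m2 / (self.n - 1) if self.n > 1 else math.nan


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.push(value)
stats.pop(1.0)  # the caller remembers which value leaves the window
```

The trade-off is that nothing is stored, but the caller is responsible for passing back exactly the value that was pushed, and rounding errors accumulate over many push/pop pairs.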

cuda commented 9 years ago

I started to tinker with this at https://github.com/cuda/mathnet-numerics/tree/moving_stats. It is missing skewness and kurtosis. I couldn't find or come up with rolling algorithms for them. Does anyone have a reference?

Some comments:

1) The code stores the current window of data. I think the user experience is better this way, and in most cases it won't be a memory issue: a window size of 1000 has roughly 8 KB of overhead.
2) It is only lightly tested, and I noticed it is slightly less accurate than RunningStatistics (RS). For a quick test, I set the window size to 5, pushed a million random numbers, and then set the last 5 values to 11.11, 22.22, 33.33, 44.44, 55.55. For the mean and variance, RS gives the correct answers of 33.33 and 308.58025, while the moving statistics (MS) give 33.3300000000014 and 308.58025000001641 (still off even if we only push two values first).
3) Would an option to recompute the statistics with each push, instead of updating them, be useful? It would optionally trade speed for a little more accuracy.
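The update-versus-recompute trade-off in points 2) and 3) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code in the moving_stats branch; the class name `MovingStats` and the `recompute` flag are hypothetical.

```python
import math
import random
from collections import deque


class MovingStats:
    """Windowed mean/variance that stores the window (hypothetical sketch)."""

    def __init__(self, window, recompute=False):
        self.window = window
        self.recompute = recompute  # True: recompute from the stored window on each push
        self.values = deque()
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def push(self, x):
        self.values.append(x)
        old = self.values.popleft() if len(self.values) > self.window else None
        n = len(self.values)
        if self.recompute:
            # O(window) per push, but no accumulated rounding drift.
            self.mean = math.fsum(self.values) / n
            self.m2 = math.fsum((v - self.mean) ** 2 for v in self.values)
        elif old is None:
            # Window still filling: plain Welford update.
            delta = x - self.mean
            self.mean += delta / n
            self.m2 += delta * (x - self.mean)
        else:
            # Window full: replace `old` by `x` in O(1).
            new_mean = self.mean + (x - old) / n
            self.m2 += (x - old) * (x - new_mean + old - self.mean)
            self.mean = new_mean

    @property
    def variance(self):
        n = len(self.values)
        return self.m2 / (n - 1) if n > 1 else math.nan


# Same experiment as in point 2), with fewer samples to keep it quick.
random.seed(0)
updating = MovingStats(5)
recomputing = MovingStats(5, recompute=True)
for _ in range(10000):
    sample = random.random()
    updating.push(sample)
    recomputing.push(sample)
for sample in [11.11, 22.22, 33.33, 44.44, 55.55]:
    updating.push(sample)
    recomputing.push(sample)
print(updating.mean, updating.variance)
print(recomputing.mean, recomputing.variance)
```

The updating variant carries rounding error from every sample that has ever passed through the window, which is why its results can differ in the last digits from a fresh computation over the final five values.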

Thanks

cdrnet commented 9 years ago

This would be very useful even without skewness and kurtosis as they are much less used anyway.

How do we want to deal with NaNs? For all other statistics we followed the standard behavior to return NaN if either not enough data is there for a particular measure or if at least one sample is NaN. But in the case of moving statistics, the NaN samples will move out of the window at some point. To be consistent we should return NaN while the NaN sample is still within the window, but not any longer. But for this to work we'd have to treat NaN samples specially, e.g. to track the "age" of the last NaN sample (to return NaN if the age is still within the window) and instead of updating the internal fields simply reset them when NaN is pushed.
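One way to read the "age of the last NaN" idea is sketched below for the mean only. This is a hedged Python illustration, not the commit in the moving_stats branch; `NanAwareMovingMean` and `since_nan` are hypothetical names.

```python
import math
from collections import deque


class NanAwareMovingMean:
    """Moving mean that reports NaN while a NaN sample is still inside the window."""

    def __init__(self, window):
        self.window = window
        self.values = deque()    # samples pushed since the last NaN, at most `window`
        self.total = 0.0
        self.since_nan = window  # age of the last NaN, capped at `window`; start clean

    def push(self, x):
        if math.isnan(x):
            # Reset the internal fields instead of updating them.
            self.values.clear()
            self.total = 0.0
            self.since_nan = 0
            return
        self.since_nan = min(self.since_nan + 1, self.window)
        self.values.append(x)
        self.total += x
        if len(self.values) > self.window:
            self.total -= self.values.popleft()

    @property
    def mean(self):
        # NaN while the last NaN is still within the window, or when there is no data.
        if self.since_nan < self.window or not self.values:
            return math.nan
        return self.total / len(self.values)
```

Because the fields are reset on NaN, the accumulated state never contains a NaN, and by the time the NaN has aged out the window holds exactly the samples pushed after it.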

Regarding 1): it seems to me that storing the window is a requirement if we want to provide any kind of order statistics, including min/max.

jaredbroad commented 9 years ago

For your interest: the application will pass streaming financial data through the library and generate statistics for a range of periods, e.g. variance over 1 hr, 8 hrs, 5 days, and 30 days, to calculate the Sharpe ratio over each period for https://github.com/QuantConnect/Lean

cdrnet commented 9 years ago

Related: http://mathnetnumerics.codeplex.com/discussions/622048

@cuda I've added a commit on top of yours in the moving_stats branch, showing one way NaN could be handled.

cdrnet commented 9 years ago

@jaredbroad thanks for the context. Are the sample rates known to be fixed, so you can use fixed-length windows for 1h, 8h etc.? And are all the periods updated in quasi-real time (i.e. the statistics change on each sample)?

cdrnet commented 9 years ago

We may need something similar to the NaN handling code to deal with +/- infinity as well.

jaredbroad commented 9 years ago

We'd use relatively fixed sampling periods (sampling equity every 2 seconds), with statistics updating on GUI maybe every minute or so.
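As a rough sizing sketch for fixed-length windows at this sample rate: the arithmetic below assumes continuous sampling around the clock and 8 bytes per stored sample, neither of which is stated in the thread.

```python
SAMPLE_PERIOD_S = 2   # one equity sample every 2 seconds
BYTES_PER_SAMPLE = 8  # assumption: one double-precision value per sample

# Assumption: periods are wall-clock durations with no trading-hour gaps.
PERIODS_S = {
    "1h": 3600,
    "8h": 8 * 3600,
    "5d": 5 * 86400,
    "30d": 30 * 86400,
}

window_sizes = {name: secs // SAMPLE_PERIOD_S for name, secs in PERIODS_S.items()}
memory_bytes = {name: n * BYTES_PER_SAMPLE for name, n in window_sizes.items()}

for name in PERIODS_S:
    print(name, window_sizes[name], "samples,", memory_bytes[name], "bytes")
```

Under these assumptions the 30-day window is on the order of a million samples, so storing the window is still feasible but no longer negligible.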


cuda commented 9 years ago

> How do we want to deal with NaNs? [...] instead of updating the internal fields simply reset them when NaN is pushed.

Right, I didn't think about NaNs. That will work. Folks that want a different behavior such as skipping or interpolating can do so before pushing the data.

> We may need something similar to the NaN handling code to deal with +/- infinity as well.

I'll work on that along with some broader tests.

cuda commented 9 years ago

Added support for negative and positive infinity.

It works as follows:

  1. If there is a positive infinity in the window with no NaNs or negative infinity: mean, variance, and std dev are positive infinity, and max value is positive infinity.
  2. If there is a negative infinity in the window with no NaNs or positive infinity: mean is negative infinity, variance and std dev are NaN, and min value is negative infinity.
  3. If there is both a positive and a negative infinity in the window with no NaN: mean, variance, and std dev are NaN, min value is negative infinity, and max value is positive infinity.
  4. If there is a positive and/or negative infinity together with a NaN: the NaN rules apply.
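The four cases can be restated as a small reference function. This is a Python sketch of the rules as listed, not the library code; `window_summary` is a hypothetical name, the second case is read as the negative-infinity case, and returning NaN for min/max when a NaN is present is an assumption.

```python
import math


def window_summary(values):
    """Return (mean, variance, minimum, maximum) for one window of samples."""
    nan = math.nan
    if not values or any(math.isnan(v) for v in values):
        # Case 4 (and plain NaN): the NaN rules apply. Assumption: min/max are NaN too.
        return nan, nan, nan, nan

    has_pos = any(v == math.inf for v in values)
    has_neg = any(v == -math.inf for v in values)
    lo, hi = min(values), max(values)

    if has_pos and has_neg:
        # Case 3: both infinities present.
        return nan, nan, lo, hi
    if has_pos:
        # Case 1: only positive infinity.
        return math.inf, math.inf, lo, hi
    if has_neg:
        # Case 2: only negative infinity (variance is NaN, as listed).
        return -math.inf, nan, lo, hi

    # Ordinary finite window: sample mean and sample variance.
    n = len(values)
    mean = math.fsum(values) / n
    variance = math.fsum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else nan
    return mean, variance, lo, hi
```

A function like this is mainly useful as a test oracle: an incremental implementation can be checked against it window by window.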

I took out skewness and kurtosis until I can figure out how to compute them.

cdrnet commented 9 years ago

Thanks Marcus!

Released in v3.7.0.