grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/

Support for removing elements #22

Closed paulo-raca closed 3 years ago

paulo-raca commented 6 years ago

Thank you for this library, I love it and I've learned a lot from it :heart:

Do you think it would be possible to support removal of elements? I'm working on a sliding-window problem, and this feature would be terrific!

I've found an implementation for this on https://lingpipe-blog.com/2009/07/07/welford-s-algorithm-delete-online-mean-variance-deviation/. It doesn't support kurtosis and skewness, but maybe it can be extended?
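For reference, here is a minimal sketch of what removal could look like for the mean and variance only, obtained by running the Welford update in reverse. The names `RemovableStats` and `sliding_variance` are hypothetical and are not part of runstats; skewness, kurtosis, min, and max are not handled.

```python
from collections import deque


class RemovableStats:
    """Hypothetical helper (not part of runstats): running mean and
    variance with support for removing a previously pushed value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def push(self, x):
        # Standard Welford update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        # Inverse of push(); x must be a value that was pushed earlier.
        if self.n == 0:
            raise ValueError("no values to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        mean_with = self.mean
        mean_without = (self.n * mean_with - x) / (self.n - 1)
        self.m2 -= (x - mean_with) * (x - mean_without)
        # Rounding can push m2 slightly below zero; clamp it.
        self.m2 = max(self.m2, 0.0)
        self.mean = mean_without
        self.n -= 1

    def variance(self, ddof=1):
        if self.n - ddof <= 0:
            raise ValueError("not enough values")
        return self.m2 / (self.n - ddof)


def sliding_variance(values, window):
    """Sample variance of each full window of `window` consecutive values."""
    stats = RemovableStats()
    buf = deque()
    out = []
    for x in values:
        buf.append(x)
        stats.push(x)
        if len(buf) > window:
            stats.remove(buf.popleft())
        if len(buf) == window:
            out.append(stats.variance())
    return out
```

Because removal subtracts from the accumulated sums, rounding error can build up over millions of push/remove cycles, so rebuilding the statistics from the buffered window every so often may be worth considering.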

(Obviously, min and max cannot work in this scenario.)

grantjenks commented 6 years ago

Overall, sure, pull request welcome.

I’m not sure what the best approach is for the API, or for min/max and any other statistics that are no longer valid after a value is removed.

It would be nice to support skew and kurtosis if possible. You could try emailing John Cook and see if he’ll share the solutions with you (if there are solutions).

How big is your sliding window? Maybe you could just have “n” Statistics objects and throw away/start anew on every input.

I originally used this for hourly/daily aggregate statistics so creating a new object every day/hour was cheap.

paulo-raca commented 6 years ago

My window will be a few thousand elements, on a dataset with a few million records. Not terribly large, but throwing everything away and restarting on every input won't cut it.

Right now, my plan B is to use hierarchical summaries: compute a summary of every 2 records, aggregate those into a summary of every 4 records, then every 8, and so on. Finally, aggregate several sub-ranges to obtain the summary of the window. This approach is fast and numerically stable, but it is over-complicated and uses much more memory.
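A rough sketch of the hierarchical idea, assuming each summary holds only a count, a mean, and a sum of squared deviations, and that two summaries are combined with the usual pairwise merge formula. The names `Summary`, `merge`, `build_tree`, and `query` are made up for illustration and are not runstats APIs; the tree layout is just one possible way to store the 2/4/8-record summaries.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Summary(NamedTuple):
    n: int       # number of records summarized
    mean: float  # mean of those records
    m2: float    # sum of squared deviations from the mean


EMPTY = Summary(0, 0.0, 0.0)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine the summaries of two disjoint sets of records."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def build_tree(values: Sequence[float]) -> Tuple[List[Summary], int]:
    """Binary tree of summaries: leaves hold single records, and each
    parent summarizes its two children (2, 4, 8, ... records)."""
    size = 1
    while size < len(values):
        size *= 2
    tree = [EMPTY] * (2 * size)
    for i, x in enumerate(values):
        tree[size + i] = Summary(1, float(x), 0.0)
    for i in range(size - 1, 0, -1):
        tree[i] = merge(tree[2 * i], tree[2 * i + 1])
    return tree, size


def query(tree: List[Summary], size: int, lo: int, hi: int) -> Summary:
    """Summary of values[lo:hi], merging O(log n) precomputed nodes."""
    left, right = EMPTY, EMPTY
    lo += size
    hi += size
    while lo < hi:
        if lo & 1:
            left = merge(left, tree[lo])
            lo += 1
        if hi & 1:
            hi -= 1
            right = merge(tree[hi], right)
        lo //= 2
        hi //= 2
    return merge(left, right)
```

With this layout, the window covering records `i` through `i + w - 1` would be `query(tree, size, i, i + w)`, and the population variance is `m2 / n`. The tree stores roughly two summaries per record, which matches the memory cost mentioned above.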

grantjenks commented 3 years ago

Pull request welcome.