janpipek / physt

Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.
MIT License
129 stars 15 forks source link

approximate histograms #58

Closed belm0 closed 1 year ago

belm0 commented 5 years ago

I'm following the paper (http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) implemented by https://github.com/carsonfarmer/streamhist, and the notion of approximate histograms seems elegant and efficient.

After seeing the internals of streamhist (trying to fix bugs) and reading the paper, I can imagine ways to make a better implementation: e.g. much more efficient discovery of bins to be joined, and avoiding temporary lists when possible. Also the code seems overly complex, partially due to features like "bin freezing" which try to workaround poor bin joining performance.

Anyway since streamhist is defunct, I'm thinking about trying an implementation. I wonder if this kind of histogram would fit into physt (and if sortedcollections would be reasonable as a dependency).

belm0 commented 5 years ago

I've prototyped a new implementation of the Ben-Haim streaming histogram in pure Python, no package dependencies. Given a max bin count of 64 (streamhist default), adding a value takes 12 usec vs. 36 usec for streamhist (on 5 year old MacBook Air). That's a compelling speed given that the algorithm dynamically adjusts bins on every insert at steady state. It's possible to maintain that insert speed at much larger max bin counts (say 1,000), but the implementation gets significantly more complex (maintain a sorted container of bin distances).

More interesting is how flexible the algorithm is regarding bin merging strategies. The strategy in the Ben-Haim paper merges nearest bins, which I suppose preserves the most information about a distribution. However with a simple tweak it can instead maintain equal bin sizes-- ideal for arbitrary percentile queries. Then with some cheating I've made a "best effort" equal bin size strategy that is amortized O(1) regardless of max bin count, with fill running at 3 usec. Other merge strategies are possible, such as providing specific percentiles using a minim number of bins (e.g. accurate 50% and 90% using only 5 bins).

belm0 commented 5 years ago

implementation update

I improved the fill performance of the simple Python implementation and added a numba implementation:

(I wonder if physt would take on numba as extra_requires.)

I've implemented equivalent of numpy.quantile(). It matches quantile() output exactly until max_bins is reached, then transitions gracefully to approximate quantiles.

janpipek commented 1 year ago

Sorry, this will probably never be implemented :-/