scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Bayesian Blocks #5

Closed. cranmer closed this issue 6 years ago.

cranmer commented 6 years ago

Any thoughts about this type of data-dependent binning?

http://docs.astropy.org/en/stable/api/astropy.stats.bayesian_blocks.html

https://jakevdp.github.io/blog/2012/09/12/dynamic-programming-in-python/

jpivarski commented 6 years ago

Histogrammar had data-dependent binning at first. The problem was that I couldn't find any algorithms that could fill in parallel processes without communication and then combine the results in a way that is independent of how the data were distributed (the associative property).
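To make the associativity requirement concrete, here is a minimal sketch using plain NumPy (not histbook's or Histogrammar's API). The bin edges, seed, and chunk counts are arbitrary values chosen for illustration. With edges fixed in advance, per-chunk counts can be added in any grouping and still match the serial result; with per-chunk quantile edges (a stand-in for data-dependent binning in general, not Bayesian Blocks itself) the chunks do not even agree on where the bins are.

```python
import numpy as np

# Fixed bin edges agreed on before any data are seen (illustrative values).
edges = np.linspace(-5.0, 5.0, 21)

rng = np.random.default_rng(12345)
data = rng.normal(size=10_000)

def fill(chunk):
    """Histogram one chunk independently, with no communication."""
    counts, _ = np.histogram(chunk, bins=edges)
    return counts

# Two different ways of distributing the same data across workers.
split_a = np.array_split(data, 4)
split_b = np.array_split(data, 7)

total_a = sum(fill(c) for c in split_a)
total_b = sum(fill(c) for c in split_b)
serial = fill(data)

# Integer counts add exactly, so the partitioning does not matter.
assert np.array_equal(total_a, serial)
assert np.array_equal(total_b, serial)

# Data-dependent edges (here, per-chunk quantiles) differ from chunk to
# chunk, so the per-chunk histograms cannot simply be added together.
quantile_edges = [np.quantile(c, [0.0, 0.25, 0.5, 0.75, 1.0]) for c in split_a]
edges_agree = all(np.allclose(quantile_edges[0], q) for q in quantile_edges[1:])
```

Here `edges_agree` comes out false, which is the crux: a combine step for data-dependent bins would need some rebinning or re-clustering rule, and that rule is where the dependence on partitioning creeps in.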

Some, like a clustering algorithm that continuously merges the most populous bins (which I nabbed from the internals of Hive), were approximately independent of how the data were divided, but I had some extreme parallelism scenarios, like the GPU, where artifacts were visible. I decided to stick to conservatively exact algorithms (which only assume associativity of floating-point arithmetic, which is also not really exact but a much better approximation).
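As a small aside on the "not really exact" caveat, this sketch shows floating-point addition giving different results under different groupings. The numbers are just the usual textbook values; nothing here is specific to histbook.

```python
# Floating-point addition is only approximately associative:
# the grouping of the additions changes the last bit of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)       # False
print(abs(left - right))   # on the order of 1e-16
```

For weighted fills this means sums of weights can differ in the last few bits depending on how the data were split, which is a far smaller effect than bins moving around.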

histbook has an additional performance consideration: whereas Histogrammar's bins were Python objects floating around anywhere in memory, histbook uses a contiguous Numpy array for all axes except groupby and groupbin, which are data-determined (bins only exist if data in them are observed) but associative. They're implemented with dicts. There ought to be a significant performance drop between, say, bin and groupbin in speed and memory use. The same would be true or even worse for more exotic data-driven binnings.
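The contrast between the two storage layouts can be sketched as follows. This is an assumption-laden illustration in plain NumPy and the standard library, not histbook's actual implementation of `bin` or `groupbin`; the names `low`, `width`, `nbins`, `dense`, and `sparse` are made up for the example.

```python
import numpy as np
from collections import Counter

low, width, nbins = 0.0, 1.0, 10   # illustrative fixed binning
x = np.array([0.5, 1.5, 1.7, 9.9, 3.2, 3.3, 3.4])

# Map each value to an integer bin index. All values here are in range;
# a real implementation would also need underflow/overflow handling.
index = np.floor((x - low) / width).astype(np.int64)

# Dense layout: every bin is preallocated in one contiguous integer array.
dense = np.bincount(index, minlength=nbins)

# Sparse layout: a bin exists only if some value landed in it.
sparse = Counter(index.tolist())

# Both layouts combine associatively: arrays add elementwise,
# and Counter addition sums the counts key by key.
dense_merged = dense + dense
sparse_merged = sparse + sparse
```

Both versions hold the same counts, but the dense one is a single vectorized pass over a fixed block of memory, while the sparse one does a Python-level hash lookup per entry, which is where the expected speed and memory gap comes from.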

Unless there's a strong need to give up "exact" associativity and also accept more low-performance bonus binning methods, I'd like to stick to the fixed memory layout ones.

cranmer commented 6 years ago

That's what I expected, and I think it's totally reasonable. But it may be worth some Q/A-style comment in the documentation.

jpivarski commented 6 years ago

I'm not sure what you mean. Feedback from the documentation? Maybe this issue would be a good place to put a conversation, in which case I'd open it and label it a feature request or something. (Never used labels before...)

cranmer commented 6 years ago

I just meant in the style of "What histbook is / isn't", or a FAQ section in the documentation. Some users might be hoping for or expecting functionality like Bayesian Blocks (e.g. as in astropy).