scikit-hep / aghast

Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.

Q: Sparse-ness and merging for IntegerBinning and CategoryBinning #20

Closed benkrikler closed 5 years ago

benkrikler commented 5 years ago

I've been working on adapting our pandas-based histogram filling code to produce aghasts, and was trying to understand IntegerBinning and CategoryBinning better.

In our current version, which produces pandas dataframes, if a user asks to bin on a variable but doesn't provide a binning scheme, we take that to mean the variable is already discrete (i.e. integral, categorical, or a finite number of floats; the latter is a little dangerous, but so far it has not caught anyone out). This works nicely: we can distribute the filling jobs with such configurations, produce sparsely binned dataframes in which a bin is present only if its value occurred at least once, and then merge these together.

Switching to aghast, I can partially solve this by deducing all possible categories, or the min / max value of an integer variable, for each chunk and only then creating the Ghast, but then I can't be sure that these will be the same between distributed chunks of data.

I've had a look in the aghast code and the specification, and it looks to me like IntegerBinning isn't a sparse implementation. I've also looked at the merging of histograms (the _add method in particular) but struggled to understand the merging process for such a situation. What would happen if two Histograms are combined, both with an IntegerBinning axis but with different min/max values? Will the result be the union of these axes or the intersection? And similarly for the set of values in two CategoryBinning axes: would they be unioned or intersected if we add such histograms?

(I wasn't sure whether this should be a question on gitter, email, or an issue, but since it's potentially a feature request, I thought this might be the best place; please say if you'd prefer I use somewhere else for such questions in the future! :smile: )

jpivarski commented 5 years ago

This is a good place for questions! (Though I wish GitHub further subdivided issues/PRs into issues/PRs/questions, since it can be convenient to keep the "something's broken" distinct from the "tell me about...")

The intended behavior is that + merges onto the smallest interval that covers both operands. That's why the implementation is complicated. Let me try it on a few examples to either illustrate or discover a bug. :)

>>> import aghast, numpy
>>> h1 = aghast.Histogram([aghast.Axis(aghast.IntegerBinning(0, 9))],
...     aghast.UnweightedCounts(aghast.InterpretedInlineBuffer.fromarray(numpy.ones(10))))
>>> h2 = aghast.Histogram([aghast.Axis(aghast.IntegerBinning(5, 14))],
...     aghast.UnweightedCounts(aghast.InterpretedInlineBuffer.fromarray(numpy.ones(10))))
>>> hsum = h1 + h2
>>> hsum.axis[0].binning.toCategoryBinning().categories
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
>>> hsum.counts.array
array([1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 1., 1., 1., 1., 1.])

All binning types have toOtherBinning() methods, so converting to CategoryBinning is a little trick to get the bins in a human-readable format (strings). Here, we see that one binning from 0 to 9 (IntegerBinning's min and max are inclusive on both ends) plus another binning from 5 to 14 yields a binning from 0 to 14, and the middle 5 bins overlap (count of 2, rather than 1).

Now if there's a gap,

>>> h3 = aghast.Histogram([aghast.Axis(aghast.IntegerBinning(20, 29))],
...     aghast.UnweightedCounts(aghast.InterpretedInlineBuffer.fromarray(numpy.ones(10))))
>>> hsum2 = h1 + h3
>>> hsum2.axis[0].binning.toCategoryBinning().categories
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15',
 '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29']
>>> hsum2.counts.array
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
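
The covering-interval behavior shown in the two examples above can be sketched in plain Python. This is only an illustration of the arithmetic, not aghast's actual _add implementation: the function name merge_integer_hists and the (min, counts) representation are assumptions made for this sketch.

```python
def merge_integer_hists(min1, counts1, min2, counts2):
    """Add two dense integer-binned histograms onto the smallest covering range.

    Hypothetical helper, not part of aghast. Each histogram is (min, counts),
    where counts[i] belongs to the integer min + i, so the inclusive max is
    min + len(counts) - 1.
    """
    max1 = min1 + len(counts1) - 1
    max2 = min2 + len(counts2) - 1
    lo, hi = min(min1, min2), max(max1, max2)

    # Bins that neither input covers simply stay at 0.
    merged = [0.0] * (hi - lo + 1)
    for start, counts in ((min1, counts1), (min2, counts2)):
        for i, c in enumerate(counts):
            merged[start - lo + i] += c
    return lo, hi, merged


# Overlapping ranges: 0..9 plus 5..14
lo, hi, merged = merge_integer_hists(0, [1.0] * 10, 5, [1.0] * 10)

# Ranges with a gap: 0..9 plus 20..29
gap_lo, gap_hi, gap_merged = merge_integer_hists(0, [1.0] * 10, 20, [1.0] * 10)
```

With the same inputs as h1 + h2 and h1 + h3, this gives the ranges 0 to 14 and 0 to 29, with the gap bins 10 to 19 left at zero.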

This should also work for other binning types, such as CategoryBinning (which must also find a mutually compatible ordering; I remember spending time on making that ordering stable, in case the order of the categories in the left addend means something). I'm not going to demonstrate them all here because there are unit tests for that (tests/test_add.py).
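
To make the idea of a stable category ordering concrete, here is a minimal sketch in plain Python. It assumes the simplest stable rule (keep the left addend's order, then append unseen categories in the order the right addend lists them); aghast's real logic may differ in detail, and merge_category_hists is a hypothetical name, not an aghast function.

```python
def merge_category_hists(cats1, counts1, cats2, counts2):
    """Union two category-binned histograms, keeping the left order stable.

    Hypothetical helper for illustration only. Categories from the left
    operand keep their positions; categories seen only on the right are
    appended in the order they appear there.
    """
    merged_cats = list(cats1)
    merged_counts = list(counts1)
    index = {cat: i for i, cat in enumerate(merged_cats)}

    for cat, count in zip(cats2, counts2):
        if cat in index:
            # Shared category: add the counts together.
            merged_counts[index[cat]] += count
        else:
            # New category: extend the axis rather than dropping the bin.
            index[cat] = len(merged_cats)
            merged_cats.append(cat)
            merged_counts.append(count)
    return merged_cats, merged_counts


cats, counts = merge_category_hists(
    ["muon", "electron"], [3, 1],
    ["tau", "muon"], [2, 5],
)
```

The result is a union, not an intersection: "muon" stays first with the summed count, "electron" keeps its place, and "tau" is appended at the end.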

This merging behavior is inspired by Pandas, with two differences: nonexistent bins are filled with 0, not nan, and some axis types are dense, even when the histogram is multidimensional (e.g. IntegerBinning is a dense interval, as you can see above, but CategoryBinning and SparseRegularBinning are not). (Pandas's MultiIndex is effectively sparse in all dimensions.)

The TLDR is: you have to know the set of integers or categories before you construct a thread-local ghast, but not before combining all thread-local ghasts into a global ghast. Whatever data structure you're using for filling is not a ghast, but for merging, you use ghasts. So, you fill → convert to Aghast → combine all thread-local copies → convert to your favorite plotting format.
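
The fill → convert → combine sequence can be sketched end to end without aghast. Everything here is a stand-in: a collections.Counter plays the role of the sparse filling structure, a (min, counts) tuple plays the role of a thread-local ghast with an IntegerBinning axis, and fill_chunk, to_dense and merge_dense are hypothetical names chosen for this example.

```python
from collections import Counter
from functools import reduce


def fill_chunk(values):
    """Fill step: a sparse count of the integers seen in one chunk."""
    return Counter(values)


def to_dense(sparse):
    """Convert step: the range is only known after filling (non-empty chunk assumed)."""
    lo, hi = min(sparse), max(sparse)
    return lo, [sparse.get(v, 0) for v in range(lo, hi + 1)]


def merge_dense(a, b):
    """Combine step: add two (min, counts) pairs on the covering interval."""
    (lo_a, counts_a), (lo_b, counts_b) = a, b
    lo = min(lo_a, lo_b)
    hi = max(lo_a + len(counts_a), lo_b + len(counts_b))  # exclusive upper edge
    merged = [0] * (hi - lo)
    for start, counts in (a, b):
        for i, c in enumerate(counts):
            merged[start - lo + i] += c
    return lo, merged


# Three chunks, each of which sees a different range of integers.
chunks = [[1, 1, 3], [7, 8, 8], [3, 4]]
partials = [to_dense(fill_chunk(chunk)) for chunk in chunks]
total_min, total_counts = reduce(merge_dense, partials)
```

Each chunk fixes its own range only at conversion time, and the ranges do not need to agree across chunks because the combine step widens the axis as needed.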