scikit-hep / aghast

Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.
BSD 3-Clause "New" or "Revised" License

Slicing categorical axis #21

Closed rob-tay closed 5 years ago

rob-tay commented 5 years ago

Unless I'm misunderstanding how slicing should work, slicing on a categorical axis does not have the desired effect when the histogram also has another axis:

import numpy
import aghast

h = aghast.Histogram([aghast.Axis(aghast.CategoryBinning(["cat1", "cat2"])),
                      aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
                    aghast.UnweightedCounts(
                        aghast.InterpretedInlineBuffer.fromarray(
                            numpy.array([[  9,  25,  29,  35,  54,  67,  60,  84,  80,  94],
                                         [ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122]]))))

To select only the counts for "cat2", I assumed this would work:

h_cat2 = h.loc['cat2']

However, that just produces a histogram with the counts for both categories:

h_cat2.counts.array
array([[ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122],
       [  9,  25,  29,  35,  54,  67,  60,  84,  80,  94]])

jpivarski commented 5 years ago

First, a word of warning: following a long chain of discussion on https://gitter.im/HSF/PyHEP-histogramming , loc, iloc, and the __getitem__ method on counts will be removed. Aghast was overstepping its scope as a format converter, and this job will be done by histogramming libraries like boost-histogram and hist. We've been talking a lot about what a good syntax for that would be.

Your example had me convinced for a while that this was a bug, but it actually isn't. It comes from the fact that some ("low-level") array views always include under/overflow bins, while others ("high-level") include them only if you ask, and put them where you ask. You used the low-level array view, counts.array. The selection "cat2" puts all other categories (there's only one, "cat1") into an overflow bin. To use the high-level view, do counts[:] (no overflow) or counts[:numpy.inf] (with overflow).

>>> h.loc["cat2"].counts[:]
array([[ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122]])
>>> h.loc["cat2"].counts[:numpy.inf]
array([[ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122],
       [  9,  25,  29,  35,  54,  67,  60,  84,  80,  94]])
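
As a rough illustration of the behaviour described above, here is a minimal sketch in plain Python. It is not aghast's implementation: the function names select_category and high_level_view are hypothetical, and it assumes that the overflow row is the bin-by-bin sum of every category that was not selected (with only one other category, that is just the "cat1" row).

```python
# Hypothetical sketch, NOT aghast's code: it only mimics the behaviour
# described in this thread using plain Python lists.

def select_category(counts, categories, wanted):
    """Low-level result: the selected category's row, followed by one
    overflow row that collects every category that was not selected
    (assumed here to be a bin-by-bin sum)."""
    selected = [row for cat, row in zip(categories, counts) if cat == wanted]
    rest = [row for cat, row in zip(categories, counts) if cat != wanted]
    if not selected:
        raise KeyError(wanted)
    nbins = len(counts[0])
    overflow = [sum(row[i] for row in rest) for i in range(nbins)]
    return selected + [overflow]


def high_level_view(low_level, with_overflow=False):
    """High-level result: drop the trailing overflow row unless asked for it."""
    return low_level if with_overflow else low_level[:-1]


categories = ["cat1", "cat2"]
counts = [[9, 25, 29, 35, 54, 67, 60, 84, 80, 94],
          [99, 119, 109, 109, 95, 104, 102, 106, 112, 122]]

low = select_category(counts, categories, "cat2")
print(high_level_view(low))                      # only the "cat2" row
print(high_level_view(low, with_overflow=True))  # "cat2" row, then the overflow row
```

In this sketch, the low-level list plays the role of counts.array, and the two high_level_view calls play the roles of counts[:] and counts[:numpy.inf].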

If we're using these selections to perform format conversions, we'll need to transition to some method that is used only "internally" for the format conversions, not by users.