Support for non-rectangular binning?

nsmith- commented 5 years ago

Particularly for CategoryBinning axes, it would be nice to only save a dense binning for the tuple of categories corresponding to a valid bin, rather than the entire product of category values per axis.

jpivarski commented 5 years ago

I thought about that—there are cases where you'd want to bin a space in a non-rectangular way. For instance, in x < 0 the y bins are finely spaced, in 0 <= x < 1, the y bins are broadly spaced, and in 1 <= x, the y bins cover a different range, etc.

This can be expressed in the current framework as a combination of Histograms and Collections. Both of these have a list of Axis that divide the space in a Cartesian product, but the Collections also define a set of children that do not have to have the same Axis list. That way, you could have a Collection of three Histograms: one finely binned in y, filled with x < 0, another widely binned in y with 0 <= x < 1, and another with y bins in a different range for 1 <= x. The non-rectangularness is expressible, though the user-facing library might call these a single histogram while Aghast calls it three Histograms.
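To make the three-Histogram idea concrete, here is a minimal sketch in plain Python. It does not use the Aghast API: `Hist1D`, `linear_edges`, `collection`, and `fill` are hypothetical names, and the bin edges are made-up values chosen only to illustrate fine, broad, and shifted y binnings.

```python
from bisect import bisect_right

def linear_edges(low, high, n):
    """Return n + 1 equally spaced bin edges from low to high."""
    return [low + (high - low) * i / n for i in range(n + 1)]

class Hist1D:
    """Hypothetical 1D histogram with half-open bins [edge_i, edge_i+1)."""
    def __init__(self, edges):
        self.edges = edges
        self.counts = [0] * (len(edges) - 1)

    def fill(self, value):
        i = bisect_right(self.edges, value) - 1
        if 0 <= i < len(self.counts):  # values outside the range are dropped
            self.counts[i] += 1

# A "collection" keyed by strings: each child has its own y binning.
collection = {
    "x < 0": Hist1D(linear_edges(-5.0, 5.0, 100)),     # finely spaced y bins
    "0 <= x < 1": Hist1D(linear_edges(-5.0, 5.0, 10)),  # broadly spaced y bins
    "1 <= x": Hist1D(linear_edges(0.0, 20.0, 20)),      # different y range
}

def fill(x, y):
    """Route (x, y) to the child whose x region contains x, then bin y."""
    if x < 0:
        key = "x < 0"
    elif x < 1:
        key = "0 <= x < 1"
    else:
        key = "1 <= x"
    collection[key].fill(y)
```

Note that the string keys only describe the x regions; the routing logic lives separately in `fill`, so nothing in the stored data says what the keys mean.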

Ah, but in that case, you'd really prefer the children of the Collection to be "named" with elements of a PredicateBinning, rather than strings. Maybe I should add a sibling of Collection that does that: instead of keying the things it contains with strings, it should key them with a binning. That would carry more semantic information.
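As a rough sketch of what keying by predicates instead of strings might look like (again plain Python, not the Aghast API; `Predicate`, `PredicateKeyedCollection`, and the `test` callable are assumptions made for illustration):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class Predicate:
    expression: str                # human-readable condition, carries the semantics
    test: Callable[[float], bool]  # hypothetical evaluator, used only in this sketch

@dataclass
class PredicateKeyedCollection:
    """Hypothetical container whose children are keyed by predicates."""
    entries: List[Tuple[Predicate, Any]] = field(default_factory=list)

    def add(self, predicate, obj):
        self.entries.append((predicate, obj))

    def lookup(self, x):
        """Return every child whose predicate accepts x (predicates may overlap)."""
        return [obj for predicate, obj in self.entries if predicate.test(x)]

# Placeholder strings stand in for the three differently binned histograms.
by_x = PredicateKeyedCollection()
by_x.add(Predicate("x < 0", lambda x: x < 0), "fine y bins")
by_x.add(Predicate("0 <= x < 1", lambda x: 0 <= x < 1), "broad y bins")
by_x.add(Predicate("1 <= x", lambda x: 1 <= x), "shifted y range")
```

Here the key itself records the selection, so a reader of the data can tell which region each child covers without consulting the filling code.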

jpivarski commented 5 years ago

I had been thinking about this, and although a sister to Collection would add the functionality in a backwards-compatible way, it would be simpler (though a breaking change) to generalize the Collection members from a string → objects mapping to a binning → objects mapping, where the binning is usually CategoryBinning. The case you want would be PredicateBinning.

Since it's still the early days, I'm going to change that. A lot of tests will need to be touched, but it will be worth it in the end.

The Axis system would be like this:

Collection has a sequence of Axis that are the outermost Cartesian splits.
Collection lookup has a single Axis that is a binning for its individually defined objects—one Histogram definition for each bin.
Histograms have a sequence of Axis that are the innermost Cartesian splits.

That way, you can build arbitrary nestings of "ands" and "ors" for splitting, by nesting several layers of Collections.
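One way to picture the "ands" and "ors" is to count bins: Cartesian axes multiply (and), while the lookup axis adds up its separately defined children (or). The sketch below is plain Python with hypothetical `Histogram` and `Collection` classes that only track bin counts per axis; it is not the Aghast data model, and the string keys stand in for bins of the lookup binning.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List, Union

@dataclass
class Histogram:
    axis_sizes: List[int]  # innermost Cartesian splits (number of bins per axis)

    def num_bins(self):
        return prod(self.axis_sizes)

@dataclass
class Collection:
    axis_sizes: List[int]                              # outermost Cartesian splits
    lookup: Dict[str, Union["Collection", Histogram]]  # one child per lookup bin

    def num_bins(self):
        # "and": multiply by the outer axes; "or": sum over the lookup children.
        return prod(self.axis_sizes) * sum(c.num_bins() for c in self.lookup.values())

# The non-rectangular example: three y binnings selected by x region.
non_rectangular = Collection(
    axis_sizes=[],
    lookup={
        "x < 0": Histogram([100]),
        "0 <= x < 1": Histogram([10]),
        "1 <= x": Histogram([20]),
    },
)

# Nesting: an outer Collection splits everything by a 2-bin axis.
nested = Collection(axis_sizes=[2], lookup={"all": non_rectangular})
```

With these made-up sizes, `non_rectangular.num_bins()` is 100 + 10 + 20 = 130 and `nested.num_bins()` is 2 × 130 = 260.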

scikit-hep / aghast

Support for non-rectangular binning? #10