My thoughts:
Project Pros:
Project is more common in histogramming, so users might look for it first. Slightly fewer characters.
Integrate Pros:
This describes exactly what you are doing (integrating over the axis), so it reads slightly better. Even more true when you set limits: 1:4:bh.integrate integrates from bin 1 to bin 4. Boost.Histogram and ROOT have project functions, but there you list the axes you want to keep - the integrate name highlights that here you are picking axes to remove, and keeping axes with :.
After being confused, I'm now slightly in the integrate camp.
Edit: I've been using UHI for so long that I've completely forgotten that Boost.Histogram does this correctly - you give remaining axes, not removed axes, exactly following the usage of the term. The comment I received must have been only about UHI's usage of project. I have no idea what I was thinking. Will correct the description above.
I think that "project" it's the more accurate term, mathematically.
The operation in question does integrate over a dimension of the distribution, but it "integrates completely," that is, from minus infinity to infinity, and it also changes the number of dimensions in the space under consideration. That's a special case of "integrate." I think that the histogramming library also has a "sum" for intermediate cases of integration (and there, "sum" is appropriate because it's limited to discrete measures—you have to start and stop on bin edges). If this sum does cover an entire dimension, it doesn't remove that dimension from consideration; it becomes a one-bin axis instead. (Sometimes, you want that.)
The use of "project" here is like geometrical perspective, or more generally, reducing a space to another space with a smaller number of dimensions. Mathematicians definitely use the word "project" for this—there's a whole field of "projective geometry" about spaces formed by normalizing all distances from a certain point (quantum mechanics, in which all wavefunctions are normalized, is an application of projective geometry; so is naked-eye astronomy, which sees angles but not distances). In linear algebra, if you apply a transformation of rank (n, m) where m < n, which takes an n-dimensional space and returns an m-dimensional space (I hope I haven't got that backward), it's called a projection.
The distinction between "project" and "integrate" is that projection always reduces the number of dimensions; "integration" might if it's from minus infinity to infinity and you choose to ignore the one-bin axis.
So as I understand it, the use of the word "project" is not a compromise, at odds with the mathematical usage, but is fully in line with the mathematical usage.
It does integrate (or project) over a range: min:max:project - if you leave off the values, it goes from -inf to inf. Integration in a mathematical context does remove a dimension - ∫_{-∞}^{∞} f(x, y) dx is a function of only y, regardless of what start/end points you give, such as ∫_2^3 f(x, y) dx. The possible endpoints are limited to bin edges and a finite number of bins is added, so yes, it is equivalent to a sum, but a) sum is a wider term that doesn't tend to be related to removing a dimension, b) we can't control sum, np.sum, and the other sums out there. That's a nice point in favor of project, actually - integrate is not as common as sum, but is more common than project. SciPy has an integrate module, for example.
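For instance, under the UHI slicing discussed in this thread (a sketch only - the exact tag spellings were still in flux at this point):

```python
import boost_histogram as bh

h = bh.histogram(bh.axis.regular(10, 0, 10), bh.axis.regular(10, 0, 10))

# Integrate the x axis out over its full range, keeping only the y axis:
h_y = h[::bh.project, :]

# Integrate the x axis out only between x = 2 and x = 3 (endpoints snap to bin edges):
h_y_partial = h[bh.loc(2):bh.loc(3):bh.project, :]
```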
Sums only work on a whole histogram - sum returns a number, not a histogram.
One remaining bin is a rather odd, special case, and it is fully covered by the existing language: bh.rebin(size), where "size" is the size of the current axis without flow bins. That's one of the benefits of having a composable language like UHI. You could come up with a mapping, bins A to bins B, but that's much harder to write (bin edges must map, etc.) and can be covered by the general UHI rebin language (not supported in boost-histogram, at least not any time soon).
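As a concrete sketch of that special case (assuming bh.rebin accepts the full axis size, as described above):

```python
import boost_histogram as bh

h = bh.histogram(bh.axis.regular(10, 0, 10), bh.axis.regular(10, 0, 10))
h.fill([1.5, 2.5, 3.5], [4.5, 5.5, 6.5])

# Merge all 10 bins of the first axis into a single bin.
# The axis is kept (now with one bin) rather than being removed,
# unlike ::bh.project / ::bh.sum, which drop the axis entirely.
collapsed = h[::bh.rebin(10), :]
```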
The operation is clearly a projection, but the question was if bh.integrate describes the action on each axis better than bh.project.
For example, if you have a 2D histogram h2:
# from <any histogram library> import integrate, end
h2[::integrate, ::integrate] == h2.sum(flow=True)
h2[0:end:integrate, 0:end:integrate] == h2.sum(flow=False)
# from <any histogram library> import project, end
h2[::project, ::project] == h2.sum(flow=True)
h2[0:end:project, 0:end:project] == h2.sum(flow=False)
Note to self: There may be a bug in how the projection axes are collected in the current UHI implementation in boost-histogram; we need to make sure that if all axes are projected, this returns a single value.
I'm in between; I could go with either term. If no one votes to change it, we'll keep project.
What is "UHI"? (I did some searches, but didn't get relevant answers.)
Unified Histogram Indexing, a term invented to cover our proposal for histogram indexing.
Here's a little example so that the current naming can be seen in context:
h2 = bh.histogram(
bh.axis.regular(10, 0, 10),
bh.axis.regular(10, 0, 10),
bh.axis.regular(10, 0, 10),
)
h1 = bh.histogram(bh.axis.regular(10, 0, 10))
contents = [[2, 2, 2, 3, 4, 5, 6], [1, 2, 2, 3, 2, 1, 2], [-12, 33, 4, 9, 2, 4, 9]]
h1.fill(contents[0])
h2.fill(*contents)
assert h1 == h2[:, :: bh.project, :: bh.project]
assert h1 == h2[..., :: bh.project, :: bh.project]
assert h2.sum(flow=True) == h2[:: bh.project, :: bh.project, :: bh.project]
I mentioned it because I hit this problem with coffea histograms. I synonymized integrate with project, when really the desire is different, or at least it should be: project onto or integrate out.
What is "UHI"? (I did some searches, but didn't get relevant answers.)
Maybe for now you should write it out at least once per discussion; I had the same question as @jpivarski in another context. :)
Why the choice between integrate and project? I do not like either choice. Using project as a keyword for the axes to remove is not intuitive for the reasons you mentioned. integrate suggests to me that you actually integrate over the histogram, namely you compute sum(counts * widths), while you actually compute sum(counts). Therefore I like sum as the keyword. Also sum is quick to type.
The downside is: from boost_histogram import sum will override the builtin sum, or sum from numpy, pytorch, etc. And for these sorts of tags, it's going to be one of the most common ways to use it. project and integrate are a bit less common. loc and end might be a bit more common, but are at least not Python builtins or popular numpy functions. underflow and overflow are really rare, I'd expect.

It might not be that bad to actually use Python's sum as the marker, though... We'd just have to decide what to do with at least np.sum - it could be defined that sum from Python does the built-in projection sum, whereas any other callable will do the sum over the bins. So np.sum would work on simple data types, etc.
Namespaces exist because it is impossible to avoid name clashes. I think it is bad to try to avoid them. If we agree that sum is the correct tag name then we should use it. People just have to get used to writing bh.sum, just like they write np.sum. I am strongly opposed to the idea of using the builtin sum and numpy.sum in different ways, assigning them different semantics. I think we should not use sum and numpy.sum as tags.
I think the bin-width argument against integrate is being unfair to users, who all know what their histogram represents: an empirical PDF. From this perspective, storing counts is interchangeable with storing bin-width-normalized counts, and we expect integral to do the same operation regardless of the underlying storage.
I think you do not have the right mental model of a histogram. A histogram is not an empirical PDF. You can convert a histogram into some kind of empirical PDF, but primarily a histogram is a data compression tool. We make histograms to fit them. It is a set of counts in intervals, that's it.
Storing counts is not interchangeable with storing bin-width-normalized counts; the latter requires superfluous computational steps. We will never provide a storage that stores density.
The right approach is to count during filling and convert to a density afterwards - if needed. And this is actually not needed for many use cases, in particular for fitting. Also for fitting, you want sums and not integrals.
This is how you properly fit a histogram: Your histogram has a count n in a cell. To fit your model PDF to that, you integrate the PDF over the cell and scale it by the total number of counts. That gives you a predicted count for the cell. You can now compare the actual count with the predicted count using a Poisson likelihood. At no point in this algorithm do you need the data density.
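As a concrete sketch of that recipe (the Gaussian model, toy data, and helper names here are illustrative, not from this thread):

```python
import numpy as np
from scipy.stats import norm, poisson
from scipy.optimize import minimize

# Bin some toy data; only the counts and bin edges are needed.
counts, edges = np.histogram(np.random.normal(0.0, 1.0, 1000), bins=20, range=(-4, 4))
n_total = counts.sum()

def nll(params):
    mu, sigma = params
    sigma = abs(sigma)  # keep the width positive for the optimizer
    # Predicted count per cell: total counts times the model PDF
    # integrated over the cell (a CDF difference) - no density needed.
    predicted = n_total * np.diff(norm.cdf(edges, mu, sigma))
    return -poisson.logpmf(counts, predicted).sum()

result = minimize(nll, x0=[0.1, 1.2], method="Nelder-Mead")
```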
Sorry, I mean interchangeable in the sense that they encode the same information. sum is fine to me as well.
People often like to visualize a histogram as a data density, e.g. when also plotting the model PDF on the same figure. In that setting, integral seems more natural to me. Of course when doing a fit you need to preserve the Poisson property of binned data, but take for example RooFit, which does extra computational steps to interpret a histogram as a piecewise PDF.
Oh, and also: I realize my view is that of someone who rarely uses analytic models, so to me histograms are mostly used to bin simulation and form the model, hence they seem more like PDFs to me. Data histograms are of course less interpretable as PDFs, given what is done to fit them.
Yes, in physics we often use a histogram as a simple estimator for the PDF, but my point is that this is only one of its uses, and it is not a use for which a histogram is really best suited. Kernel density estimates do a better job at PDF estimation, for example. Histograms are great for fitting, though, because it is easy to model the bin counts (Poisson, no bin correlations)
Current plan:

- sum (from built-ins) will be used as a special "tag" - if you write ::sum, then this will trigger the fast, C++ projection over this axis.
- bh.sum is just the builtin sum.
- np.sum (or any other callable) will be called with the collection to be summed (ND array, one dimension per sum), and the result will be placed into a new histogram with those axes removed. This means np.sum will behave like sum, save for speed/rounding-error differences. (At first, we may make np.sum a tag too, until this is implemented.)
- If the object has a .rebin (name open for discussion) property, then it triggers the rebin machinery instead (as described in the current UHI proposal), and does not remove the axis.

Thoughts/comments?
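A minimal sketch of how this dispatch might look for a single axis (the reduce_axis helper and its internals are illustrative only, operating on a plain array of bin contents rather than a real histogram):

```python
import builtins
import numpy as np

def reduce_axis(counts, axis, tag):
    """Illustrative dispatch for one axis; 'tag' plays the role of the
    slice step in h[..., ::tag, ...]."""
    if tag is builtins.sum:
        # The builtin sum: the fast, built-in projection; the axis is removed.
        return counts.sum(axis=axis)
    if hasattr(tag, "rebin"):
        # An object with a .rebin property: merge groups of bins, keep the axis.
        n = tag.rebin
        shape = list(counts.shape)
        shape[axis] //= n
        shape.insert(axis + 1, n)
        return counts.reshape(shape).sum(axis=axis + 1)
    if callable(tag):
        # Any other callable (np.sum, np.mean, ...) reduces and removes the axis.
        return np.apply_along_axis(tag, axis, counts)
    raise TypeError(f"unsupported tag: {tag!r}")

counts = np.arange(12).reshape(3, 4)
assert np.array_equal(reduce_axis(counts, 1, sum), counts.sum(axis=1))
assert np.array_equal(reduce_axis(counts, 1, np.mean), counts.mean(axis=1))
```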
The histogram __getitem__ is choosing how to interpret objects in its tuple based on their names? (Fourth bullet point, for .rebin.) I guess the intention is to let other libraries develop rebins, but this feels brittle.
I thought the main mode was to treat the objects in the tuple as functions with a particular signature, pass data to them (like old bin spacings) and get data back (like new bin spacings). Then it doesn't matter what the function is named, as there could be several functions with similar purposes and hence different names.
I thought the handling of things like Python's builtin sum was an exception, a special case for convenience, replacing sum with something that has the right signature (though we could probably construe a protocol in which sum already has the right signature).
I'm checking for properties like .value and .offset, for example. I'm not checking the object's class name itself, which would be very brittle.

I do think I made a mistake with the callable, though - you can check whether the returned value is a scalar, and that tells you whether it is a rebin or a sum. If it has a .rebin property, that triggers the built-in C++ rebin with the value the .rebin property is set to.
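A sketch of that scalar-return heuristic (the classify helper is hypothetical, just to make the check concrete):

```python
import numpy as np

def classify(tag, sample_bins):
    """Hypothetical check of a user-supplied object against some bin contents."""
    if hasattr(tag, "rebin"):
        return "rebin"  # triggers the built-in C++ rebin, axis kept
    result = tag(sample_bins)
    # A scalar result means the callable reduces the axis away (a "sum");
    # an array result means it is reshaping the bins (a "rebin").
    return "sum" if np.isscalar(result) else "rebin"

print(classify(np.sum, np.arange(4)))   # -> "sum"
print(classify(np.sqrt, np.arange(4)))  # -> "rebin" (returns an array)
```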
We also might be able to remove the loc property checking soon, actually; a function should have reasonable performance.
What about Jim's suggestion to use strings for all of this? h[a:b:"sum"]
For reference, the proposal was in #152
I really do not like using strings. Reasons: … bh.sum, but not a string.
We could give up and not have a shortcut: require the user to type bh.loc(3.14)
every time. In fact, since this hasn't hit users yet, we're probably overthinking it and we should wait and see if users find the primary method to be too long-winded.
In this issue, we are not talking about the shortcut method. We are talking about "how to sum", the default/correct way - and I think that we can use Python's built-in sum directly, as well as provide it in bh just to make it discoverable.
I think you are both mixing issues, by the way.
I really like to use 5+-dimensional histograms. There's basically no way I can use UHI to reduce them to 1 or 2-d histograms because keeping track of which dimension I'm indexing is really opaque with UHI. Almost certainly I will use h.reduce('axname', slice(loc(3), loc(5))).reduce(...)
Yes, that's why reduce and project are there. UHI is best when you have a transform per dimension that you want to do (or when you can bundle ignored dimensions with ...).
New proposal: Let's rename bh.project to remove.
It does remove an axis, and for a profile, it's not exactly a sum. You can't remove an axis without doing something with it - combining it using the accumulators seems to make sense.
The tag should be called sum. remove is not specific to what is happening with the axis.

~There should be no tag called project!~ Sorry, this got mixed up with our other debate.
We have all agreed to rename project, not sure why you reiterated that here. sum is not very accurate for profile histograms, where it is taking a mean, while remove is better and is clear that it is removing this specific axis. You can always pass any user-defined function (like np.sum or np.mean), which will do a specific process (once full UHI is implemented, and it will happen in Python).
No, remove is bad. There is no hint of what it does. sum is clear when you have normal histograms. It is consistent with what happens when you add histograms, which also combines the mean accumulators in a meaningful way. The guiding idea is that you can combine histograms from sub-datasets by adding them. sum captures that.
Closed by #185.
Note that, once we fully implement UHI, sum will work, whether we like it or not. np.sum will as well - any function that takes an array and returns a scalar is valid. My suggestion was simply to allow the Python sum to trigger the built-in machinery, making it fast and allowing it to work on all storages. (Implementing the full UHI requires that arrays of accumulators work, so #186 or similar is needed.)
Also note that calling bh.sum is identical to calling np.sum, so from boost_histogram import * is no worse than from numpy import *. If we did allow the builtin sum to trigger the shortcut, we could then remove "sum" from __all__ (which we don't have yet).
This is based on some feedback I received in a recent discussion:
In Boost.Histogram, mathematics, and other places, you associate project with the axes you project onto, rather than the axes you remove. The axes you remove are "integrated out". So "h project 1" would be expected to project an Nd histogram onto axis 1, rather than integrate axis 1 out.

With that knowledge, should we provide a different name instead of project for UHI (which needs to move to a separate repository)? UHI's design means this only affects the axis you use it on - h[:, :, ::project] has two remaining axes, not one.

We could provide both, as another option.
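To make the contrast concrete (a sketch; .project here is the existing method that keeps the listed axes, while the slice form is the UHI spelling under discussion):

```python
import boost_histogram as bh

h = bh.histogram(
    bh.axis.regular(10, 0, 10),
    bh.axis.regular(10, 0, 10),
    bh.axis.regular(10, 0, 10),
)

kept = h.project(1)               # keep axis 1; the other axes are integrated out
removed = h[:, :, :: bh.project]  # UHI: remove only axis 2; axes 0 and 1 remain
```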
TL;DR: h[::bh.project] replaced by h[::bh.remove]. bh.project will exist for a while, but provide a DeprecationWarning.

@jpivarski, @HDembinski, @benkrikler, thoughts?