My thoughts:
Project Pros:
Project is more common in histogramming, so users might look for it first. Slightly fewer characters.
Integrate Pros:
This describes exactly what you are doing (integrating over the axis), so it reads slightly better. Even more true when you set limits: 1:4:bh.integrate integrates from bin 1 to bin 4. Boost.Histogram and ROOT have project functions, but there you list the axes you want to keep - the integrate name highlights that here you are picking axes to remove, and keeping axes with :.
After being confused, I'm now slightly in the integrate camp.
Edit: I've been using UHI for so long that I've completely forgotten that Boost.Histogram does this correctly - you give remaining axes, not removed axes, exactly following the usage of the term. The comment I received must have been only about UHI's usage of project. I have no idea what I was thinking. Will correct the description above.
I think that "project" it's the more accurate term, mathematically.
The operation in question does integrate over a dimension of the distribution, but it "integrates completely," that is, from minus infinity to infinity, and it also changes the number of dimensions in the space under consideration. That's a special case of "integrate." I think that the histogramming library also has a "sum" for intermediate cases of integration (and there, "sum" is appropriate because it's limited to discrete measures—you have to start and stop on bin edges). If this sum does cover an entire dimension, it doesn't remove that dimension from consideration; it becomes a one-bin axis instead. (Sometimes, you want that.)
The use of "project" here is like geometrical perspective, or more generally, reducing a space to another space with a smaller number of dimensions. Mathematicians definitely use the word "project" for this—there's a whole field of "projective geometry" about spaces formed by normalizing all distances from a certain point (quantum mechanics, in which all wavefunctions are normalized, is an application of projective geometry; so is naked-eye astronomy, which sees angles but not distances). In linear algebra, if you apply a transformation of rank (n, m) where m < n, which takes an n-dimensional space and returns an m-dimensional space (I hope I haven't got that backward), it's called a projection.
The distinction between "project" and "integrate" is that projection always reduces the number of dimensions; "integration" might if it's from minus infinity to infinity and you choose to ignore the one-bin axis.
So as I understand it, the use of the word "project" is not a compromise, at odds with the mathematical usage, but is fully in line with the mathematical usage.
It does integrate (or project) over a range: min:max:project - if you leave off the values, it goes from -inf to inf. Integration in a mathematical context does remove a dimension - ∫_{-∞}^{∞} f(x, y) dx is a function of only y, regardless of what start/end points you give, such as ∫_2^3 f(x, y) dx. The possible endpoints are limited to bin edges and a finite number of bins is added, so yes, it is equivalent to a sum, but a) sum is a wider term that doesn't tend to be related to removing a dimension, b) we can't control sum, np.sum, and the other sums out there. That's a nice point in favor of project, actually - integrate is not as common as sum, but is more common than project. SciPy has an integrate module, for example.
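For instance, under the UHI slicing discussed in this thread (a sketch only - the exact tag spellings were still in flux at this point):

```python
import boost_histogram as bh

h = bh.histogram(bh.axis.regular(10, 0, 10), bh.axis.regular(10, 0, 10))

# Integrate the x axis out over its full range, keeping only the y axis:
h_y = h[::bh.project, :]

# Integrate the x axis out only between x = 2 and x = 3 (endpoints snap to bin edges):
h_y_partial = h[bh.loc(2):bh.loc(3):bh.project, :]
```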
Sums only work on a whole histogram - sum returns a number, not a histogram.
One remaining bin is a rather odd, special case, and it is fully covered by the existing language: bh.rebin(size), where "size" is the size of the current axis without flow bins. That's one of the benefits of having a composable language like UHI. You could come up with a mapping, bins A to bins B, but that's much harder to write (bin edges must map, etc.) and can be covered by the general UHI rebin language (not supported in boost-histogram, at least not any time soon).
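As a concrete sketch of that special case (assuming bh.rebin accepts the full axis size, as described above):

```python
import boost_histogram as bh

h = bh.histogram(bh.axis.regular(10, 0, 10), bh.axis.regular(10, 0, 10))
h.fill([1.5, 2.5, 3.5], [4.5, 5.5, 6.5])

# Merge all 10 bins of the first axis into a single bin.
# The axis is kept (now with one bin) rather than being removed,
# unlike ::bh.project / ::bh.sum, which drop the axis entirely.
collapsed = h[::bh.rebin(10), :]
```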
The operation is clearly a projection, but the question was if bh.integrate describes the action on each axis better than bh.project.
For example, if you have a 2D histogram h2:
# from <any histogram library> import integrate, end
h2[::integrate, ::integrate] == h2.sum(flow=True)
h2[0:end:integrate, 0:end:integrate] == h2.sum(flow=False)
# from <any histogram library> import project, end
h2[::project, ::project] == h2.sum(flow=True)
h2[0:end:project, 0:end:project] == h2.sum(flow=False)
Note to self: There may be a bug in how the projection axes are collected in the current UHI implementation in boost-histogram; we need to make sure that if all axes are projected, this returns a single value.
I'm in between; I could go with either term. If no one votes to change it, we'll keep project.
What is "UHI"? (I did some searches, but didn't get relevant answers.)
Unified Histogram Indexing, a term invented to cover our proposal for histogram indexing.
Here's a little example so that the current naming can be seen in context:
h2 = bh.histogram(
bh.axis.regular(10, 0, 10),
bh.axis.regular(10, 0, 10),
bh.axis.regular(10, 0, 10),
)
h1 = bh.histogram(bh.axis.regular(10, 0, 10))
contents = [[2, 2, 2, 3, 4, 5, 6], [1, 2, 2, 3, 2, 1, 2], [-12, 33, 4, 9, 2, 4, 9]]
h1.fill(contents[0])
h2.fill(*contents)
assert h1 == h2[:, :: bh.project, :: bh.project]
assert h1 == h2[..., :: bh.project, :: bh.project]
assert h2.sum(flow=True) == h2[:: bh.project, :: bh.project, :: bh.project]
I mentioned it because I hit this problem with coffea histograms. I synonymized integrate with project, when really the desire is different, or at least it should be: project onto or integrate out.
What is "UHI"? (I did some searches, but didn't get relevant answers.)
Maybe for now you should write it out at least once per discussion; I had the same question as @jpivarski in another context. :)
Why the choice between integrate and project? I do not like either choice. Using project as a keyword for the axes to remove is not intuitive for the reasons you mentioned. integrate suggests to me that you actually integrate over the histogram, namely you compute sum(counts * widths), while you actually compute sum(counts). Therefore I like sum as the keyword. Also sum is quick to type.
The downside is: from boost_histogram import sum will override the builtin sum, or sum from numpy, pytorch, etc. And for these sorts of tags, it's going to be one of the most common ways to use it. project and integrate are a bit less common. loc and end might be a bit more common, but are at least not Python builtins or popular numpy functions. underflow and overflow are really rare, I'd expect.

It might not be that bad to actually use Python's sum as the marker, though... We'd just have to decide what to do with at least np.sum - it could be defined that sum from Python does the built-in projection sum, whereas any other callable will do the sum over the bins. So np.sum would work on simple data types, etc.
Namespaces exist because it is impossible to avoid name clashes. I think it is bad to try to avoid them. If we agree that sum is the correct tag name then we should use it. People just have to get used to writing bh.sum, just like they write np.sum. I am strongly opposed to the idea of using the builtin sum and numpy.sum in different ways, assigning them different semantics. I think we should not use sum and numpy.sum as tags.
I think the bin-width argument against integrate is being unfair to users, who all know what their histogram represents: an empirical PDF. From this perspective, storing counts is interchangeable with storing bin-width-normalized counts, and we expect integral to do the same operation regardless of the underlying storage.
I think you do not have the right mental model of a histogram. A histogram is not an empirical PDF. You can convert a histogram into some kind of empirical PDF, but primarily a histogram is a data compression tool. We make histograms to fit them. It is a set of counts in intervals, that's it.
Storing counts is not interchangeable with storing bin-width-normalized counts; the latter requires superfluous computational steps. We will never provide a storage that stores density.
The right approach is to count during filling and convert to a density afterwards - if needed. And this is actually not needed for many use cases, in particular for fitting. Also for fitting, you want sums and not integrals.
This is how you properly fit a histogram: Your histogram has a count n in a cell. To fit your model PDF to that, you integrate the PDF over the cell and scale it by the total number of counts. That gives you a predicted count for the cell. You can now compare the actual count with the predicted count using a Poisson likelihood. At no point in this algorithm do you need the data density.
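As a concrete sketch of that recipe (the Gaussian model, toy data, and helper names here are illustrative, not from this thread):

```python
import numpy as np
from scipy.stats import norm, poisson
from scipy.optimize import minimize

# Bin some toy data; only the counts and bin edges are needed.
counts, edges = np.histogram(np.random.normal(0.0, 1.0, 1000), bins=20, range=(-4, 4))
n_total = counts.sum()

def nll(params):
    mu, sigma = params
    sigma = abs(sigma)  # keep the width positive for the optimizer
    # Predicted count per cell: total counts times the model PDF
    # integrated over the cell (a CDF difference) - no density needed.
    predicted = n_total * np.diff(norm.cdf(edges, mu, sigma))
    return -poisson.logpmf(counts, predicted).sum()

result = minimize(nll, x0=[0.1, 1.2], method="Nelder-Mead")
```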
Sorry, I mean interchangeable in the sense that they encode the same information. sum is fine to me as well.
People often like to visualize a histogram as a data density, e.g. when also plotting the model PDF on the same figure. In that setting, integral seems more natural to me. Of course when doing a fit you need to preserve the Poisson property of binned data, but take for example RooFit, which does extra computational steps to interpret a histogram as a piecewise PDF.
Oh, and also: I realize my view is that of someone who rarely uses analytic models, so to me histograms are mostly used to bin simulation and form the model, hence they seem more like PDFs to me. Data histograms are of course less interpretable as PDFs, given what is done to fit them.
Yes, in physics we often use a histogram as a simple estimator for the PDF, but my point is that this is only one of its uses, and it is not a use for which a histogram is really best suited. Kernel density estimates do a better job at PDF estimation, for example. Histograms are great for fitting, though, because it is easy to model the bin counts (Poisson, no bin correlations)
Current plan:

- sum (from built-ins) will be used as a special "tag" - if you write ::sum, then this will trigger the fast, C++ projection over this axis.
- bh.sum is just the builtin sum.
- np.sum (or any other callable) will be called with the collection to be summed (ND array, one dimension per sum), and the result will be placed into a new histogram with those axes removed. This means np.sum will behave like sum, save for speed/rounding-error differences. (At first, we may make np.sum a tag too, until this is implemented.)
- If the object has a .rebin (name open for discussion) property, then it triggers the rebin machinery instead (as described in the current UHI proposal), and does not remove the axis.

Thoughts/comments?
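A minimal sketch of how this dispatch might look for a single axis (the reduce_axis helper and its internals are illustrative only, operating on a plain array of bin contents rather than a real histogram):

```python
import builtins
import numpy as np

def reduce_axis(counts, axis, tag):
    """Illustrative dispatch for one axis; 'tag' plays the role of the
    slice step in h[..., ::tag, ...]."""
    if tag is builtins.sum:
        # The builtin sum: the fast, built-in projection; the axis is removed.
        return counts.sum(axis=axis)
    if hasattr(tag, "rebin"):
        # An object with a .rebin property: merge groups of bins, keep the axis.
        n = tag.rebin
        shape = list(counts.shape)
        shape[axis] //= n
        shape.insert(axis + 1, n)
        return counts.reshape(shape).sum(axis=axis + 1)
    if callable(tag):
        # Any other callable (np.sum, np.mean, ...) reduces and removes the axis.
        return np.apply_along_axis(tag, axis, counts)
    raise TypeError(f"unsupported tag: {tag!r}")

counts = np.arange(12).reshape(3, 4)
assert np.array_equal(reduce_axis(counts, 1, sum), counts.sum(axis=1))
assert np.array_equal(reduce_axis(counts, 1, np.mean), counts.mean(axis=1))
```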
The histogram __getitem__ is choosing how to interpret objects in its tuple based on their names? (Fourth bullet point, for .rebin.) I guess the intention is to let other libraries develop rebins, but this feels brittle.
I thought the main mode was to treat the objects in the tuple as functions with a particular signature, pass data to them (like old bin spacings) and get data back (like new bin spacings). Then it doesn't matter what the function is named, as there could be several functions with similar purposes and hence different names.
I thought the handling of things like Python's builtin sum was an exception, a special case for convenience, replacing sum with something that has the right signature (though we could probably construe a protocol in which sum already has the right signature).
I'm checking for properties like .value and .offset, for example. I'm not checking the object's class name itself, which would be very brittle.

I do think I made a mistake with the callable, though - you can check whether the returned value is a scalar, and that tells you whether it is a rebin or a sum. If it has a .rebin property, that triggers the built-in C++ rebin with the value the .rebin property is set to.
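A sketch of that scalar-return heuristic (the classify helper is hypothetical, just to make the check concrete):

```python
import numpy as np

def classify(tag, sample_bins):
    """Hypothetical check of a user-supplied object against some bin contents."""
    if hasattr(tag, "rebin"):
        return "rebin"  # triggers the built-in C++ rebin, axis kept
    result = tag(sample_bins)
    # A scalar result means the callable reduces the axis away (a "sum");
    # an array result means it is reshaping the bins (a "rebin").
    return "sum" if np.isscalar(result) else "rebin"

print(classify(np.sum, np.arange(4)))   # -> "sum"
print(classify(np.sqrt, np.arange(4)))  # -> "rebin" (returns an array)
```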
We also might be able to remove the loc property checking soon, actually; a function should have reasonable performance.
What about Jim's suggestion to use strings for all of this? h[a:b:"sum"]
For reference, the proposal was in #152
I really do not like using strings. Reasons: … bh.sum, but not a string.
We could give up and not have a shortcut: require the user to type bh.loc(3.14)
every time. In fact, since this hasn't hit users yet, we're probably overthinking it and we should wait and see if users find the primary method to be too long-winded.
In this issue, we are not talking about the shortcut method. We are talking about "how to sum", the default/correct way - and I think that we can use Python's built-in sum directly, as well as provide it in bh just to make it discoverable.
I think you are both mixing issues, by the way.
I really like to use 5+-dimensional histograms. There's basically no way I can use UHI to reduce them to 1 or 2-d histograms because keeping track of which dimension I'm indexing is really opaque with UHI. Almost certainly I will use h.reduce('axname', slice(loc(3), loc(5))).reduce(...)
Yes, that's why reduce and project are there. UHI is best when you have a transform per dimension that you want to do (or when you can bundle ignored dimensions with ...).
New proposal: Let's rename bh.project to remove.
It does remove an axis, and for a profile, it's not exactly a sum. You can't remove an axis without doing something with it - combining it using the accumulators seems to make sense.
The tag should be called sum. remove is not specific to what is happening with the axis.

~There should be no tag called project!~ Sorry, this got mixed up with our other debate.
We have all agreed to rename project, not sure why you reiterated that here. sum is not very accurate for profile histograms, where it is taking a mean, while remove is better and is clear that it is removing this specific axis. You can always pass any user-defined function (like np.sum or np.mean), which will do a specific process (once full UHI is implemented, and it will happen in Python).
No, remove is bad. There is no hint of what it does. sum is clear when you have normal histograms. It is consistent with what happens when you add histograms, which also combines the mean accumulators in a meaningful way. The guiding idea is that you can combine histograms from sub-datasets by adding them. sum captures that.
Closed by #185.
Note that, once we fully implement UHI, sum will work, whether we like it or not. np.sum will as well - any function that takes an array and returns a scalar is valid. My suggestion was simply to allow the Python sum to trigger the built-in machinery, making it fast and allowing it to work on all storages. (Implementing the full UHI requires that arrays of accumulators work, so #186 or similar is needed.)
Also note that calling bh.sum is identical to calling np.sum, so from boost_histogram import * is no worse than from numpy import *. If we did allow the builtin sum to trigger the shortcut, we could then remove "sum" from __all__ (which we don't have yet).
This is based on some feedback I received in a recent discussion:
In Boost.Histogram, mathematics, and other places, you associate project with the axes you project onto, rather than the axes you remove. The axes you remove are "integrated out". So "h project 1" would be expected to project an Nd histogram onto axis 1, rather than integrate axis 1 out.

With that knowledge, should we provide a different name instead of project for UHI (which needs to move to a separate repository)? UHI's design means this only affects the axis you use it on - h[:, :, ::project] has two remaining axes, not one.

We could provide both, as another option.
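To make the contrast concrete (a sketch; .project here is the existing method that keeps the listed axes, while the slice form is the UHI spelling under discussion):

```python
import boost_histogram as bh

h = bh.histogram(
    bh.axis.regular(10, 0, 10),
    bh.axis.regular(10, 0, 10),
    bh.axis.regular(10, 0, 10),
)

kept = h.project(1)               # keep axis 1; the other axes are integrated out
removed = h[:, :, :: bh.project]  # UHI: remove only axis 2; axes 0 and 1 remain
```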
TL;DR: h[::bh.project] replaced by h[::bh.remove]. bh.project will exist for a while, but provide a DeprecationWarning.

@jpivarski, @HDembinski, @benkrikler, thoughts?