Support for multiple weight variations

scikit-hep / boost-histogram

Python bindings for the C++14 Boost::Histogram library

https://boost-histogram.readthedocs.io

BSD 3-Clause "New" or "Revised" License

Support for multiple weight variations #83

Open douglasdavis opened 5 years ago

douglasdavis commented 5 years ago

A feature I recently implemented in pygram11 allows an array of input data to be histogrammed with multiple weight variations in a single function call. In ATLAS (and I'm sure most HEP experiments) we carry around a lot of MC generator weights and derive uncertainties from many scale factor variations. In the downstream parts of an analysis we end up comparing a histogrammed variable using one set of weights to the same variable histogrammed with a different weight variation. For a histogram with non-fixed-width binning there's actually a nice speedup from doing the binary search for the bin index once per entry instead of repeating it for every weight variation.

I'm not sure how this might fit into the object-focused design of boost-histogram, but I just wanted to mention it as an idea for a feature, perhaps if there ends up being some kind of factory to create histograms from a function call which has the data to be histogrammed as an argument.
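The speedup described above can be illustrated with a minimal pure-Python sketch. It is not pygram11's or boost-histogram's implementation; the function name `fill_weight_variations` and its arguments are hypothetical, and under/overflow bins are left out for brevity. The point is only that the bin index is looked up once per entry and then reused for every weight variation.

```python
from bisect import bisect_right
from typing import List, Sequence


def fill_weight_variations(
    edges: Sequence[float],
    data: Sequence[float],
    weight_sets: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return one list of per-bin weight sums for each weight variation.

    `edges` are the sorted bin edges of a variable-width axis. Entries
    outside [edges[0], edges[-1]) are dropped (no under/overflow bins).
    Hypothetical helper, for illustration only.
    """
    n_bins = len(edges) - 1
    sums = [[0.0] * n_bins for _ in weight_sets]
    for i, x in enumerate(data):
        # One binary search per entry, shared by every weight variation.
        b = bisect_right(edges, x) - 1
        if 0 <= b < n_bins:
            for v, weights in enumerate(weight_sets):
                sums[v][b] += weights[i]
    return sums


edges = [0.0, 0.1, 0.4, 1.0]
data = [0.05, 0.2, 0.7, 0.3]
weight_sets = [
    [1.0, 1.0, 1.0, 1.0],  # nominal weights
    [2.0, 0.5, 1.5, 1.0],  # one weight variation
]
result = fill_weight_variations(edges, data, weight_sets)
print(result)  # [[1.0, 2.0, 1.0], [2.0, 1.5, 1.5]]
```

With N weight variations, this does one lookup per entry where N separate fills would do N.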

HDembinski commented 5 years ago

If I understand you correctly, you want to store not only one sum of weights for each histogram cell, but several sums of weights. It is possible to achieve this with the current machinery, by writing a special cell accumulator and using the sample argument for the fill method.

> perhaps if there ends up being some kind of factory to create histograms from a function call which has the data to be histogrammed as an argument.

The new fill method allows you to pass arrays with the input data. It is not a factory, so you first generate the histogram and then pass the data, but we could easily add a function which does these two steps with a single call.

HDembinski commented 5 years ago

I will think about adding an example to the user guide about this. I don't have many examples yet for custom accumulators that use samples.

HDembinski commented 5 years ago

Thinking more about it, it is probably even more straightforward to support this. The default weight type in boost::histogram is double, but this can be replaced by another multi-weight type. The type has to be additive, multiplicative, and broadcastable.

HDembinski commented 5 years ago

@douglasdavis How many weights do you need to keep track of simultaneously? 2, 4, 16, ...?

henryiii commented 5 years ago

I think @douglasdavis simply wants this (could be wrong):

h = bh.histogram(bh.axis.regular(10,0,1), storage=bh.storage.weight)
h.fill([1,2,3], weight=[1,1,1])
h.fill([1,2,3], weight=[2,2,2])
h.fill([1,2,3], weight=[3,3,3])

To be expressible as:

h = bh.histogram(bh.axis.regular(10,0,1), storage=bh.storage.weight)
h.fill([1,2,3], weights=[[1,1,1], [2,2,2], [3,3,3]])

(And, if the axis types are variable or similar, the idea is that this would save repeating the costly bin lookup for each set of weights.) I think the storage is still just a standard double or weight storage.

HDembinski commented 5 years ago

That is also what I understood, but with the machinery we have right now, you can only have one weight.

HDembinski commented 5 years ago

You need a special storage for that case, bh.storage.weight doesn't work. In each histogram cell, you need to keep track of three weight sums instead of one. The cell size is larger.
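As a rough illustration of what "several weight sums per cell" means, here is a small Python sketch. The class `MultiWeightCell` is hypothetical and is not part of boost-histogram or boost::histogram; it only shows the idea of a cell that carries one running sum per weight variation and can be merged with another cell of the same shape.

```python
from typing import List, Sequence


class MultiWeightCell:
    """Hypothetical histogram cell: one running sum of weights per variation."""

    def __init__(self, n_variations: int) -> None:
        self.sums: List[float] = [0.0] * n_variations

    def fill(self, weights: Sequence[float]) -> None:
        """Add one entry, given its weight under each variation."""
        if len(weights) != len(self.sums):
            raise ValueError("expected one weight per variation")
        for k, w in enumerate(weights):
            self.sums[k] += w

    def __iadd__(self, other: "MultiWeightCell") -> "MultiWeightCell":
        """Merge another cell, e.g. when adding two histograms."""
        if len(other.sums) != len(self.sums):
            raise ValueError("cells track a different number of variations")
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]
        return self


cell = MultiWeightCell(3)
cell.fill([1.0, 2.0, 3.0])
cell.fill([1.0, 2.0, 3.0])

other = MultiWeightCell(3)
other.fill([0.5, 0.5, 0.5])
cell += other
print(cell.sums)  # [2.5, 4.5, 6.5]
```

In this sketch the memory per cell grows linearly with the number of variations, which is why the cell size is larger than with a single-weight storage.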

HDembinski commented 5 years ago

It is a good idea to support this, but this is something for later, not for the first release.

HDembinski commented 5 years ago

It also needs some changes to boost::histogram to work in boost-histogram.

douglasdavis commented 5 years ago

Hi Hans & Henry, thanks for taking a look at this!

It looks like we're on the same page about the idea I was proposing. I think an additional storage type (as mentioned by Hans), such that a single histogram can have multiple different counts (one counts array for each weight variation), makes sense as well.

HDembinski commented 5 years ago

@douglasdavis Could you please say how many different weights you want to track simultaneously? (see my previous question regarding that)

douglasdavis commented 5 years ago

Hi @HDembinski, for my specific use case it has varied; I've had up to N=200 different sets of weight variations used to calculate N histograms of the same raw distribution. I envisioned the feature as taking any number of weight variations.

HDembinski commented 5 years ago

Wow, thanks, that was an important design input for the implementation.

nsmith- commented 3 years ago

I want to second this feature request. In fact, some use cases in EFT re-interpretation require several thousand weight variations per bin.

klannon commented 2 years ago

I just wanted to ask about the prospects for this issue. This is something my analysis team is waiting for.

NJManganelli commented 2 years ago

Josh Bendavid had an interesting talk on integrating boost::histogram with RDF, and comparing different implementations for dealing with the pdf (weight-based) variations (hopefully he doesn't mind me linking this here). There was an addendum on where to place that axis, too (first vs last): https://indico.cern.ch/event/1127096/