Open douglasdavis opened 5 years ago
If I understand you correctly, you want to store not just one sum of weights for each histogram cell, but several sums of weights. It is possible to achieve this with the current machinery by writing a special cell accumulator and using the sample argument of the fill method.
> perhaps if there ends up being some kind of factory to create histograms from a function call which has the data to be histogrammed as an argument.
The new fill method allows you to pass arrays with the input data. It is not a factory, so you first generate the histogram and then pass the data, but we could easily add a function which does these two steps with a single call.
I will think about adding an example to the user guide about this. I don't have many examples yet for custom accumulators that use samples.
Thinking more about it, it is probably even more straightforward to support this. The default weight type in boost::histogram is double, but this can be replaced by another multi-weight type. The type has to be additive, multiplicative, and broadcastable.
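To make the three requirements concrete, here is a minimal pure-Python sketch of what such a multi-weight type could look like. MultiWeight is a hypothetical name, not a boost::histogram or boost-histogram type, and "broadcastable" is interpreted here as "a plain number combines with every component", which is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultiWeight:
    """Hypothetical fixed-length bundle of weights (illustration only)."""

    values: Tuple[float, ...]

    def _coerce(self, other):
        # Broadcasting (assumed meaning): a plain number is applied
        # to every component of the bundle.
        if isinstance(other, (int, float)):
            return MultiWeight(tuple(float(other) for _ in self.values))
        if len(other.values) != len(self.values):
            raise ValueError("weight bundles must have the same length")
        return other

    def __add__(self, other):
        # Additive: component-wise sum, used when accumulating into a cell.
        o = self._coerce(other)
        return MultiWeight(tuple(a + b for a, b in zip(self.values, o.values)))

    def __mul__(self, other):
        # Multiplicative: component-wise product, e.g. for scaling.
        o = self._coerce(other)
        return MultiWeight(tuple(a * b for a, b in zip(self.values, o.values)))

    # Allow `2 * w` and `0 + w` as well.
    __radd__ = __add__
    __rmul__ = __mul__


# Accumulate two entries into one cell that tracks three weight variations.
cell = MultiWeight((0.0, 0.0, 0.0))
cell = cell + MultiWeight((1.0, 2.0, 3.0))
cell = cell + 2 * MultiWeight((1.0, 2.0, 3.0))
```

A real implementation would live on the C++ side and would likely use a fixed-size or contiguous buffer rather than a tuple; the sketch only shows the arithmetic a cell needs.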
@douglasdavis How many weights do you need to keep track of simultaneously? 2, 4, 16, ...?
I think @douglasdavis simply wants this (could be wrong):
h = bh.histogram(bh.axis.regular(10,0,1), storage=bh.storage.weight)
h.fill([1,2,3], weight=[1,1,1])
h.fill([1,2,3], weight=[2,2,2])
h.fill([1,2,3], weight=[3,3,3])
To be expressible as:
h = bh.histogram(bh.axis.regular(10,0,1), storage=bh.storage.weight)
h.fill([1,2,3], weights=[[1,1,1], [2,2,2], [3,3,3]])
(And if the axis types are variable-width or similar, the idea is that this would save costly bin lookups, since each lookup is done once per entry instead of once per weight set.) I think the storage is still just a standard double or weight storage.
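To illustrate the semantics being requested, here is a stdlib-only sketch that does not use boost-histogram at all. The names find_bin and fill_multi are hypothetical, and there are no underflow or overflow bins in this simplified version. The point is that the binary search over variable-width edges runs once per entry, and the resulting bin index is reused for every weight variation.

```python
from bisect import bisect_right


def find_bin(edges, x):
    """Return the bin index of x, or None if x is outside [edges[0], edges[-1])."""
    if x < edges[0] or x >= edges[-1]:
        return None
    # Binary search; bins are closed on the lower edge, open on the upper edge.
    return bisect_right(edges, x) - 1


def fill_multi(edges, xs, weights):
    """Histogram xs once per weight variation.

    weights[k][i] is the weight of entry i under variation k.
    Returns one list of per-bin weight sums for each variation.
    """
    n_bins = len(edges) - 1
    sums = [[0.0] * n_bins for _ in weights]
    for i, x in enumerate(xs):
        b = find_bin(edges, x)  # single lookup per entry
        if b is None:
            continue
        for k, w in enumerate(weights):
            sums[k][b] += w[i]  # reuse the same bin index for every variation
    return sums


edges = [0.0, 0.5, 2.0, 10.0]  # variable-width bins
xs = [0.1, 1.0, 1.5, 3.0, 12.0]  # the last entry falls outside the axis
weights = [[1.0] * 5, [2.0] * 5, [3.0] * 5]

result = fill_multi(edges, xs, weights)
```

Calling fill_multi with a single weight list at a time gives the same sums, but repeats the lookup for every variation.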
That is also what I understood, but with the machinery we have right now, you can only have one weight.
You need a special storage for that case; bh.storage.weight doesn't work. In each histogram cell, you need to keep track of three weight sums instead of one. The cell size is larger.
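A rough sketch of why the cell grows, using only the stdlib array module. The layout is an assumption made for illustration: each variation keeps a sum of weights and a sum of squared weights (as a weight storage does for its single weight), stored contiguously per bin. make_storage and add are hypothetical helpers, not boost-histogram API.

```python
from array import array


def make_storage(n_bins, n_weights):
    """Flat buffer of doubles: per bin, per variation, (sum w, sum w^2)."""
    return array("d", [0.0]) * (n_bins * n_weights * 2)


def add(storage, n_weights, bin_index, weights):
    """Accumulate one entry with n_weights weights into the given bin."""
    base = bin_index * n_weights * 2
    for k, w in enumerate(weights):
        storage[base + 2 * k] += w  # sum of weights for variation k
        storage[base + 2 * k + 1] += w * w  # sum of squared weights for variation k


single = make_storage(n_bins=10, n_weights=1)  # one weight per cell
multi = make_storage(n_bins=10, n_weights=200)  # 200 variations per cell

# The buffer, and hence each cell, is 200 times larger.
ratio = (len(multi) * multi.itemsize) // (len(single) * single.itemsize)

small = make_storage(n_bins=2, n_weights=3)
add(small, 3, 1, [1.0, 2.0, 3.0])
```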
It is a good idea to support this, but this is something for later, not for the first release.
It also needs some changes to boost::histogram to work in boost-histogram.
Hi Hans & Henry, thanks for taking a look at this!
It looks like we're on the same page about the idea I was proposing. I think an additional storage type (as mentioned by Hans), such that a single histogram can have multiple different counts (one counts array for each weight variation), makes sense as well.
@douglasdavis Could you please say how many different weights you want to track simultaneously? (see my previous question regarding that)
Hi @HDembinski, for my specific use case it has varied; I've had up to N=200 different sets of weight variations used to calculate N histograms of the same raw distribution. I envisioned the feature as taking any number of weight variations.
Wow, thanks, that is an important design input for the implementation.
I want to second this feature request. In fact, some use cases in EFT re-interpretation require several thousand weight variations per bin.
I just wanted to ask about the prospects for this issue. This is something my analysis team is waiting for.
Josh Bendavid had an interesting talk on integrating boost::histogram with RDF, and comparing different implementations for dealing with the pdf (weight-based) variations (hopefully he doesn't mind me linking this here). There was an addendum on where to place that axis, too (first vs last): https://indico.cern.ch/event/1127096/
A feature I recently implemented in pygram11 allows an array of input data to be histogrammed with multiple weight variations in a single function call. In ATLAS (and I'm sure most HEP experiments) we carry around a lot of MC generator weights and derive uncertainties from many scale factor variations. In the downstream parts of an analysis we end up comparing a histogrammed variable using one set of weights to the same variable histogrammed with a different weight variation. For a histogram with non-fixed-width binning there's actually a nice speedup from avoiding repeating the binary search to grab the necessary bin index.
I'm not sure how this might fit into the object-focused design of boost-histogram, but I just wanted to mention it as an idea for a feature, perhaps if there ends up being some kind of factory to create histograms from a function call which has the data to be histogrammed as an argument.