scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Efficiency diagram #53

Closed: weidenka closed this issue 5 years ago

weidenka commented 5 years ago

Hi, I would like to produce a graph/histogram similar to ROOT's TEfficiency, so I have two data samples, var_total and var_pass. I thought one could probably use the cut axis, but I didn't succeed. Furthermore, a correct error estimate (e.g. binomial errors) would be necessary. Any idea how to do this with histbook?

weidenka commented 5 years ago

Ok, here is my current solution:

import numpy as np
from histbook import *
from vega import VegaLite as canvas   # in a Jupyter notebook; vegascope also works

n = 10000                             # sample size (any value)
pt = np.random.randn(n) + 5
ps = np.random.randint(0, 2, n)       # 0/1 "passed" flags
e = Hist(bin('pt', 50, 0, 10), profile('eff'))
e.fill(pt=pt, eff=ps)
beside(e.step('pt'), e.marker('pt', 'eff')).to(canvas)

Still, I have the problem that the errors are not binomial.

jpivarski commented 5 years ago

If the pass and total are coming from two different sources, then no, you can't use a cut axis. I hadn't realized that this is a use-case, because with two sources you can't as easily control for accidentally lost events. The different axes guarantee that numerator and denominator have the same normalization; if they're in two different sources, you have to take care to avoid losses in whatever system is producing the two sources.

But okay, supposing you've done that, now what?

There are two techniques for merging pre-filled histograms from different sources. One (group) turns the sources into categories in a categorical axis, while the other lets you mask out whole histograms (which are additive, unlike axes, which are multiplicative) during filling. Neither of these helps in your particular case.

I am working on another iteration of the histogram API right now, so it's good for me to know that there are uses like this that need to be made possible, but that doesn't help you right now.

Do you know about Physt? After writing histbook, I learned about more histogram tools in this ecosystem and have turned toward writing connectors to use them all together. Physt might have something that will help.

At worst, here's a hack: introduce a dummy variable q and define a cut axis that cuts on q (that's the full expression; it will be a boolean variable). You have a numerator source and a denominator source. First fill with the denominator source, q=numpy.zeros(length, bool), and weight=1. Since all of these q are false, they'll fill only the denominator. Now fill with the numerator source, q=numpy.ones(length, bool), and weight=1, which fills both the histogram's numerator and denominator with your numerator source. The denominator is now double-filled with your numerator events, so fill with the numerator source again, but with q=numpy.zeros(length, bool) and weight=-1. This removes the double-counted numerator sample from the histogram's denominator but not from the histogram's numerator.
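
Concretely, a minimal sketch of that three-fill sequence (assuming histbook's cut axis and a weight expression named w; the arrays pt_total and pt_pass below are hypothetical stand-ins for your two sources):

import numpy as np
from histbook import *

pt_total = np.random.randn(10000) + 5       # denominator source
pt_pass = np.random.randn(4000) + 5         # numerator source

h = Hist(bin("pt", 50, 0, 10), cut("q"), weight="w")

# 1) denominator source: q is all False, so only the denominator fills
h.fill(pt=pt_total, q=np.zeros(len(pt_total), bool), w=np.ones(len(pt_total)))

# 2) numerator source: q is all True, filling numerator AND denominator
h.fill(pt=pt_pass, q=np.ones(len(pt_pass), bool), w=np.ones(len(pt_pass)))

# 3) numerator source again, q all False with weight -1, cancelling the
#    double-counted entries in the denominator (but not the numerator)
h.fill(pt=pt_pass, q=np.zeros(len(pt_pass), bool), w=-np.ones(len(pt_pass)))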

Needless to say, it wasn't supposed to be this complicated. Traditionally, we've had all these free-floating histograms (like the two in TEfficiency), filled them all separately, and then managed the bookkeeping of combining them correctly after the fact. The idea of histbook is to have the filling process fill them correctly and manage that bookkeeping for you. However, if you have samples designed with the first model in mind, namely that you'll be combining them after the fact, then having the computer manage the bookkeeping becomes an impediment, because now you have to "fool" the computer, as we've done in the paragraph above. I need to design a system that is usable both ways.

weidenka commented 5 years ago

Ok, I see the problem. Actually my problem is not as complicated as I described before, so let's simplify the example from above:

import numpy as np
import pandas as pd
from histbook import *
from vega import VegaLite as canvas   # in a Jupyter notebook; vegascope also works

n = 10000
data = pd.DataFrame(columns=['pt', 'found'])
data['pt'] = np.random.randn(n) + 5
data['found'] = [x > 0.3 for x in np.random.rand(n)]   # ~70% pass probability
e = Hist(bin('pt', 50, 0, 10), profile('found'))
e.fill(pt=data['pt'], found=data['found'])
e.marker('pt', 'found').to(canvas)

[image: visualization 1]

Now I still have the problem with the bin-to-bin uncertainties. I tried to export the histogram to a DataFrame and calculate the uncertainties manually:

d = e.pandas('found')   # export the profile histogram to a DataFrame
# calculating binomial errors: sigma = (1/N) * sqrt(N * e * (1 - e))
d['err(found)'] = 1/d['count()'] * np.sqrt(d['count()'] * d['found'] * (1 - d['found']))
print(d.head(20))
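
For what it's worth, the same expression simplifies algebraically, since (1/N) * sqrt(N * e * (1 - e)) = sqrt(e * (1 - e) / N), so an equivalent one-liner (same columns assumed) would be:

d['err(found)'] = np.sqrt(d['found'] * (1 - d['found']) / d['count()'])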

How can I get this column back to the histogram in order to draw it?

You suggested using Physt, and you mentioned that you are a bit unsure about the future of histbook. My opinion is that currently there is no proper (Python) alternative to ROOT's TH1, except perhaps histbook. I miss some features in histbook, but in principle I like the approach, which makes it very convenient to plot a large bunch of histograms at once. I had a quick look at Physt, but so far I see little advantage compared to simply using numpy.histogram + matplotlib.

jpivarski commented 5 years ago

Oh! You're free to construct the data as you'd like! Well, in that case, please do buy into the paradigm of having the axis apply the cut for you (which is not exactly what you did in your example above: a profile is not an efficiency; see below).

I tried to walk through the "intended" method for making this sort of plot. What I found were several indications of incomplete work:

  1. A very simple bug prevented the creation of fraction tables: it tries to remove profiles even when no profiles are present. That has been fixed in histbook 1.2.5 (just released).
  2. There is code for computing fractions/efficiencies, but it hasn't been carried all the way through to the plotting stage. As it is, you can get a DataFrame of fractions with fractional errors, but not a Vega plot.
  3. Only "normal" (which assumes a binomial distribution whose errors shrink to 0 at 0% and 100%) and "wilson" (an easy-to-compute correction, but not the physics-recommended one) have been implemented; a sketch of both follows this list. The rest ("clopper-pearson", "agresti-coull", "feldman-cousins", "jeffrey", and "bayesian-uniform") have been stubbed out, waiting for the formulas to be added to the code.
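
For reference, here is a standalone transcription of those two estimates from the standard formulas (my own sketch, not histbook's internal code), for k passes out of n entries and z = 1 for 1-sigma intervals:

import numpy as np

def normal_error(k, n):
    # normal approximation: sqrt(e*(1-e)/n); degenerates to 0 at e = 0 or 1
    e = k / n
    return np.sqrt(e * (1.0 - e) / n)

def wilson_interval(k, n, z=1.0):
    # Wilson score interval: recentered, never zero-width at 0% or 100%
    e = k / n
    center = (e + z**2 / (2.0 * n)) / (1.0 + z**2 / n)
    halfwidth = (z / (1.0 + z**2 / n)) * np.sqrt(e * (1.0 - e) / n + z**2 / (4.0 * n**2))
    return center - halfwidth, center + halfwidth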

Here's how I managed to use it (with the bug-fix):

%matplotlib inline
import pandas
import numpy
from histbook import *

data = pandas.DataFrame(columns=["pt", "found"])
data["pt"] = numpy.random.randn(10000) + 5
data["found"] = (numpy.random.randn(10000) > 0.3)     # do vectorized Numpy operations!

h = Hist(bin("pt", 50, 0, 10), cut("found"))
h.fill(data)

# get a DataFrame with "count()", "found", and "err(found)" as columns
# this DataFrame is the histogram (indexes are intervals)
df = h.pandas("found", error="normal")

# to plot it, however, we need a column of midpoints of those interval indexes
df["midpoints"] = [x[0].mid if isinstance(x[0], pandas.Interval) else numpy.nan for x in df.index]

# now scatter plot with error bars
df.plot.scatter(x="midpoints", y="found", yerr="err(found)")

[image: output]

About the future of this package: it will at least be rewritten. With uproot, it was pretty clear what the package needed to do and how it should be internally organized (though that went through one major revision, from v1 to v2). Histogramming is both a field crowded with alternatives (though you and one other user have expressed doubts about Physt, and Boost.Histogram's Python bindings aren't here yet) and a case where it's less clear what the package ought to do. HBOOK, PAW, and ROOT have a way of doing histograms in languages where it's not a performance bottleneck to fill one entry at a time in a loop. What's the right calling convention when filling whole arrays at once? The histograms need to do some calculations internally, since they now get a whole dataset or a chunk of a dataset at a time; they can't just take precalculated data in a fill method.
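
To make the calling-convention contrast concrete, here is a rough sketch in plain Numpy (not any particular library's API):

import numpy as np

data = np.random.randn(10000) + 5

# one entry at a time (HBOOK/PAW/ROOT style): cheap in Fortran or C++,
# but a serious bottleneck as a Python loop
counts = np.zeros(50)
for x in data:
    if 0.0 <= x < 10.0:
        counts[int(x / 0.2)] += 1          # 50 bins of width 0.2 on [0, 10)

# a whole array at a time: the binning happens inside one vectorized call
counts2, edges = np.histogram(data, bins=50, range=(0.0, 10.0))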

Moreover, is the HBOOK/PAW/ROOT way the right way anymore? That kind of API encourages the creation of many small objects, each a 1-, 2-, or 3-dimensional binning of the data, and profiles are a completely separate case from visualizing distributions (the legacy of the HPROF package). That was great for analyses like the ones I did as a grad student, in which all the data I wanted to fit were in a single histogram, but now physicists are performing massively combined fits, using many histograms to specify signal and control regions and systematic variations of the same distribution. Hundreds of little histograms that have to be filled in different ways then have to be gathered up and put into the right places in the fit; the CMS Combine tool's configuration is getting very complicated.

histbook attempts to solve some of that, but we're going to need to make big changes to iterate toward the right solution. Instead of building "the right" histogramming library, I think it makes more sense at this point to make the right histogram representation so that users can move between histogramming libraries without manual translations at the border. So I've been focusing on the smaller problem of just representing histograms in as general a way as possible: not filling, adding, or plotting. (See https://github.com/diana-hep/histos, a histogram protocol in development using Google Flatbuffers.) Boost.Histogram will likely be the best histogram-filler, but it remains to be seen what is the most convenient way to manipulate them or plot them.

Anything that goes into histbook will be raw material for that development. If, for instance, you want to add the Clopper-Pearson error calculation to histbook as a pull request, I'll accept it and it will make it that much easier to integrate it into future histogramming libraries. I just don't want to get your hopes up that histbook itself will be a fixed syntax going forward.
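
For anyone who takes that up, a minimal standalone sketch of the Clopper-Pearson calculation in its usual beta-quantile form (my own transcription using scipy, not histbook's internal API) could look like:

import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, cl=0.6827):
    # exact (Clopper-Pearson) binomial interval for k passes out of n trials;
    # cl=0.6827 approximates the 1-sigma coverage usually drawn as error bars
    alpha = 1.0 - cl
    lo = np.where(k > 0, beta.ppf(alpha / 2.0, k, n - k + 1), 0.0)
    hi = np.where(k < n, beta.ppf(1.0 - alpha / 2.0, k + 1, n - k), 1.0)
    return lo, hi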

weidenka commented 5 years ago

Thanks a lot for your work and your detailed explanations.