scikit-hep / aghast

Aghast: aggregated, histogram-like statistics, sharable as Flatbuffers.
BSD 3-Clause "New" or "Revised" License

Binomial error or Poisson error #13

Closed: HDembinski closed this issue 5 years ago

HDembinski commented 5 years ago

It says in the docs:

"The error_method does not specify how the histograms or functions were filled, but how the fraction should be interpreted statistically. It may be unspecified, leaving that interpretation unspecified. The normal method (sometimes called “Wald”) is a naive binomial interpretation, in which zero passing or zero failing values are taken to have zero uncertainty. The clopper_pearson method (sometimes called “exact”) is a common choice, though it fails in some statistical criteria. The computation and meaning of the methods are described in the references below."

If Aghast is only about exchanging data, is it necessary to specify the error_method? This is not part of the data, but part of the interpretation. You follow here the assumption that histogram counts should be modeled as binomial proportions, but in HEP that usually does not make sense. We use the Poisson model not only because it is easier, but because it is the better approximation of reality.

Example. You want to measure the differential cross-section of some process p+p -> X as a function of a kinematic variable, let's say eta. You count how often a particle X falls into an eta bin, call that \Delta N. The differential cross-section then is:

d\sigma / d\eta = (1/L) \cdot \Delta N / \Delta \eta

where L is the luminosity. I use "=" here, although strictly speaking the equality requires the bin width to go to zero.

You care about the uncertainty of d\sigma / d\eta. You don't know the total sum of all \Delta N here, because you have not observed all X, only those in the acceptance of your detector, so you cannot compute the binomial proportion error. The measurement also doesn't depend on the fraction of events in each bin, but on the number itself, which you divide by L rather than by the sum over \Delta N. The Poisson model is more correct here.

It is not exactly correct either, because the bins are not really independent, for a physics reason: you observe K particles per event, and these K particles are not generated independently; they form jets and have some correlations. We usually neglect these correlations. They are not correctly represented by either Poisson or binomial proportion errors.
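
The Poisson treatment of a single eta bin described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `dsigma_deta` and the numbers are hypothetical, the luminosity and bin width are treated as exact, and correlations between bins are neglected.

```python
import math

def dsigma_deta(delta_n, luminosity, delta_eta):
    """Return (value, uncertainty) of dsigma/deta for one eta bin.

    Assumes delta_n is Poisson-distributed, so its standard deviation is
    estimated as sqrt(delta_n). Luminosity and bin width carry no error here.
    """
    norm = luminosity * delta_eta
    return delta_n / norm, math.sqrt(delta_n) / norm

# Hypothetical inputs: 400 counts, L = 10 (inverse cross-section units), bin width 0.5
value, error = dsigma_deta(400, 10.0, 0.5)
print(value, error)
```

Note that the total number of produced X never enters the calculation, which is why no binomial proportion can be formed.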

jpivarski commented 5 years ago

I think one of the options for this field is "no interpretation." It doesn't need to be filled.

Some of the data in these objects are necessary for the various histogram → histogram conversions it needs to perform. Some only set the context, like this one. A data analyst would strongly desire to attach such information, as it would simplify bookkeeping, and much of it may be information we haven't thought of: there are JSON-valued metadata fields for that. But some metadata are significant enough to have a special form, so that each user doesn't invent their own format.

Also, I'm not sure you're interpreting FractionBinning right: it's for "efficiency plots." The counts in each corresponding bin of the numerator and denominator are highly dependent, but neighboring bins are almost always independent. If neighboring bins are not independent, then all of the error methods listed here would be inapplicable.
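
For the efficiency-plot case, the two interpretations named in the quoted docs can be compared with a small sketch. It assumes a single bin with `k` passing entries out of `n` total, and a roughly 68% (one-sigma) interval. The function names are hypothetical and are not part of the Aghast API. The Clopper-Pearson bounds are found by bisecting the binomial CDF using only the standard library; a statistics package would normally be used instead.

```python
import math

def wald_interval(k, n, z=1.0):
    """Normal-approximation ("Wald") interval for k passing out of n."""
    p = k / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

def _solve(k, n, target):
    """Bisect for p in [0, 1] where binom_cdf(k, n, p) equals target.

    Relies on the CDF decreasing monotonically in p.
    """
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson_interval(k, n, z=1.0):
    """Clopper-Pearson ("exact") interval with the two-sided coverage of z sigma."""
    alpha = 1.0 - math.erf(z / math.sqrt(2.0))  # two-sided tail probability
    # Lower bound solves P(X >= k | p) = alpha / 2; it is 0 when nothing passes.
    lower = 0.0 if k == 0 else _solve(k - 1, n, 1.0 - alpha / 2.0)
    # Upper bound solves P(X <= k | p) = alpha / 2; it is 1 when everything passes.
    upper = 1.0 if k == n else _solve(k, n, alpha / 2.0)
    return lower, upper

for passed, total in [(0, 10), (3, 10), (10, 10)]:
    print(passed, total, wald_interval(passed, total),
          clopper_pearson_interval(passed, total))
```

With zero passing or zero failing entries the Wald interval collapses to zero width, as the docs say, while the Clopper-Pearson interval stays finite.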