scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

Specialized Books #19

Closed jpivarski closed 6 years ago

jpivarski commented 6 years ago

@lukasheinrich: is the common numerical unit "number of sigmas" or "confidence level"? There's a monotonic function between them (assuming Gaussian). Number of sigmas is easier to work with because it ranges over the whole real line, but confidence level is more semantically appropriate for probability distributions that aren't Gaussian. For instance, if the probability distribution has sufficiently heavy polynomial tails, it doesn't even have a standard deviation to make the translation.
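Under the Gaussian assumption, the monotonic mapping in question could be sketched like this (the function name is illustrative, not part of histbook's API):

```python
import math

def sigmas_to_cl(n):
    """Two-sided Gaussian coverage of +-n sigma: CL = erf(n / sqrt(2))."""
    return math.erf(n / math.sqrt(2.0))

# The inverse direction needs erfinv, which is not in the standard
# library; scipy.special.erfinv would do it:
#   cl_to_sigmas = lambda cl: math.sqrt(2.0) * erfinv(cl)

print(sigmas_to_cl(1))  # ~0.6827
print(sigmas_to_cl(2))  # ~0.9545
```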

lukasheinrich commented 6 years ago

Right now the space of fit parameters is R^N without any bounds (bounds are applied in the fit, but these are mostly arbitrary). We do indeed scale that space by number of sigmas (e.g. up/nominal/down are at 1, 0, -1 respectively), but I think this is mostly an implementation detail. It is the job of the analyzer to specify at which floating-point value (in whatever units s/he determines are best) the histograms live. I think for histbook it's enough to use R^n or string^n and leave the units to the user (or alternatively fix a unit system, in which case it's the job of the user to provide the right floats).

There might also be a non-trivial condition on the validity of the interpolation/extrapolation, but that is probably the job of the interpolation to handle (e.g. only some subspace of R^n might be valid); trying to request a histogram outside of that subvolume should raise an error/exception.
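One way the validity condition could work is a domain object that guards requests, e.g. a simple box in R^n. This is purely a hypothetical sketch; none of these names come from histbook:

```python
import numpy as np

class BoxDomain:
    """Hypothetical valid subvolume of R^n, here just an axis-aligned box."""

    def __init__(self, lows, highs):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)

    def check(self, point):
        """Return the point if it is inside the box, else raise."""
        point = np.asarray(point, dtype=float)
        if np.any(point < self.lows) or np.any(point > self.highs):
            raise ValueError(
                "requested point {} is outside the valid interpolation "
                "region [{}, {}]".format(point, self.lows, self.highs))
        return point

domain = BoxDomain([-3.0, -3.0], [3.0, 3.0])
domain.check([1.0, -2.0])      # inside: returned unchanged
# domain.check([5.0, 0.0])     # outside: would raise ValueError
```

More complicated validity regions (ellipsoids, unions of boxes) would slot in behind the same `check` interface.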

jpivarski commented 6 years ago

The reason I ask is that one of histbook's methods, Hist.fraction, makes an efficiency plot with a choice of binomial statistics (e.g. Wilson, Clopper-Pearson, Feldman-Cousins, etc.). One of the inputs to this calculation is a user-specified "number of sigmas" or "confidence level" at which the user would like the asymmetric error bounds quoted. So far, I've been using confidence level because it's conceptually more relevant for arbitrary probability distributions, but it means that the default value is erf(sqrt(0.5)) ≈ 0.6827 (one sigma).
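For concreteness, here is how a confidence-level-parameterized binomial interval might look, using the Wilson score interval as the example. This is an illustrative sketch, not histbook's actual implementation; the z-score is recovered from the confidence level by inverting CL = erf(z/sqrt(2)), with a small bisection standing in for scipy.special.erfinv to keep the sketch dependency-free:

```python
import math

def z_from_cl(cl, tol=1e-12):
    """Invert CL = erf(z / sqrt(2)) by bisection on z in [0, 10]."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def wilson_interval(k, n, cl=math.erf(math.sqrt(0.5))):
    """Wilson score interval for k successes in n trials; default CL is one sigma."""
    z = z_from_cl(cl)
    p = k / float(n)
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_interval(75, 100)  # one-sigma (default) asymmetric bounds
```

The asymmetry falls out naturally: for p away from 0.5 the interval is not centered on k/n, which is the point of using Wilson over the naive normal approximation.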

I'm trying to decide on the right "units" for the statistical error space. If you're using number of sigmas n in R^N and nobody's complained that computing confidence levels with CL = erf(n/sqrt(2)) goes through a Gaussian assumption that might not be valid, then I'll do it, too. R^N is much easier to work with than (0, 1)^N, and if a Gaussian assumption is not valid, then we can just say that erf(n/sqrt(2)) is a "convenient transformation function" that we use to scale confidence levels.

jpivarski commented 6 years ago

Apart from the discussion about sigmas versus confidence levels, this is actually #20, which is done.