Open bplimley opened 7 years ago
Good question, and I don't know what we should do here. As much of a fan of the `uncertainties` package as I am, in recent years I have become very aware of the potential danger behind implicit Gaussian assumptions, and assigning uncertainties with this package and using it for (Gaussian) error propagation is exactly that. I would vote for some convention that would cause the least confusion. I know that `uncertainties` cannot handle asymmetric error bars, so if we go this route we would have to write it ourselves.
I like the discussion in [1] and think it would be neat if we could support plots like the "rainbow" plots they show in Figure 2.
Whatever we decide here, one of the things I would like us to support is using a Poisson maximum likelihood estimator for our fits. This would be appropriate for all N, including 0, not just the N >~ 20 regime where the Gaussian approximation works well. If n_i are the data and y_i the model prediction, then for a Gaussian we would minimize

\sum_i \frac{(n_i - y_i)^2}{n_i}

(i.e., the chi-squared), but for a Poisson distribution we should minimize

\sum_i (y_i - n_i \log y_i)
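As a sketch of the difference between the two objectives, here is a minimal example fitting a constant-rate model to low-count data (the constant model, seed, and variable names are illustrative, not project code):

```python
# Sketch: fitting a constant model y_i = mu to Poisson counts n_i,
# comparing the chi-squared and Poisson log-likelihood objectives above.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = rng.poisson(lam=3.0, size=50)  # low-count data, where Gaussian stats break down

def chi2(mu):
    # sum_i (n_i - y_i)^2 / n_i; undefined for n_i = 0, so zero-count
    # bins must be dropped -- exactly the problem with this approach
    mask = n > 0
    return np.sum((n[mask] - mu) ** 2 / n[mask])

def poisson_nll(mu):
    # sum_i (y_i - n_i log y_i), valid for all n_i including 0
    return np.sum(mu - n * np.log(mu))

mu_chi2 = minimize_scalar(chi2, bounds=(0.1, 20), method="bounded").x
mu_pois = minimize_scalar(poisson_nll, bounds=(0.1, 20), method="bounded").x
# For a constant model the Poisson MLE is exactly the sample mean,
# while the chi-squared estimate tends to be biased low at small counts.
print(mu_chi2, mu_pois, n.mean())
```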
All that sounds good.
One approach would be to have a module-level option for the type of uncertainties used, Gaussian or at least one Poisson-friendly option. I'm not clear on how error propagation works for non-Gaussian intervals; does that get really ugly for some of these Poisson-friendly uncertainties?
For now, how about we use the `uncertainties` package for Gaussian uncertainties so we can have something simple to work with?
Yes, error propagation is ugly for non-Gaussian uncertainties. At worst it can involve Monte-Carlo sampling the random variables involved to determine the PDF of the quantity you are trying to add uncertainties to.
Using `uncertainties` shouldn't be a substitute for using the correct statistics, but it will be essentially correct for (hopefully) most cases and will at the very least remind the user that the quantity has an implicit statistical interpretation.
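The Monte Carlo alternative mentioned above can be sketched in a few lines: sample the input random variable and read the interval straight off the empirical distribution of the derived quantity, with no Gaussian assumption (the rate and livetime here are illustrative):

```python
# Monte Carlo propagation sketch for a low-count Poisson quantity.
import numpy as np

rng = np.random.default_rng(2)
n_samples = rng.poisson(lam=3, size=100_000)  # samples of the raw count
livetime = 60.0                               # seconds (assumed exact here)
rate_samples = n_samples / livetime
# Asymmetric 68% interval directly from the samples
lo, med, hi = np.percentile(rate_samples, [16, 50, 84])
print(lo, med, hi)
```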
I vote for using `sqrt(N)`, and 1 for N=0, because `uncertainties` cannot handle asymmetric error bars anyway.
I thought about this some more and realized something interesting about the Poisson log likelihood:
L(n, y) = \sum_i (y_i - n_i \log y_i)
If the measured spectrum (n_i) and the model (y_i) are both multiplied by a constant a (e.g., dividing by the livetime and/or the bin width), then we can equivalently minimize the same log-likelihood function with the scaled variables:
L(an, ay) = \sum_i (a y_i - a n_i \log (a y_i)) \\
= \sum_i (a y_i - a n_i \log y_i - a n_i \log a) \\
= a \sum_i (y_i - n_i \log y_i) - a \log a \sum_i n_i \\
= a L(n, y) - constant
So we could use this to our advantage. Surprisingly, this method doesn't even require that we keep the original n_i or estimate their uncertainties; it is enough to know that the n_i are Poisson-distributed, up to a scaling factor.
Note that this does not seem to apply if different scaling factors are applied to each measurement, such as dividing by varying bin widths. It also doesn't apply if another measurement has been subtracted from each n_i (e.g., background subtraction).
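The scaling identity can be checked numerically; here is a minimal sketch with a constant model and an illustrative scale factor:

```python
# Check: scaling both data n_i and model y by a constant a shifts the
# Poisson log-likelihood by a constant (independent of y), so the
# minimizing model is unchanged.
import numpy as np

rng = np.random.default_rng(1)
n = rng.poisson(lam=5.0, size=20).astype(float)
y = np.linspace(1.0, 10.0, 200)   # candidate constant models
a = 0.25                           # e.g., 1 / livetime

def L(n, y):
    # L(n, y) = sum_i (y - n_i log y), vectorized over candidate y values
    return y * len(n) - n.sum() * np.log(y)

L1 = L(n, y)            # unscaled
L2 = L(a * n, a * y)    # both data and model scaled by a
# Same argmin, and L2 = a*L1 - a*log(a)*sum(n) pointwise
print(y[np.argmin(L1)], y[np.argmin(L2)])
```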
For counts N, we commonly use `sqrt(N)` as the 1-sigma uncertainty, which is valid for Gaussian statistics (N >> 0). However, in spectra we often deal with counts that are low or zero. Specifically, `sqrt(N)` gives an uncertainty of 0 in the case of 0 counts.

Personally, in these situations I have used a suggestion I found in this Fermilab text article and adopted

±0.5 + sqrt(N + 0.25)

Another solution could be to use `sqrt(N)` for N > 0, but specify an uncertainty of 1 in the case of N = 0.

How do others handle this?
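The options on the table so far can be compared side by side; a minimal sketch (array values are illustrative):

```python
# Three conventions for the 1-sigma uncertainty on a count N:
#   1. sqrt(N)                      -- gives 0 at N = 0 (the problem)
#   2. sqrt(N) with a floor of 1    -- ad hoc fix for N = 0
#   3. +/-0.5 + sqrt(N + 0.25)      -- asymmetric Fermilab approximation
import numpy as np

N = np.array([0, 1, 4, 100])

sigma_sqrt = np.sqrt(N)
sigma_floor = np.where(N > 0, np.sqrt(N), 1.0)
sigma_upper = 0.5 + np.sqrt(N + 0.25)
sigma_lower = -0.5 + np.sqrt(N + 0.25)

for row in zip(N, sigma_sqrt, sigma_floor, sigma_lower, sigma_upper):
    print(row)
```

Note that the asymmetric option gives an upper error of 1 and a lower error of 0 at N = 0, and approaches sqrt(N) ± 0.5 for large N.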