stcorp / harp

Data harmonization toolset for scientific earth observation data
http://stcorp.github.io/harp/doc/html/index.html
BSD 3-Clause "New" or "Revised" License

Add uncertainty propagation #13

Open svniemeijer opened 8 years ago

svniemeijer commented 8 years ago

All algorithms in HARP should be extended such that they can propagate the associated uncertainties of the inputs to an uncertainty on the output.

This should be performed automatically and as part of the algorithms themselves. It should not be necessary to separately call functions for both <quantity> and <quantity>_uncertainty.

One of the main references for the definition of uncertainty is the Guide to the expression of uncertainty in measurement (GUM)

When propagating uncertainties one should take correlations into account. Correlations between uncertainties of different quantities are almost never represented in data products; the only correlations captured are auto-correlations (linking the same measured quantity at different positions across a certain dimension (time, horizontal, vertical)).

In practice there are only two ways in which auto-correlations get described. There is the auto-correlation along the vertical axis for atmospheric retrievals (which correlates uncertainties between different height levels), which is covered by a covariance matrix. And there is the distinction between fully uncorrelated and fully correlated when considering the time or horizontal spatial domain (used when averaging measurements over time or across an area). This distinction between fully uncorrelated (auto-correlation factor of 0) and fully correlated (auto-correlation factor of 1) is in practice often described with the terms 'uncertainty due to random effect' and 'uncertainty due to systematic effect'. However, an auto-correlation factor of 1 is just one of the systematic effects that could be present (i.e. a horse is an animal, but not all animals are horses). The usage of 'systematic effect' is therefore not ideal if other systematic effects also need to be captured later on.

Since the auto-correlation factor impacts the uncertainty propagation in 'averaging' routines, any calculated uncertainty should thus maintain information on which part is auto-correlated and also in which way.

It is to be determined whether this tracing of the different auto-correlated components is to be maintained using the current convention of having separate uncertainty_random and uncertainty_systematic variables or whether another approach is to be used. The downside of keeping separate systematic and random components is that it also requires supporting a third variable, the total uncertainty. And the total uncertainty is actually the quantity that users will want to use in the end. For this reason it may be more useful to go for a solution that is based on having the total uncertainty as one variable and having a second variable to indicate the systematic part of this uncertainty (either by means of an uncertainty_systematic or by means of an auto-correlation indicator). The random component of the uncertainty is then provided implicitly and will have to be calculated by propagation routines if it is needed explicitly. For instance, we could have a correlation factor c and total uncertainty uncertainty such that uncertainty_systematic = sqrt(c)*uncertainty and uncertainty_random = sqrt(1-c)*uncertainty.
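A minimal sketch of this split, assuming a hypothetical helper function (the names are illustrative, not part of the current HARP interface):

```c
#include <math.h>

/* Hypothetical helper: split a total uncertainty into its systematic and
 * random parts using an auto-correlation factor c in [0, 1], such that
 *   uncertainty_systematic = sqrt(c) * uncertainty
 *   uncertainty_random     = sqrt(1 - c) * uncertainty
 * and uncertainty^2 = uncertainty_systematic^2 + uncertainty_random^2. */
static void split_uncertainty(double uncertainty, double c,
                              double *uncertainty_systematic,
                              double *uncertainty_random)
{
    *uncertainty_systematic = sqrt(c) * uncertainty;
    *uncertainty_random = sqrt(1.0 - c) * uncertainty;
}
```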

An implementation solution to consider is to extend the harp_variable in the C library interface to also contain this uncertainty information (e.g. have a harp_array uncertainty and some other component(s) that store the correlation information). Having uncertainty together with the data will greatly ease the propagation of uncertainties in HARP. Storage of such information into netcdf/hdf4/hdf5 will however still require that uncertainty information gets stored into separate variables. But this could be solved by a slightly more complex mapping between in-memory harp products and harp product content stored on disk (the python interface could still map more closely to the C representation though). This seems to have some links to the concept of uncertain numbers. The terminology used in that article may be useful to better define uncertainty propagation.
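A rough sketch of what such an extension could look like (the struct and field names below are purely illustrative and not the actual harp_variable definition from harp.h):

```c
#include "harp.h"   /* for harp_array */

/* Illustrative only: an in-memory variable extended with uncertainty
 * information; the real harp_variable layout is defined by the HARP C
 * library itself. */
typedef struct variable_with_uncertainty_struct
{
    /* ... existing harp_variable members (name, dimensions, data, unit, ...) ... */
    harp_array uncertainty;     /* same shape as the data itself */
    double correlation_factor;  /* auto-correlation factor c in [0, 1] */
    /* optionally: a covariance matrix describing correlation along the vertical axis */
} variable_with_uncertainty;
```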

The exact formula to use for the uncertainty propagation is still to be determined. The most common approach is to use the linearized form. However, this is not the same thing as requiring a linearized form of the function for the quantity itself.

What is needed is that the derived variable functions calculate two things at once: the quantity and its associated uncertainty. The formulas for the quantity and the uncertainty can differ in terms of assumptions regarding approximations (i.e. non-linear for the quantity and linear for the uncertainty).
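As an illustration, a generic sketch (not an existing HARP routine) of a derived-variable function that returns both values, using the exact formula for the quantity and first-order propagation for the uncertainty, assuming uncorrelated inputs:

```c
#include <math.h>

/* Sketch: y = x1 * x2 computed exactly, while the uncertainty uses the
 * linearized propagation
 *   sigma_y^2 = (dy/dx1 * sigma_x1)^2 + (dy/dx2 * sigma_x2)^2
 * with the uncertainties of x1 and x2 assumed uncorrelated. */
static void derive_product(double x1, double sigma_x1,
                           double x2, double sigma_x2,
                           double *y, double *sigma_y)
{
    *y = x1 * x2;
    *sigma_y = sqrt((x2 * sigma_x1) * (x2 * sigma_x1) +
                    (x1 * sigma_x2) * (x1 * sigma_x2));
}
```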

Another choice that will need to be made for the combined algorithm is whether an inherent bias that gets introduced by a non-linear formula for the quantity (due to its uncertainty) is to be corrected for in the quantity. For instance, the operation x^2 introduces a bias of sigma^2 (and there are similar biases when using 1/x). Performing such bias correction is controversial, but just ignoring the bias is not correct either (especially if the uncertainty is large, which is often the case for atmospheric remote sensing L2 data).
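For the x^2 example the bias follows from E[x^2] = mu^2 + sigma^2 when x has mean mu and standard uncertainty sigma. A minimal sketch of the (optional, and as noted controversial) correction:

```c
/* Since E[x^2] = mu^2 + sigma^2, reporting x*x directly overestimates the
 * squared quantity by sigma^2; this sketch subtracts that bias again. */
static double square_bias_corrected(double x, double sigma_x)
{
    return x * x - sigma_x * sigma_x;
}
```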

blangerock commented 8 years ago

It may be impossible to calculate the random uncertainty from the total and systematic uncertainty, e.g. if the random component is very small compared to the systematic one. Depending on the precision (single or double float), the systematic and total uncertainty can be equal and the random component is not seen at all.
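A small self-contained illustration of this effect (the numbers are hypothetical):

```c
#include <math.h>
#include <stdio.h>

/* If the random component is much smaller than the systematic one, the total
 * and systematic uncertainty become indistinguishable in single precision,
 * so the random component can no longer be recovered as
 * sqrt(total^2 - systematic^2). */
int main(void)
{
    float systematic = 1.0f;
    float random = 1.0e-5f;
    float total = sqrtf(systematic * systematic + random * random);

    /* prints: total = 1, recovered random = 0 */
    printf("total = %.9g, recovered random = %.9g\n",
           total, sqrtf(total * total - systematic * systematic));
    return 0;
}
```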

The derived variable function, if non-linear, should also contain its linearized form to propagate the uncertainties.

svniemeijer commented 8 years ago

If the random uncertainty is that much smaller than the systematic uncertainty then it does not matter that it is 'invisible', since there is no operation that will reduce the systematic uncertainty such that the random uncertainty will start to matter. Performing propagation operations (when kept linear) will also not cause problems if the random component is treated as near-zero.

The opposite is problematic though, since averaging will reduce the random uncertainty, but retain the systematic uncertainty. This means that a small systematic uncertainty can still become dominant when averaging enough samples. For this reason it is best to keep the systematic part explicit.
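A minimal sketch of this behaviour for an unweighted average of n independent samples (not an existing HARP function):

```c
#include <math.h>

/* When averaging n equally weighted samples, the random (uncorrelated) part
 * of the uncertainty scales with 1/sqrt(n), while the systematic (fully
 * correlated) part is retained as-is. */
static double averaged_uncertainty(double uncertainty_random,
                                   double uncertainty_systematic, long n)
{
    double random_part = uncertainty_random / sqrt((double)n);

    return sqrt(random_part * random_part +
                uncertainty_systematic * uncertainty_systematic);
}
```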

blangerock commented 8 years ago

I wrote this comment with the recent LIDAR template in mind: there the dominant term is the random component. When reporting the random and total uncertainty in single precision (as was proposed in their template), one could not determine the systematic uncertainty, which was important when averaging measurements.

I would propose to work with the random/systematic split for another reason: to determine a confidence interval, one needs the mean error (systematic) and std of the error (random).

What is actually needed right now is a good mathematical definition of systematic and random. I would really push for the above mathematical definition. From the other post, I saw that you see any correlation as a systematic uncertainty, which is different from my interpretation.

StevenCompernolle commented 8 years ago

A small terminology remark: the article by Hall is about the concept of "uncertain numbers", not "uncertainty numbers", since we are talking about 'quantities with an associated uncertainty', not only about the uncertainty itself.

svniemeijer commented 8 years ago

Description updated to say 'uncertain numbers'