kahaaga / UncertainData.jl

Working efficiently with datasets where observations have associated uncertainties.
https://kahaaga.github.io/UncertainData.jl/dev
MIT License

relationship to Measurements.jl? #24

Open mkborregaard opened 5 years ago

mkborregaard commented 5 years ago

Just curious - what's the relationship/overlap/complementarity of this package and the uncertainty machinery in https://github.com/JuliaPhysics/Measurements.jl?

kahaaga commented 5 years ago

Thanks for bringing Measurements.jl to my attention; it had completely slipped my radar. I’ll make sure to include a link to it in the documentation in the next release!

From what I can see, the main difference between the packages is that Measurements.jl always assumes that the uncertainty is a standard deviation (i.e. that values are normally distributed), and that it focuses on the propagation of errors through computations using linear error propagation theory.
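
For instance, here is a minimal sketch of the Measurements.jl approach (a simple example of my own; measurement(value, stddev) and the ± operator are part of its API):

using Measurements

x = measurement(2.0, 0.1)  # 2.0 ± 0.1
y = measurement(3.0, 0.2)  # 3.0 ± 0.2

# For independent inputs, the uncertainty of the product propagates linearly:
# sqrt((3.0 * 0.1)^2 + (2.0 * 0.2)^2) = 0.5
x * y  # 6.0 ± 0.5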

In practice, however, values are often not normally distributed. The purpose of UncertainData.jl is mainly to have a consistent way of treating uncertain values of all types, be it theoretical distributions with known parameters (e.g. a normal distribution), populations with weighted probabilities, or more complex empirical distributions that have to be approximated by a kernel density estimate. With UncertainData.jl, the user should be able to mix all types of uncertain values and resample the data (possibly subject to some sampling constraints) consistently, regardless of uncertainty type.

A user can use this to, for example, visualize uncertainties of different types in a consistent manner by applying the same constraint across observations (see the example below). Other use cases might be to compute some ensemble statistic of your uncertain dataset over a large number of independent realizations (which you get by resampling the dataset).

Thus, the main difference between this package and Measurements.jl is that we focus on flexibility in data representation and easy resampling of different types of uncertain data. I'd use Measurements.jl if my data were normally distributed and my goal was to perform some computation and get exact uncertainties using linear error propagation theory. Measurements.jl also handles functional correlation and some linear algebra operations, which is very neat. If I'm interested in ensemble statistics over more complex uncertain datasets, I'd use the approach in this package: for statistical computations, you just resample your data a bunch of times, then obtain statistics which themselves have some uncertainty. This might or might not be sufficient for your needs, depending on how rigorous you want to be.
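
As a minimal sketch of that resample-then-estimate approach (assuming, as in the examples below, that resample accepts an uncertain value and a number of draws):

using UncertainData, Statistics

uval = UncertainValue(Normal, 2.0, 0.3)

# Draw many independent realizations, then compute the statistic over the draws;
# the spread of the draws quantifies the uncertainty of the estimate.
draws = resample(uval, 10000)
mean(draws), std(draws)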

For example, I'm currently integrating this package with CausalityTools.jl to measure, in a consistent manner, information flow between time series where both the time indices and the data values are uncertain and follow different distributions (asking, for example: is there asymmetric information flow between two time series even when I take dating errors and experimental error into consideration?).

I was motivated to create this package because in my field, working with paleoclimate and biological proxy records, data values often come from different sources and have uncertainties that follow all kinds of distributions (often multimodal). Also, measurements often have uncertainties both in the data index (time, for example) and in the data value, which have to be treated separately (functionality for handling this is coming in a future release). To avoid re-inventing the wheel for every new dataset I work with, I decided to make this package instead, so I can do data analysis with uncertain data in a consistent manner.

In the future, the aim is to improve the resampling machinery by integrating with Interpolations.jl, provide options for data imputation, support uncertain vectors (i.e. values represented by multivariate distributions), add plot recipes that accept sampling constraints directly, and provide some way to import data from spreadsheets.

In summary: with this package, you may save time when working with values having many different types of uncertainties, and uncertainty estimates on computations are obtained by resampling. Measurements.jl, on the other hand, propagates errors exactly and supports (I think) most or all of base Julia's math, and might thus be a better choice if your application calls for it, if your data justify it (i.e. they are normally distributed), and/or if uncertainties are correlated.

I hope this clarifies things. Comments and suggestions are most welcome!

Example 1

In this contrived example, I have a bunch of different observations from different distributions, some with known parameters, some without, and I want to plot realizations falling within the 30th-to-70th percentile ranges of the distributions.

using UncertainData
using Distributions, KernelDensity
using Plots

# Define some theoretical distributions
dμ, dσ = Uniform(-5, 5), Uniform(0.1, 3)
dα, dβ = Uniform(2, 3), Uniform(4, 9)
normally_distributed_vals = [UncertainValue(Normal, rand(dμ), rand(dσ)) for i = 1:10]
beta_distributed_vals = [UncertainValue(Beta, rand(dα), rand(dβ)) for i = 1:10]

# Create some samples obeying some complicated distribution and represent 
# the samples by kernel density estimates.
M1 = MixtureModel([Normal(1, 0.5), Gamma(0.3, 5), Normal(2, 0.3)])
M2 = MixtureModel([Gamma(3, 4), Beta(3, 4), Frechet(4, 0.3)])
n = 1000
empirical_samples = [rand(M1, n), rand(M2, n)]
kde_dists = UncertainValue.(empirical_samples)

# Gather everything in an uncertain dataset
uvals = [normally_distributed_vals; beta_distributed_vals; kde_dists]
udata = UncertainDataset(uvals)

# Plot 1000 realizations of the uncertain dataset
plot(resample(udata, 1000), lc = :black, lα = 0.01, legend = false)

# Resample 1000 new realizations, but restrict each observation to the 30th-to-70th 
# percentile range of its support
plot!(resample(udata, TruncateQuantiles(0.3, 0.7), 1000), 
    lc = :red, lα = 0.01, legend = false)

[Figure: 1000 realizations of the uncertain dataset (black), overlaid with realizations constrained to the 30th-to-70th percentile range of each observation (red)]

Example 2

Another contrived example: Following the example above, let's say I had two uncertain datasets.

udata1 = UncertainDataset([beta_distributed_vals; normally_distributed_vals; kde_dists])
udata2 = UncertainDataset([normally_distributed_vals; beta_distributed_vals; kde_dists])

I now want to compute the Pearson correlation between the datasets. To do this I can generate, say, 10000 realizations of each dataset and compute the correlation between each pair of realizations.

corrs = cor(udata1, udata2, 10000)

We can take this further. Maybe I had good reason not to trust outliers in my dataset. In this case, I could for example restrict the values each observation could take to the 25th-to-75th percentile range.

corrs = cor(udata1, udata2, 10000, TruncateQuantiles(0.25, 0.75))

Maybe I want to disregard outliers for most observations, but not the first five:

constraints = [[NoConstraint() for i = 1:5]; [TruncateQuantiles(0.2, 0.8) for i = 1:length(udata1)-5]]
constrained1 = constrain(udata1, constraints)
constrained2 = constrain(udata2, constraints)

corrs = cor(constrained1, constrained2, 10000)

From this, I can estimate the correlation by the mean of the pairwise correlations, and its uncertainty by their standard deviation.

μ, σ = mean(corrs), std(corrs)

There are probably better ways of doing this, but the examples at least illustrate how the package works.

mkborregaard commented 5 years ago

Awesome, thanks for the response. This looks really nice, certainly very useful. I'm a biologist myself and agree that uncertainty is too often ignored. It almost sounds like Measurements.jl could improve by incorporating/depending on this (I have no relationship to that library).

rofinn commented 5 years ago

It'd be nice if we could define a common interface for resampling methods across the stats ecosystem, so that we could plug in various other approaches (e.g., Jackknife.jl, Bootstrap.jl). Maybe StatsBase.jl should define resample([f::Function, method::AbstractResampling], data) -> AbstractSampling that other packages could extend?

kahaaga commented 5 years ago

A common resampling interface is an excellent idea! I've sketched out some ideas below - is this roughly what you had in mind, @rofinn?

Common interface (in StatsBase.jl)

""" 
    AbstractResampling

Abstract type for resampling methods.
""" 
abstract type AbstractResampling end

"""
    AbstractSampling

Abstract type for the result of a resampling.
"""
abstract type AbstractSampling end

""" 
    resample

Resampling interface.
""" 
function resample end 

"""
    resample(f::Function, data, method::AbstractResampling) -> AbstractSampling

Draw subsets/realizations of `data` using the provided resampling `method` and 
apply the point estimator `f` to each of the realizations. 
""" 
resample(f::Function, data, method::AbstractResampling)

""" 
    resample(data, method::AbstractResampling) -> AbstractSampling

Draw subsets/realizations of `data` using the provided resampling `method`.
""" 
resample(data, method::AbstractResampling)

The first method would be for computing statistics by applying the point estimator f directly to the draws of data, while the second would return the subsamples of data themselves (without performing any computation on them).

Note that I've changed the order of the arguments you proposed, so that the resampling method appears last. This would allow having a default resampling method specific to a certain type of data, for example for the different types of uncertain values in this package.
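
For instance, a data type could then be given a sensible default (the default_resampling helper here is purely hypothetical, just to illustrate the dispatch):

# Hypothetical sketch: packages define a default resampling method for their
# own types, so users can omit the resampling argument entirely.
default_resampling(data) = BasicSampling(10000)

resample(f::Function, data) = resample(f, data, default_resampling(data))
resample(data) = resample(data, default_resampling(data))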

In Bootstrap.jl

import StatsBase: resample

""" 
    resample(f::Function, data, method::BootstrapSampling) -> BootstrapSample

Resample `data` using the provided bootstrap `method`, applying the point estimator 
`f` to each sample. 
""" 
function resample(f::Function, data, method::BootstrapSampling)
    bootstrap(f, data, method)
end

In Jackknife.jl

import StatsBase: resample, AbstractResampling, AbstractSampling

""" 
Abstract type for jackknife resampling methods
"""
abstract type JackknifeResampling <: AbstractResampling end

"""
A type indicating that leave-one-out resampling should be performed.
"""
struct LeaveOneOutResampling <: JackknifeResampling end

""" 
    JackknifeSampling

Composite type that holds the result of a jackknife resampling. 

## Fields 
- **`estimates`**: The `n` jackknifed estimates of `statistic`, each computed from `n - 1` points of `data`. 
- **`statistic`**: A point estimator of some statistic, for example `Statistics.mean`. 
- **`method`**: The jackknife resampling method, for example `LeaveOneOutResampling`.
- **`data`**: The data from which the statistic is estimated. 
""" 
struct JackknifeSampling{T, DT} <: AbstractSampling
    estimates::Vector{T}
    statistic::Function
    method::JackknifeResampling
    data::DT
end

""" 
    resample(f::Function, data, method::LeaveOneOutResampling)

Compute the jackknife estimate to a statistic by resampling `data` using 
leave-one-out sampling and applying the point estimator `f` to each 
of the samples.
""" 
function resample(f::Function, data, method::LeaveOneOutResampling)
    estims = Jackknife.leaveoneout(f, data)
    JackknifeSampling(estims, f, method, data)
end

Resampling from the user's perspective

The user would then just load StatsKit, and be able to do the following:

estimator = mean
some_data = rand(100)

resample(estimator, some_data, LeaveOneOutResampling())
resample(estimator, some_data, BasicSampling(500))

which dispatches to the correct methods in the respective packages.

In UncertainData.jl

When you say plug in other approaches, do you mean that the user should be able to choose arbitrary resampling schemes from Bootstrap.jl and Jackknife.jl when resampling uncertain values/datasets in this package?

The resampling scheme for the uncertain values currently defined in the package is basic resampling with replacement, which is essentially equivalent to Bootstrap.BasicSampling, so I could change the syntax according to the above proposal. This could for example be done as follows:

import StatsBase: resample, AbstractResampling, AbstractSampling

"""
    AbstractUncertainValueSampling

Abstract type for samples drawn from uncertain values whose ultimate supertype is 
`AbstractUncertainValue`.
"""
abstract type AbstractUncertainValueSampling <: AbstractSampling end
abstract type AbstractUncertainScalarSampling <: AbstractUncertainValueSampling end

# Assume any subtype of AbstractUncertainScalarSampling is meant to be indexed as an 
# array. This ensures that the already implemented mathematical and statistical operations 
# that rely on resampling remain supported without changing any code.
Base.length(s::AbstractUncertainScalarSampling) = length(s.draws)
Base.size(s::AbstractUncertainScalarSampling) = size(s.draws) 
Base.firstindex(s::AbstractUncertainScalarSampling) = 1
Base.lastindex(s::AbstractUncertainScalarSampling) = length(s)
Base.getindex(s::AbstractUncertainScalarSampling, i) = s.draws[i] 
Base.iterate(s::AbstractUncertainScalarSampling, state = 1) = iterate(s.draws, state)

""" 
    UncertainScalarSampling

A simple vector-like type that holds the draws of a resampling of an uncertain scalar value.

## Fields 
- **`draws`**: A vector containing the resampled draws.
- **`method`**: The resampling method, for example `Bootstrap.BasicSampling(n_draws)`.

""" 
struct UncertainScalarSampling{T, AR <: AbstractResampling} <: AbstractUncertainScalarSampling
    draws::Vector{T}
    method::AR
end
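
With the array interface defined above, a sampling would then behave like a read-only vector (a hypothetical usage, assuming BasicSampling subtypes AbstractResampling as proposed):

s = UncertainScalarSampling(randn(1000), BasicSampling(1000))

s[1]                # index into the draws
length(s)           # number of draws
sum(s) / length(s)  # reductions work through iteration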

""" 
    ResampledStatistic

A composite type that holds the `estimates` of the `statistic` computed elementwise on 
draws of an uncertain value using the provided resampling `method`. 

## Fields 
- **`estimates`**: A vector containing the statistic computed element-wise on the resampled draws.  
- **`statistic`**: A point estimator of some statistic, for example `Statistics.mean`. 
- **`method`**: The resampling method, for example `Bootstrap.BasicSampling(n_draws)`.
""" 
struct ResampledStatistic{T, AR <: AbstractResampling} <: AbstractSampling
    estimates::Vector{T}
    statistic::Function
    method::AR
end

With that in place, we can dispatch on uncertain value type and resampling scheme:

""" 
    resample(f::Function, uval::AbstractUncertainValue, 
        method::BasicSampling) -> ResampledStatistic 

Draw realizations of the uncertain value `uval` using the provided resampling `method` 
and apply the point estimator `f` to the resulting draws.
""" 
function resample(f::Function, uval::AbstractUncertainValue, method::BasicSampling)
    estim = f(rand(uval, method.nrun))
    ResampledStatistic([estim], f, method)
end

""" 
    resample(f::Function, uval::AbstractUncertainValue, 
        method::BasicSampling) -> UncertainScalarSampling

Draw subsets/realizations of the uncertain value `uval` using the provided resampling `method`
"""
function resample(uval::AbstractUncertainValue, method::BasicSampling)
    UncertainScalarSampling(rand(uval, method.nrun), method)
end

This should also be extended in some way to uncertain datasets (which there now is a separate issue for: #42).
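
One possible shape for that extension (purely illustrative; it assumes resample(udata) yields a single realization of the whole dataset, which is not settled here):

# Hypothetical: apply the point estimator f to each of method.nrun independent
# realizations of the entire dataset.
function resample(f::Function, udata::UncertainDataset, method::BasicSampling)
    estimates = [f(resample(udata)) for _ in 1:method.nrun]
    ResampledStatistic(estimates, f, method)
end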

Resampling uncertain values from the user's perspective

If working with UncertainData.jl, then the user could compute statistics from uncertain values as follows (with similar syntax for uncertain datasets; again, see #42):

# Theoretical values
resample(mean, UncertainValue(Normal, 2, 0.1), BasicSampling(1000))
resample(mean, UncertainValue(Beta, 3.1, 5.2), BasicSampling(1000))
resample(mean, UncertainValue(Uniform, -2, -1), BasicSampling(1000))

# Populations 
resample(mean, UncertainValue([1, 1.5, 3], [0.2, 0.6, 0.2]), BasicSampling(1000))

Omitting the first argument (the point-estimator) simply returns the subsets/realizations.

resample(UncertainValue(Normal, 2, 0.1), BasicSampling(1000))
resample(UncertainValue([1, 1.5, 3], [0.2, 0.6, 0.2]), BasicSampling(1000))

kahaaga commented 5 years ago

Sorry for the late reply, @mkborregaard. I've been caught up in writing my thesis, so I haven't had a chance to delve deeper into the workings of Measurements.jl.

Do you have any specific ideas/suggestions on how Measurements.jl might benefit from this package? The only immediate, concrete thought I have is that when considering normal distributions in this package, it would be cool to be able to use the machinery from Measurements.jl (i.e. the default would be to use exact error propagation when possible, and to resample if not). Other than that, I haven't been able to come up with any good ideas for what such a dependency/incorporation would look like. Although it would be cool, it's a bit difficult for me to see an immediate merging of the packages, considering that they aim to do slightly different things. So I'm very happy to hear your ideas if you have any!

To narrow the gap between the packages, however, I have implemented, for UncertainData.jl, some of the functionality that Measurements.jl provides. With the latest release, you can now use mathematical operations (addition, subtraction, trig functions, etc.) on uncertain values, as well as on uncertain datasets (currently undocumented). Among other things, I'm also planning to add support for units using Unitful.jl, so that unit conversion is handled automatically when performing mathematical operations on uncertain values/datasets.
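
For example (a small sketch; the exact return types depend on the value types involved):

using UncertainData

x = UncertainValue(Normal, 0.0, 0.5)
y = UncertainValue(Gamma, 2.0, 3.0)

# Arithmetic and elementary functions work directly on uncertain values;
# the results are obtained under the hood by resampling.
z = x + y
w = sin(x)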

mkborregaard commented 5 years ago

That sounds amazing. No, it makes sense given the implementation here that this is as far as it can go. Thanks for the response!