mitchelloharawild / distributional

Vectorised distributions for R
https://pkg.mitchelloharawild.com/distributional
GNU General Public License v3.0
97 stars 15 forks source link

More reliable mean() / variance() for dist_default #71

Closed mjskay closed 2 years ago

mjskay commented 2 years ago

Currently for distributions without an implementation of mean(), mean.dist_default() takes the mean of 1000 draws from the distribution:

https://github.com/mitchelloharawild/distributional/blob/40830edf206ea4e9038bc164ad6738981b4dc9af/R/default.R#L78-L80

Naturally this causes some sampling error:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.76567
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.587978
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.646674
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.706244

I feel like there should be a less noisy alternative. One such might be to take the mean of a large number of quantiles, a sort of quasi-monte carlo approach:

mean.dist_default <- function(x, ...){
  mean(quantile(x, stats::ppoints(1000)), na.rm = TRUE)
}

This yields decent results on a sanity check:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.645156     # should be exp(0.5)
> exp(0.5)
[1] 1.648721

The approximation might also be improved by using integrate() on continuous distributions, though I'm not sure how reliable this would be across arbitrary distributions:

mean.dist_default <- function(x, ...){
  if (is.double(support(x)[[1]])) {
    limits = quantile(x, c(0, 1))
    tryCatch(
      integrate(function(y) density(x, y) * y, limits[1], limits[2])$value,
      error = function(e) NA_real_
    )
  } else {
    mean(quantile(x, stats::ppoints(1000)), na.rm = TRUE)
  }
}

Yeilding a result quite close to the correct result:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.648721

And something similar could probably also be used to improve the calculations in variance.dist_default().

mitchelloharawild commented 2 years ago

Added, thanks!

mjskay commented 2 years ago

Lovely, thank you!