mitchelloharawild / distributional

Vectorised distributions for R
https://pkg.mitchelloharawild.com/distributional
GNU General Public License v3.0
97 stars 15 forks source link

mean and variance are incorrect for transformed distributions #102

Closed venpopov closed 5 months ago

venpopov commented 6 months ago

The mean and variance methods for transformed distributions give incorrect results. This is unrelated to recent changes, as it also happens on the cran version. Here are two examples:

library(distributional)
library(ggdist)
library(ggplot2)
library(magrittr)

# Define a lognormal distribution either by the default or by transforming a normal distribution
mu = 2
sd = 1.5
trans_lnorm <- exp(dist_wrap('norm', mean = mu, sd = sd))
def_lnorm <- dist_lognormal(mu = mu, sigma = sd)

# plot shows same density
data.frame(dists = c(def_lnorm, trans_lnorm),
           names = c('Builtin lognormal', 'Transformed normal')) %>%
  ggplot(aes(xdist = dists, y = names)) +
  stat_slab(p_limits = c(0, 0.9)) +
  theme_ggdist() +
  ylab("")


# means and variance are incorrect for the transformed distribution
data.frame(type = c('analytic', 'builtin', 'transformed'),
           mean = c(exp(mu + sd^2/2), 
                    mean(def_lnorm), 
                    mean(trans_lnorm)),
           var = c((exp(sd^2) - 1) * exp(2*mu + sd^2),
                   variance(def_lnorm),
                   variance(trans_lnorm)))
#>          type     mean       var
#> 1    analytic 22.75990 4396.7560
#> 2     builtin 22.75990 4396.7560
#> 3 transformed 15.70174  261.0474
# same with gumbel
def_gumb <- dist_gumbel(alpha = mu, scale = sd)
trans_gumb <- -log(-log(dist_uniform(0, 1))) * sd + mu

data.frame(dists = c(def_gumb, trans_gumb),
           names = c('Builtin gumbel', 'Transformed uniform')) %>%
  ggplot(aes(xdist = dists, y = names)) +
  stat_slab(p_limits = c(0.0001, 0.995)) +
  theme_ggdist() +
  ylab("")


data.frame(type = c('analytic', 'builtin', 'transformed'),
           mean = c(mu + 0.5772 * sd, 
                    mean(def_gumb),
                    mean(trans_gumb)),
           var = c((pi^2/6) * sd^2,
                   variance(def_gumb),
                   variance(trans_gumb)))
#>          type     mean      var
#> 1    analytic 2.865800 3.701102
#> 2     builtin 2.865823 3.701102
#> 3 transformed 2.709438 1.612015

Created on 2024-04-04 with reprex v2.1.0

venpopov commented 6 months ago

Now that we don't get NA values after #97, there is an easy fix:

mean_new <- function(x, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = .Machine$double.eps^0.5)$value
}

covariance_new <- function(x, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at^2
  }
  mean <- mean_int(dist)
  integrate(fun, -Inf, Inf, dist = x, rel.tol = .Machine$double.eps^0.5)$value - mean^2
}

mean_new(trans_lnorm)
#> [1] 22.7599
var_int(trans_lnorm)
#> [1] 4396.756
mitchelloharawild commented 5 months ago

Currently these are calculated with taylor series approximations, which are faster than the numerical integration and I think work fine in most cases. Maybe numerical integration would be fast enough with a higher tolerance, while still giving comparable accuracy but with improved consistency.

mitchelloharawild commented 5 months ago

Closing as this is a duplicate of #73, you can add your examples or comments in that thread.