mjskay opened 2 years ago
The Taylor series approximation used to compute the mean of log-Normal distributions is appropriate for small variances. Some relevant discussion can be found here: https://robjhyndman.com/hyndsight/backtransforming/#comment-3043970592 In most forecasting applications this approximation results in a relatively small inaccuracy in the forecast distribution's mean.
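For context, the approximation in question is the second-order expansion E[g(X)] ≈ g(μ) + g''(μ)σ²/2 for X ~ N(μ, σ²). A minimal base-R sketch (not distributional's internals) comparing it against the exact log-normal mean for g = exp:

```r
# Second-order Taylor approximation of E[exp(X)] for X ~ N(mu, sd^2),
# versus the exact log-normal mean exp(mu + sd^2 / 2).
mu <- 0
sd <- 1
taylor_mean <- exp(mu) + exp(mu) * sd^2 / 2  # g(mu) + g''(mu) * sigma^2 / 2
exact_mean  <- exp(mu + sd^2 / 2)
c(taylor = taylor_mean, exact = exact_mean)  # 1.5 vs ~1.6487
```

Even at sd = 1 the approximation is off by about 9%, which is why the error is tolerable for small variances but not in general.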
The log-normal distribution is sufficiently common to justify a specific distribution, but I'm concerned about how this inaccuracy might impact other (non-log) transformations.
One option could be to sample the mean; the quantile() and generate() methods are accurate. But this would substantially degrade performance for a task commonly done at scale in forecasting.
@robjhyndman - do you have any suggestions for improving this approximation?
The Taylor series approximation up to 2nd derivatives has the nice property that it is very general provided the transformation is differentiable, but it is not necessarily very accurate. I can think of three possible approaches:
Great, thanks for the suggestions. I think a good start would be to add better methods for common transformations and fallback to Taylor series approximations otherwise. I think sampling for the mean will be too slow for most applications, but would be useful to provide as an option for users.
OK. To start, a more accurate version for exp() has mean exp(m + v/2) and variance [exp(v)-1]exp(2m+v) where the original distribution has mean m and variance v. But if we are going to allow special cases, then perhaps identify when exp is applied to a normal distribution, and save it as logNormal rather than t(N).
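Those closed forms are easy to check against simulation; a quick base-R sketch (with m and v as defined above):

```r
# Exact mean and variance of exp(X) for X ~ N(m, v), checked by Monte Carlo.
m <- 2
v <- 1.5^2
exact_mean <- exp(m + v / 2)                 # ~22.76
exact_var  <- (exp(v) - 1) * exp(2 * m + v)
set.seed(42)
draws <- exp(rnorm(1e6, mean = m, sd = sqrt(v)))
c(exact = exact_mean, simulated = mean(draws))
```

The simulated mean agrees with the closed form up to Monte Carlo error (the log-normal's heavy tail makes the variance estimate noisier than the mean).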
I've added the log-normal distribution, and a shortcut between it and the normal distribution via exp() and log().
library(distributional)
exp(dist_normal())
#> <distribution[1]>
#> [1] lN(0, 1)
log(dist_lognormal())
#> <distribution[1]>
#> [1] N(0, 1)
Created on 2021-11-12 by the reprex package (v2.0.0)
Following up on the discussion started in #102
Currently these are calculated with Taylor series approximations, which are faster than numerical integration and I think work fine in most cases. Maybe numerical integration would be fast enough with a higher tolerance, while still giving comparable accuracy but with improved consistency.
Yes, I did notice they use Taylor series approximations. But I'm not sure that they do work fine in most cases. The examples I posted use fairly typical parameter ranges, and the error is huge. When you say they work fine in most cases, do you have any results on the expected error under different transformations and distributions?
The way it is now, I personally wouldn't use this functionality. Maybe the simplest fix is to have an argument to mean() and variance() that allows the user to choose which method (Taylor series vs. integration) to use. That way anyone can decide what the acceptable speed-accuracy trade-off is for them.
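As a sketch of what such an opt-in could look like (the function name and signature here are hypothetical, not distributional's API; the integration branch takes the density of the transformed distribution directly so the example stays self-contained in base R):

```r
# Hypothetical mean() with a user-selectable method; illustrative only.
# g/d2g: the transformation and its 2nd derivative; mu/v: base mean/variance;
# dfun: density of the transformed distribution (for the integration branch).
mean_by_method <- function(g, d2g, mu, v, dfun,
                           method = c("taylor", "integrate"),
                           rel.tol = .Machine$double.eps^0.5) {
  method <- match.arg(method)
  if (method == "taylor") {
    g(mu) + d2g(mu) * v / 2                    # 2nd-order expansion
  } else {
    integrate(function(x) x * dfun(x), -Inf, Inf,
              rel.tol = rel.tol)$value         # E[X] = integral of x * f(x)
  }
}
# exp() applied to N(2, 1.5^2): the exact mean is exp(2 + 1.125) ~ 22.76
mean_by_method(exp, exp, 2, 1.5^2, method = "taylor")   # ~15.70
mean_by_method(dfun = function(x) dlnorm(x, 2, 1.5),
               method = "integrate")                    # ~22.76
```

This mirrors the gap discussed above: the Taylor branch reproduces the badly biased 15.70, while the integration branch recovers the correct value.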
Btw, the speed issue is greatly improved by now having the symbolic derivatives. For example, on the master branch:
mean_new <- function(x, rel.tol = .Machine$double.eps^0.5, ...) {
  # Integrand: at * f(at), so integrating over the support gives E[X]
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = rel.tol)$value
}
trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
microbenchmark::microbenchmark(mean(trans_lnorm),
                               mean_new(trans_lnorm))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> mean(trans_lnorm) 514.468 542.4095 581.0446 575.1685 601.8595 902.205 100
#> mean_new(trans_lnorm) 103720.283 105202.5765 109171.2150 106897.0450 109197.6370 175162.168 100
And on the symbolic derivatives branch:
Unit: microseconds
expr min lq mean median
mean(trans_lnorm) 540.093 553.828 606.679 568.629
mean_new(trans_lnorm) 10147.090 10281.221 10995.819 10466.890
and using a loose rel.tol in integrate, such as 0.1, is only about 7 times slower than the Taylor series, but produces ~2000 times smaller error:
type mean microseconds
1 analytical 22.75990 0.58548
2 taylor series 15.70174 627.20898
3 integrate(rel.tol = .Machine$double.eps^0.5) 22.75990 11238.07048
4 integrate (rel.tol = 1e-5) 22.75990 8340.58449
5 integrate (rel.tol = 1e-3) 22.75982 6154.55551
6 integrate (rel.tol = 1e-1) 22.75512 4015.56993
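The error side of that table can be reproduced with a small base-R sketch (using the known log-normal density directly rather than distributional's objects; timings will of course vary by machine):

```r
# Absolute error of integrate() at several tolerances for the mean of
# exp(N(2, 1.5^2)); the analytical value is exp(2 + 1.5^2 / 2) ~ 22.76.
exact <- exp(2 + 1.5^2 / 2)
int_mean <- function(rel.tol) {
  integrate(function(x) x * dlnorm(x, 2, 1.5), 0, Inf,
            rel.tol = rel.tol)$value
}
tols <- c(.Machine$double.eps^0.5, 1e-5, 1e-3, 1e-1)
sapply(tols, function(tol) abs(int_mean(tol) - exact))
```

Even at rel.tol = 0.1 the absolute error stays tiny next to the ~7-unit error of the Taylor approximation for these parameters.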
I'm open to changing it; there are many options here. For example, mean.dist_default() uses numerical integration. Originally it took the average of a sufficiently large sample from generate(). The current behaviour is for both speed and consistency with the forecast/fable packages, but I agree that accuracy is a more important consideration (@robjhyndman can comment on the accuracy of the TS approximation under different circumstances).
I also experimented with rel.tol, and I think that something high like 0.1 by default is OK; of course, this should be parameterised.
Here are some other examples of the Taylor series approach with different transformations:
library(distributional)
mu = 15
sd = 2
x <- dist_normal(mu, sd)^2
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 229.0715
mean(x)
#> [1] 229
x <- sqrt(dist_normal(mu, sd))
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 3.864325
mean(x)
#> [1] 3.864377
bc <- scales::boxcox_trans(0.5)
x <- dist_transformed(dist_normal(mu, sd), bc$transform, bc$inverse)
mean(x)
#> [1] 5.728753
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 5.728844
Created on 2024-04-19 with reprex v2.0.2
I guess the issue is much more pronounced when the inverse function is log, because the Taylor series approximation for log has a very small radius of convergence, while the transformations you posted involve power and exp transformations that are well approximated by Taylor series.
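To see how fast that error grows for the exp/log pair: for X ~ N(0, sd^2) the second-order approximation gives exp(0) * (1 + sd^2/2), while the exact mean of exp(X) is exp(sd^2/2). A quick base-R sketch of the relative error:

```r
# Relative error of the 2nd-order Taylor mean of exp(X) for X ~ N(0, sd^2),
# versus the exact log-normal mean; the error explodes as sd grows.
sd <- c(0.1, 0.5, 1, 1.5, 2)
taylor <- 1 + sd^2 / 2
exact  <- exp(sd^2 / 2)
round((exact - taylor) / exact, 3)  # from ~0 at sd = 0.1 to ~0.59 at sd = 2
```

So for sd around 1.5 to 2 (hardly exotic on the log scale), the Taylor mean is off by 30 to 60 percent, consistent with the large errors reported earlier in the thread.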
Yes, that's correct. This is why we have log(dist_normal()) -> dist_lognormal() to catch this common case.
I'm not sure of the exact problem, but it seems that the mean() of transformed distributions is not always correct. Here's an example of exp() applied to dist_normal(), which should yield a lognormal distribution, giving incorrect means:

I'm not sure of the issue. It does appear to work in some cases: