mitchelloharawild / distributional

Vectorised distributions for R
https://pkg.mitchelloharawild.com/distributional
GNU General Public License v3.0
94 stars 15 forks source link

Log densities / log probabilities #37

Closed mjskay closed 3 years ago

mjskay commented 4 years ago

It would be very helpful to include an option to produce / take log densities and log probabilities as arguments, as the base R distribution functions do. This makes it easier to program against distributions in a numerically stable way.

Perhaps a log argument to density(), cdf(), and quantile()? (That would probably be more consistent than log for the density function and log.p for the quantile function and cdf, as base R does it).

e.g. allow this:

density(dist_normal(0, 1), 0:5, log = TRUE)

which would be equivalent to this:

dnorm(0:5, mean = 0, sd = 1, log = TRUE)

And for quantile() setting log = TRUE would mean its probs argument is treated as log probabilities, and for cdf() setting log = TRUE would cause it to produce log probabilities.

In most cases I think it's a relatively straightforward change of passing through one argument. Happy to submit a PR if there's interest.

mitchelloharawild commented 4 years ago

I don't have a lot of experience working with using log probabilities, but is there any benefit of having a parameter for this rather than taking the log afterwards?

Presumably there's some performance improvements that are possible using an alternate expression for the density/cdf/quantiles.

library(distributional)
dnorm(0:5, mean = 0, sd = 1, log = TRUE)
#> [1]  -0.9189385  -1.4189385  -2.9189385  -5.4189385  -8.9189385 -13.4189385
log(density(dist_normal(0,1), 0:5))
#> [1]  -0.9189385  -1.4189385  -2.9189385  -5.4189385  -8.9189385 -13.4189385

pnorm(0:5, mean = 0, sd = 1, log.p = TRUE)
#> [1] -6.931472e-01 -1.727538e-01 -2.301291e-02 -1.350810e-03 -3.167174e-05
#> [6] -2.866516e-07
log(cdf(dist_normal(0,1), 0:5))
#> [1] -6.931472e-01 -1.727538e-01 -2.301291e-02 -1.350810e-03 -3.167174e-05
#> [6] -2.866516e-07

Created on 2020-07-12 by the reprex package (v0.3.0)

mjskay commented 4 years ago

There are both speed and accuracy benefits --- speed since some formulas on the log scale can be simpler, and accuracy because probabilities and densities close to zero don't get clipped to zero by floating point underflow.

I didn't give a very good example above since it didn't show this problem, but if you go a little further out in the Normal you can see it:

> dnorm(30:40)
 [1] 1.473646e-196 8.363952e-210 1.746366e-223 1.341420e-237 3.790526e-252 3.940396e-267 1.506905e-282
 [8] 2.120007e-298 1.097221e-314  0.000000e+00  0.000000e+00
> log(dnorm(30:40))
 [1] -450.9189 -481.4189 -512.9189 -545.4189 -578.9189 -613.4189 -648.9189 -685.4189 -722.9189      -Inf
[11]      -Inf
> dnorm(30:40, log = TRUE)
 [1] -450.9189 -481.4189 -512.9189 -545.4189 -578.9189 -613.4189 -648.9189 -685.4189 -722.9189 -761.4189
[11] -800.9189

dnorm(30:40) is returning some values as 0 that should actually be numbers close to 0, but they are too small to be represented in a double-precision floating point format.

As a result people often work on the log scale, especially when multiplying probabilities together a lot, where things can get small quickly.

mitchelloharawild commented 4 years ago

Thinking about how this could work, I think it makes sense to have log_*() generics to do this. The implementation for log probabilities can vary substantially, and having a default method that calls log(density()) would be nice.

Interface wise, I think starting with the log argument would be safe, and internally having it call the log_*() generic if log=TRUE.

mjskay commented 4 years ago

What's the motivation for making it a separate function rather than using an argument? The argument-based approach would make most of the implementations very easy, since in most cases it looks like you're already calling through to existing distribution functions that already support the log / log.p arguments. So you could just pass it on down and get support for free, with minimal maintenance or implementation cost.

mitchelloharawild commented 4 years ago

While most distributions have the log / log.p arguments, as I understand it the methods/code required to compute the log density / log probability is substantially different.

If the package develops beyond distributions available in {stats} or packages lean enough to include as imports, I imagine separating the two methods would be beneficial.

More practically, having log_*() generics (at least internally) would allow log_*.dist_default() to be defined, which would call log(*(dist)).

As an aside, is there any importance to using log.p rather than log for cdf()/quantile() distribution methods?

mitchelloharawild commented 4 years ago

This is also used in {distributions3} which provides distributions3::log_pdf(), without the option to use distributions3::pdf(log = TRUE).

mitchelloharawild commented 4 years ago

I'm about to start adding log.p support for cdf() and quantile(), could you comment on the usage of log vs log.p for this argument?

mjskay commented 4 years ago

Hmm I don't have a strong opinion on that one... I'd probably go with whatever fits your sensibilities in an API. Personally I'd probably make them all log for consistency.

mitchelloharawild commented 4 years ago

I was leaning toward log also. Especially as the argument for cdf() and quantile() isn't p. Thanks.

mitchelloharawild commented 3 years ago

Closing as this now works (at least using the default methods as fallbacks). Adding other log methods for cdf() and quantile() will be done with lower priority on an 'as used/needed' basis.