Closed: mjskay closed this issue 3 years ago
I don't have a lot of experience working with log probabilities, but is there any benefit to having a parameter for this rather than taking the log afterwards?
Presumably there's some performance improvements that are possible using an alternate expression for the density/cdf/quantiles.
library(distributional)
dnorm(0:5, mean = 0, sd = 1, log = TRUE)
#> [1] -0.9189385 -1.4189385 -2.9189385 -5.4189385 -8.9189385 -13.4189385
log(density(dist_normal(0,1), 0:5))
#> [1] -0.9189385 -1.4189385 -2.9189385 -5.4189385 -8.9189385 -13.4189385
pnorm(0:5, mean = 0, sd = 1, log.p = TRUE)
#> [1] -6.931472e-01 -1.727538e-01 -2.301291e-02 -1.350810e-03 -3.167174e-05
#> [6] -2.866516e-07
log(cdf(dist_normal(0,1), 0:5))
#> [1] -6.931472e-01 -1.727538e-01 -2.301291e-02 -1.350810e-03 -3.167174e-05
#> [6] -2.866516e-07
Created on 2020-07-12 by the reprex package (v0.3.0)
There are both speed and accuracy benefits --- speed since some formulas on the log scale can be simpler, and accuracy because probabilities and densities close to zero don't get clipped to zero by floating point underflow.
I didn't give a very good example above since it didn't show this problem, but if you go a little further out in the Normal you can see it:
> dnorm(30:40)
[1] 1.473646e-196 8.363952e-210 1.746366e-223 1.341420e-237 3.790526e-252 3.940396e-267 1.506905e-282
[8] 2.120007e-298 1.097221e-314 0.000000e+00 0.000000e+00
> log(dnorm(30:40))
[1] -450.9189 -481.4189 -512.9189 -545.4189 -578.9189 -613.4189 -648.9189 -685.4189 -722.9189 -Inf
[11] -Inf
> dnorm(30:40, log = TRUE)
[1] -450.9189 -481.4189 -512.9189 -545.4189 -578.9189 -613.4189 -648.9189 -685.4189 -722.9189 -761.4189
[11] -800.9189
dnorm(30:40) is returning some values as 0 that should actually be numbers close to 0, but they are too small to be represented in double-precision floating-point format.
As a result, people often work on the log scale, especially when multiplying many probabilities together, where values shrink toward zero quickly.
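To make the underflow concrete, here is a small sketch in plain base R (nothing specific to {distributional}): the product of a hundred modest densities underflows to zero, while the equivalent sum of log densities stays finite.

```r
# product of 100 small-but-representable densities underflows to 0
p <- dnorm(rep(8, 100))
prod(p)                              # 0

# the same quantity computed on the log scale stays finite
sum(dnorm(rep(8, 100), log = TRUE))  # about -3291.9
```

Each dnorm(8) is about 5e-15, which a double holds comfortably; it is only the product of many of them that falls below the smallest representable double.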
Thinking about how this could work, I think it makes sense to have log_*() generics to do this. The implementation for log probabilities can vary substantially, and having a default method that calls log(density()) would be nice. Interface-wise, I think starting with the log argument would be safe, and internally having it call the log_*() generic if log = TRUE.
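As a sketch of that dispatch (all names here are hypothetical toy stand-ins, not the actual {distributional} internals): a log_density() generic whose default method falls back to log(density()), and which individual distribution classes can override with a numerically stable log-scale implementation.

```r
log_density <- function(dist, at, ...) UseMethod("log_density")

# default fallback: always available, but underflows for tiny densities
log_density.default <- function(dist, at, ...) log(density(dist, at, ...))

# a toy normal distribution class with an exact log-scale method
density.my_normal <- function(x, at, ...) dnorm(at, x$mu, x$sigma)
log_density.my_normal <- function(dist, at, ...) {
  dnorm(at, dist$mu, dist$sigma, log = TRUE)
}

d <- structure(list(mu = 0, sigma = 1), class = "my_normal")
log_density(d, 40)          # exact method: -800.9189
log_density.default(d, 40)  # fallback underflows: -Inf
```

A user-facing density(..., log = TRUE) would then simply route to the generic, so classes without a dedicated method still work via the fallback.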
What's the motivation for making it a separate function rather than using an argument? The argument-based approach would make most of the implementations very easy, since in most cases it looks like you're already calling through to existing distribution functions that already support the log / log.p arguments. So you could just pass it on down and get support for free, with minimal maintenance or implementation cost.
While most distributions have the log / log.p arguments, as I understand it the methods/code required to compute the log density / log probability are substantially different. If the package develops beyond the distributions available in {stats} or packages lean enough to include as imports, I imagine separating the two methods would be beneficial.
More practically, having log_*() generics (at least internally) would allow log_*.dist_default() to be defined, which would call log(*(dist)).
As an aside, is there any importance to using log.p rather than log for the cdf()/quantile() distribution methods? This is also used in {distributions3}, which provides distributions3::log_pdf() without the option to use distributions3::pdf(log = TRUE).
I'm about to start adding log.p support for cdf() and quantile(). Could you comment on the usage of log vs log.p for this argument?
Hmm, I don't have a strong opinion on that one... I'd probably go with whatever fits your sensibilities in an API. Personally I'd probably make them all log for consistency.
I was leaning toward log also, especially as the argument for cdf() and quantile() isn't p.
Thanks.
Closing as this now works (at least using the default methods as fallbacks).
Adding other log methods for cdf() and quantile() will be done with lower priority on an 'as used/needed' basis.
It would be very helpful to include an option to produce / take log densities and log probabilities as arguments, as the base R distribution functions do. This makes it easier to program against distributions in a numerically stable way.
Perhaps a log argument to density(), cdf(), and quantile()? (That would probably be more consistent than log for the density function and log.p for the quantile function and cdf, as base R does it.) e.g. allow this:

density(dist_normal(0, 1), 0:5, log = TRUE)

which would be equivalent to this:

log(density(dist_normal(0, 1), 0:5))
And for quantile(), setting log = TRUE would mean its probs argument is treated as log probabilities, and for cdf(), setting log = TRUE would cause it to produce log probabilities. In most cases I think it's a relatively straightforward change of passing through one argument. Happy to submit a PR if there's interest.