Inaccurate HDRs - Githubissues

mitchelloharawild / distributional

Vectorised distributions for R

https://pkg.mitchelloharawild.com/distributional

GNU General Public License v3.0

100 stars 16 forks source link

Inaccurate HDRs #123

Open robjhyndman opened 3 months ago

robjhyndman commented 3 months ago

library(distributional)
hdr(dist_normal())
#> <hdr[1]>
#> [1] [-1.95194, 1.95194]95
qnorm(0.975)
#> [1] 1.959964

^{Created on 2024-08-12 with reprex v2.1.1}

mitchelloharawild commented 3 months ago

This is largely based on the grid size of the quantiles (same as is done in hdrcde).

Gradient descent based methods would produce more accurate solutions starting from suitable hdrs from the grid method.

library(distributional)
hdr(dist_normal(), n = 1e6)
#> <hdr[1]>
#> [1] [-1.959956, 1.959956]95
qnorm(0.975)
#> [1] 1.959964

^{Created on 2024-08-12 with reprex v2.1.0}

robjhyndman commented 3 months ago

Maybe set n a little higher. quantiles should be relatively quick to compute for most distributions. Setting n=1e5 is still relatively quick for dist_normal() I think 4 accurate significant figures is a reasonable objective.

mitchelloharawild commented 2 months ago

What do you think about root finding algorithms for this, starting from a suitable location chosen from a course grid?

robjhyndman commented 2 months ago

Maybe. Perhaps try a speed comparison on dist_normal(). I've tested the effect of grid size on accuracy, and it has some really weird numerical behaviour.

library(distributional)
library(ggplot2)
df <- data.frame(
  n = exp(seq(log(50), log(5e3), length = 1e3)),
  upper = NA_real_
)
for (i in seq(NROW(df))) {
  df$upper[i] <- unlist(hdr(dist_normal(), n = df$n[i]))["upper"]
}
df |>
  ggplot(aes(x = n, y = upper)) +
  geom_line() +
  geom_hline(yintercept = qnorm(0.975), col = "red") +
  scale_x_log10()

^{Created on 2024-09-17 with reprex v2.1.1}

robjhyndman commented 2 months ago

I started writing a root finding function, and then realised why I didn't do it this way in the first place. Every time you do a calculation of the HDR coverage for a generic continuous distribution, you need to compute the density at a large number of observations, because the algorithm uses Monte Carlo integration. So with root finding, it would require far more computations than having a fine grid.

Of course, we could have hdr.dist_xxx() functions for specific distributions that would be much faster. For symmetric distributions the computation is easily done from the cdf. So perhaps we could at least add hdr.dist_normal() and similar for other symmetric distributions, and change the value of n in hdr.dist_default() to about 5000?

I can do a PR along these lines if you want.