jamesotto852 / ggdensity

An R package for interpretable visualizations of bivariate density estimates
https://jamesotto852.github.io/ggdensity/
Other
230 stars 14 forks source link

Scale fail? #1

Closed dkahle closed 3 years ago

dkahle commented 3 years ago

Any ideas what's going on here? Seems like the scale of the numbers matters.

library("tidyverse"); theme_set(theme_minimal())
  library("ggdensity")

  N <- 1e3
  df <- tibble(x = rnorm(N, 0, 200), y = rnorm(N, 0, 200)) 

  ggplot(df, aes(x, y)) + geom_point()

  ggplot(df, aes(x, y)) + geom_hdr()


  df <- tibble(x = rnorm(N, 0, 1), y = rnorm(N, 0, 1)) 

  ggplot(df, aes(x, y)) + geom_point()

  ggplot(df, aes(x, y)) + geom_hdr()

Created on 2021-09-25 by the reprex package (v2.0.1)

jamesotto852 commented 3 years ago

I've been digging into this, it's a very strange bug. It looks like uniroot is the culprit, it is not finding the correct density values which yield the desired HDRs. Below, I have included a reprex which matches what's going on internally (I've verified through the use of browser). Notice, the output of find_cutoff makes no sense -- it should be on the same scale as df$fhat and should be non-zero.

I've also included the full output of uniroot for the 50% HDR. Notice, f.root is nowhere near 0 and iter is small. (I have tried with higher values of maxiter, but it hasn't fixed the issue -- it still stops at 27). The value of estim.prec indicates to me that uniroot "knows" something is wrong, but I don't know how to fix it.

Have you run into something like this before? Do you know what might be causing this?

library("tidyverse"); theme_set(theme_minimal())
library("ggdensity")

N <- 1e3
df <- tibble(x = rnorm(N, 0, 200), y = rnorm(N, 0, 200)) 

h <- c(MASS::bandwidth.nrd(df$x), MASS::bandwidth.nrd(df$y))

kdeout <- MASS::kde2d(
             x = df$x, y = df$y, n = c(100, 100), h = h,
             lims = c(
               scales::expand_range(range(df$x), .10),
               scales::expand_range(range(df$y), .10)
             )
           )

df <- expand.grid(x = kdeout$x, y = kdeout$y)

normalize <- function(v) v / sum(v)

df$fhat <- as.vector(kdeout$z)
df$fhat_discretized <- normalize(df$fhat)

prob_above_c <- function(df, c) {
  if (length(c) > 1) return(sapply(c, prob_above_c))
  sum(df$fhat_discretized[df$fhat >= c])
}

find_cutoff <- function(df, conf) {
  if (length(conf) > 1) return(vapply(conf, \(x) find_cutoff(df, x), numeric(1)))
  uniroot(\(c) prob_above_c(df, c) - conf, lower = 0, upper = 1e4, check.conv = TRUE)$root
}

find_cutoff(df, c(.99, .95, .80, .5))
#> [1] 0.000000e+00 0.000000e+00 0.000000e+00 7.450581e-05

uniroot(\(c) prob_above_c(df, c) - .5, lower = 0, upper = 1e4, check.conv = TRUE)
#> $root
#> [1] 7.450581e-05
#> 
#> $f.root
#> [1] -0.5
#> 
#> $iter
#> [1] 27
#> 
#> $init.it
#> [1] NA
#> 
#> $estim.prec
#> [1] 7.450581e-05

Created on 2021-10-04 by the reprex package (v2.0.0)

jamesotto852 commented 3 years ago

Okay, I've fixed the problem. It seems that if df$fhat is on too small of a scale, uniroot fails in the find_cutoff function. I've implemented the fix here: 3ada8f0. I have also included a reprex below, showing that your example no longer fails.

library("tidyverse"); theme_set(theme_minimal())
library("ggdensity")

N <- 1e3
df <- tibble(x = rnorm(N, 0, 200), y = rnorm(N, 0, 200)) 

ggplot(df, aes(x, y)) + geom_point()

ggplot(df, aes(x, y)) + geom_hdr()

df <- df / 200

ggplot(df, aes(x, y)) + geom_point()

ggplot(df, aes(x, y)) + geom_hdr()

Created on 2021-10-21 by the reprex package (v2.0.0)