stan-dev / bayesplot

bayesplot R package for plotting Bayesian models
https://mc-stan.org/bayesplot
GNU General Public License v3.0
432 stars 83 forks source link

Point estimate plotted weirdly in mcmc_areas() #168

Closed jtimonen closed 5 years ago

jtimonen commented 6 years ago

kernel_densities

The point estimate (in dark blue) gets plotted like this, when I have multiple parameters to plot. Below is a code to reproduce the issue.

require(bayesplot)
set.seed(123)

K <- 2
X <- array(0,c(N,1,K))
for(k in 1:K){
  N <- 1000
  x <- rgamma(N, 1, 5)
  X[,1,k] <- x
}

dimnames(X)[[3]] <- as.list(paste("x", 1:K, sep="_"))
mcmc_areas(X, prob_outer=0.9, point_est = "median")

If I change median to mean, the height of the dark blue area is still off. If I change K to 1, it looks as it should. I am using version 1.6.0.

tjmahr commented 5 years ago

I diagnosed where the flaw occurs in the bayesplot code. The problem is that each band of color is scaled to fill 95% of the vertical space, which implicitly assumed that the median and mean also corresponded to the mode.

As I document in my blog post, the mcmc areas plot is made by layering on two or three ridgeline plots.

library(bayesplot)
library(tidyverse)
theme_set(theme_grey())

set.seed(123)

K <- 2
N <- 1000
X <- array(0,c(N,1,K))
for(k in 1:K){
  x <- rgamma(N, 1, 5)
  X[,1,k] <- x
}

dimnames(X)[[3]] <- as.list(paste("x", 1:K, sep="_"))

# Separate the full range so it can be drawn on each facet.
full_range <- X %>%
  # mutate(Parameter = as.character(Parameter)) %>%
  mcmc_areas_data(prob_outer = 0.9, point_est = "median") %>%
  filter(interval == "outer") %>%
  select(-interval)

d <- mcmc_areas_data(X) %>%
  mutate(
    interval = factor(interval, c("outer", "inner", "point")))

# Separate the full range so it can be drawn on each facet.
full_range <- d %>%
  filter(interval == "outer") %>%
  select(-interval)

p_base <- ggplot(d) +
  aes(x = x, y = parameter, fill = interval, height = density) +
  # For reference include the full-range.
  # `stat = "identity"` means that ggridges should not scale the heights
  ggridges::geom_density_ridges(
    stat = "identity", data = full_range, color = "grey80",
    size = 1, fill = NA, scale = .95) +
  ggridges::geom_density_ridges(stat = "identity", scale = .95) +
  scale_fill_brewer(type = "seq") +
  guides(fill = FALSE)

p_base +
  facet_wrap("interval")

image

In all of my development, the point estimate corresponded to the mode of the density curve, so that the outer interval, inner interval and point interval all contained the highest point of the curve. But this distribution breaks that assumption. The scale = .95 argument in geom_density_ridges() scales the ridgelines vertically to fill up .95 of the vertical space allotted to them. For the point interval, because it does not include the highest point in the density curve, the scaling stretches the point interval outside of the density curve.

The obvious fix is to scale each interval relative to the maximum height on the whole curve first. geom_density_ridges() supports an aesthetic mapping for scale (i.e., aes(scale = ...) so it can read the scaling from a dataframe column. So we can compute the scaling for each interval.

d <- d %>% 
  mutate(overall_max = max(density)) %>% 
  group_by(interval) %>% 
  mutate(local_max = max(density)) %>% 
  ungroup() %>% 
  mutate(local_scale = local_max / overall_max) 

Then plotting like before

ggplot(d) +
  aes(x = x, y = parameter, fill = interval, height = density) +
  ggridges::geom_density_ridges(
    stat = "identity", data = full_range, color = "grey80",
    size = 1, fill = NA, scale = .95) +
  ggridges::geom_density_ridges(
    # Get the scaling factor from a dataframe column
    aes(scale = .95 * local_scale), 
    stat = "identity") +
  scale_fill_brewer(type = "seq") +
  guides(fill = FALSE) +
  facet_wrap("interval")

image

I can make this fix for our code.

But for a more durable solution, we might consider using newly added the quantile calculations in ggridges, as demoed in the vignette.