Closed jtimonen closed 5 years ago
I diagnosed where the flaw occurs in the bayesplot code. The problem is that each band of color is scaled to fill 95% of the vertical space, which implicitly assumed that the median and mean also corresponded to the mode.
As I document in my blog post, the mcmc areas plot is made by layering on two or three ridgeline plots.
library(bayesplot)
library(tidyverse)
theme_set(theme_grey())
set.seed(123)
K <- 2
N <- 1000
X <- array(0,c(N,1,K))
for(k in 1:K){
x <- rgamma(N, 1, 5)
X[,1,k] <- x
}
dimnames(X)[[3]] <- as.list(paste("x", 1:K, sep="_"))
# Separate the full range so it can be drawn on each facet.
full_range <- X %>%
# mutate(Parameter = as.character(Parameter)) %>%
mcmc_areas_data(prob_outer = 0.9, point_est = "median") %>%
filter(interval == "outer") %>%
select(-interval)
d <- mcmc_areas_data(X) %>%
mutate(
interval = factor(interval, c("outer", "inner", "point")))
# Separate the full range so it can be drawn on each facet.
full_range <- d %>%
filter(interval == "outer") %>%
select(-interval)
p_base <- ggplot(d) +
aes(x = x, y = parameter, fill = interval, height = density) +
# For reference include the full-range.
# `stat = "identity"` means that ggridges should not scale the heights
ggridges::geom_density_ridges(
stat = "identity", data = full_range, color = "grey80",
size = 1, fill = NA, scale = .95) +
ggridges::geom_density_ridges(stat = "identity", scale = .95) +
scale_fill_brewer(type = "seq") +
guides(fill = FALSE)
p_base +
facet_wrap("interval")
In all of my development, the point estimate corresponded to the mode of the density curve, so that the outer interval, inner interval and point interval all contained the highest point of the curve. But this distribution breaks that assumption. The scale = .95
argument in geom_density_ridges()
scales the ridgelines vertically to fill up .95 of the vertical space allotted to them. For the point interval, because it does not include the highest point in the density curve, the scaling stretches the point interval outside of the density curve.
The obvious fix is to scale each interval relative to the maximum height on the whole curve first. geom_density_ridges()
supports an aesthetic mapping for scale (i.e., aes(scale = ...)
so it can read the scaling from a dataframe column. So we can compute the scaling for each interval.
d <- d %>%
mutate(overall_max = max(density)) %>%
group_by(interval) %>%
mutate(local_max = max(density)) %>%
ungroup() %>%
mutate(local_scale = local_max / overall_max)
Then plotting like before
ggplot(d) +
aes(x = x, y = parameter, fill = interval, height = density) +
ggridges::geom_density_ridges(
stat = "identity", data = full_range, color = "grey80",
size = 1, fill = NA, scale = .95) +
ggridges::geom_density_ridges(
# Get the scaling factor from a dataframe column
aes(scale = .95 * local_scale),
stat = "identity") +
scale_fill_brewer(type = "seq") +
guides(fill = FALSE) +
facet_wrap("interval")
I can make this fix for our code.
But for a more durable solution, we might consider using newly added the quantile calculations in ggridges, as demoed in the vignette.
The point estimate (in dark blue) gets plotted like this, when I have multiple parameters to plot. Below is a code to reproduce the issue.
If I change median to mean, the height of the dark blue area is still off. If I change K to 1, it looks as it should. I am using version 1.6.0.