tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Violin quantiles are based on observations #5912

Open teunbrand opened 4 months ago

teunbrand commented 4 months ago

This PR aims to fix #4120.

Briefly, instead of GeomViolin estimating quantiles based on the density, StatYdensity includes quantile as a computed variable. GeomViolin is then able to draw the quantiles based on that computed variable.
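To illustrate why the two approaches can disagree, here is a minimal base R sketch. It is not the code in this PR: the density-based branch is only an approximation of the idea (integrate a kernel density estimate, then invert the cumulative curve), and the variable names are made up for the example. The data is the second group from the reprex.

```r
# Second group from the reprex
val   <- c(10, 9, 9, 9, 9, 7, 7, 6, 5, 4, 1)
probs <- c(0.25, 0.5, 0.75)

# Observation-based quantiles, computed directly from the data
obs_q <- unname(stats::quantile(val, probs))

# Density-based estimate (illustrative only): accumulate a kernel density
# estimate and invert the cumulative curve at the requested probabilities
dens   <- stats::density(val)
cum    <- cumsum(dens$y) / sum(dens$y)
dens_q <- stats::approx(cum, dens$x, xout = probs, ties = "ordered")$y

# Side-by-side comparison of the two sets of quantiles
print(rbind(observed = obs_q, density = dens_q))
```

The observation-based row is what `stats::quantile()` reports for the raw values; the density-based row depends on the kernel and bandwidth, which is why it need not line up with a boxplot drawn from the same data.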

I consider this a breaking change as StatYdensity now 'owns' the draw_quantiles parameter, rather than GeomViolin. For users, this should only matter when they are breaking the link between geom_violin() and stat_ydensity(). It is also a visual change, as the quantiles are now correct.

Reprex from the issue; notice how the red quantiles of the violin now align with the boxplot quantiles.

devtools::load_all("~/packages/ggplot2")
#> ℹ Loading ggplot2
set.seed(5)

types <- list(
  rnorm(n = 20, mean = 5),
  c(10, 9, 9, 9, 9, 7, 7, 6, 5, 4, 1),
  c(rnorm(n = 10, mean = 2.5), rnorm(n = 10, mean = 7.5, sd = 0.5))
)

df = data.frame(
  type = rep(paste0("Type ", seq_along(types)), lengths(types)),
  val  = unlist(types)
)

ggplot(df, aes(x = type, y = val)) + theme_classic() + 
  geom_boxplot(alpha = 0.5) + 
  geom_violin(scale = "area", alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75), 
              colour = "red") +
  geom_dotplot(binaxis = "y", stackdir = "center", alpha = 0.3, dotsize = 0.4)
#> Bin width defaults to 1/30 of the range of the data. Pick better value with
#> `binwidth`.

Created on 2024-05-28 with reprex v2.1.0

thomasp85 commented 2 months ago

I'm considering whether draw_quantiles should still be a parameter of the geom. It seems wrong that drawing is controlled by a stat.

Would it make sense to have the stat always compute the quantiles, and the geom draw them depending on the value of draw_quantiles?

teunbrand commented 2 months ago

In principle I agree with you, though there are some practicalities that may get in the way. The issue is that draw_quantiles is currently not a boolean: it carries the numeric probabilities of the quantiles to compute, which the stat needs to know about but the geom doesn't. Another issue is what the default computed quantiles should be, though they should probably be c(0.25, 0.5, 0.75).
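One way to picture the split being discussed is sketched below in base R. Everything here is hypothetical: `compute_violin_quantiles()`, `quantiles_to_draw()`, and the `quantiles` argument are invented names for illustration and are not part of ggplot2 or of this PR. The stat-side helper takes the numeric probabilities; the geom-side helper only receives a boolean.

```r
# Hypothetical stat-side helper: computes quantiles from the observations.
# `quantiles` is an assumed parameter name, defaulting to the values
# floated in the comment above.
compute_violin_quantiles <- function(y, quantiles = c(0.25, 0.5, 0.75)) {
  data.frame(
    prob  = quantiles,
    value = unname(stats::quantile(y, probs = quantiles))
  )
}

# Hypothetical geom-side helper: only decides whether anything is drawn.
quantiles_to_draw <- function(computed, draw_quantiles = TRUE) {
  if (isTRUE(draw_quantiles)) computed else computed[0, , drop = FALSE]
}

# Second group from the reprex
val      <- c(10, 9, 9, 9, 9, 7, 7, 6, 5, 4, 1)
computed <- compute_violin_quantiles(val)

print(quantiles_to_draw(computed, draw_quantiles = TRUE))
print(nrow(quantiles_to_draw(computed, draw_quantiles = FALSE)))
```

In this arrangement the geom never sees the probabilities, which matches the concern that they are information for the stat rather than for drawing.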