tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Robust `position_dodge(preserve = "single")` #5928

Closed teunbrand closed 4 months ago

teunbrand commented 4 months ago

This PR aims to fix #2801 and revives #2813.

Briefly, instead of counting the number of rows per PANEL/xmin interaction, we now count the number of unique groups per PANEL/xmin interaction. Arguably, this is the metric that should be counted.
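To make the difference concrete, here is a minimal base-R sketch on a made-up layer data frame. The column values are hypothetical and chosen only for illustration. The PR itself uses {vctrs} rather than `unique()` and `table()`. It mimics a geom such as a violin, where one group contributes several rows at the same `xmin`.

``` r
# Hypothetical layer data: one panel, two x positions.
# At xmin = 1 there are two groups; group 1 contributes three rows.
data <- data.frame(
  PANEL = c(1L, 1L, 1L, 1L, 1L),
  xmin  = c(1, 1, 1, 1, 2),
  group = c(1L, 1L, 1L, 2L, 3L)
)

# Counting rows per PANEL/xmin: inflated by the repeated rows of group 1
n_rows <- max(table(data$PANEL, data$xmin))

# Counting unique groups per PANEL/xmin: drop repeated group rows first
unique_groups <- unique(data[c("group", "PANEL", "xmin")])
n_groups <- max(table(unique_groups$PANEL, unique_groups$xmin))

n_rows
#> [1] 4
n_groups
#> [1] 2
```

With row counting, the dodge width would be divided as if four elements shared a position. With group counting it is divided by two, which matches what is actually drawn.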

The reprex from the issue now shows violin plots of appropriate widths:

``` r
devtools::load_all("~/packages/ggplot2")
#> ℹ Loading ggplot2

ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(vs))) +
  geom_violin(position = position_dodge(preserve = "single"))
#> Warning: Groups with fewer than two datapoints have been dropped.
#> ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
```

Created on 2024-06-03 with reprex v2.1.0

One concern raised in https://github.com/tidyverse/ggplot2/pull/2813#issuecomment-420779013 was performance. This PR uses {vctrs} and, in the benchmark below, runs roughly two to three times faster than the current implementation.

``` r
library(vctrs)

old_strategy <- function(data) {
  panels <- unname(split(data, data$PANEL))
  ns <- vapply(panels, function(panel) max(table(panel$xmin)), double(1))
  max(ns)
}

new_strategy <- function(data) {
  n <- vec_unique(data[c("group", "PANEL", "xmin")])
  n <- vec_group_id(n[c("PANEL", "xmin")])
  max(tabulate(n, attr(n, "n")))
}

data <- transform(
  ggplot2::diamonds,
  PANEL = 1L,
  group = interaction(color, clarity),
  xmin  = clarity
)

# Replicate boxplot: 1-row per group/xmin combination
data <- data[!duplicated(data[c("PANEL", "group", "xmin")]), ]

bench::mark(
  old_strategy(data),
  new_strategy(data)
)
#> # A tibble: 2 × 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old_strategy(data)   99.6µs  103.6µs     9235.   133.3KB     43.1
#> 2 new_strategy(data)   29.2µs   30.5µs    31879.    16.8KB     38.3

# If we are generous towards the 'old strategy' by having smaller data to split
data <- data[c("PANEL", "group", "xmin")]

bench::mark(
  old_strategy(data),
  new_strategy(data)
)
#> # A tibble: 2 × 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old_strategy(data)   58.9µs     61µs    16072.    5.62KB     39.7
#> 2 new_strategy(data)   29.3µs   30.2µs    32595.    4.03KB     39.2
```

Created on 2024-06-03 with [reprex v2.1.0](https://reprex.tidyverse.org)