tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org
Other
6.39k stars 2k forks source link

Fix overlap identification in `position_dodge2()` #5939

Open teunbrand opened 3 weeks ago

teunbrand commented 3 weeks ago

This PR aims to fix #4327 and fix #5938.

Briefly, find_x_overlap() is completely rewritten. The trick was to search wether xmin[i + 1] is larger than cummax(xmax)[i], then start a new range.

The reprex from #4327, note no misplacement of points:

devtools::load_all("~/packages/ggplot2")
#> ℹ Loading ggplot2

df <- data.frame(
  dimension = paste0("dim ", c(1, 1, 2, 2)),
  group = c("A", "B", "A", "B"),
  value = c(0.6, 0.7, 0.4, 0.6)
)

ggplot(df, aes(dimension, value, colour = group)) +
  geom_point(position = position_dodge2(width = 0))

Reprex from #5938, note range 3 is correctly identified as overlapping with 1 & 2.

f <- function(x, xend) find_x_overlaps(data.frame(xmin = x, xmax = xend))

df <- data.frame(
  xmin = c(1, 2, 3, 5),
  xmax = c(4, 3, 4, 6),
  group = factor(1:4)
)

ggplot(df, aes(x = xmin, xend = xmax, y = group)) +
  geom_segment(aes(colour = factor(f(xmin, xmax))))

Better dodging behaviour:

ggplot(df, aes(xmin = xmin, xmax = xmax)) +
  geom_rect(aes(ymin = 0, ymax = 1, fill = group),
            alpha = 0.75,
            position = position_dodge2())

Time considerations in details (tl;dr: it's faster)

``` r new <- function(df) { start <- df$xmin nonzero <- df$xmax != df$xmin missing <- is.na(df$xmin) | is.na(df$xmax) end <- cummax(c(df$xmax[1], df$xmax[-nrow(df)])) overlaps <- cumsum(start > end | (start == end & nonzero)) overlaps[missing] <- seq_len(sum(missing)) + max(overlaps, na.rm = TRUE) match(overlaps, unique0(overlaps)) } old <- function(df) { overlaps <- numeric(nrow(df)) overlaps[1] <- counter <- 1 for (i in seq_asc(2, nrow(df))) { if (is.na(df$xmin[i]) || is.na(df$xmax[i - 1]) || df$xmin[i] >= df$xmax[i - 1]) { counter <- counter + 1 } overlaps[i] <- counter } overlaps } df <- data.frame(xmin = runif(1000, max = 100)) df$xmax <- df$xmin + runif(1000, max = 2) df <- df[order(df$xmin, df$xmax), ] bench::mark( new(df), old(df), check = FALSE ) #> # A tibble: 2 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 new(df) 29.4µs 30.8µs 27374. 99.9KB 106. #> 2 old(df) 1.53ms 1.61ms 579. 95.5KB 8.42 ``` Also tested for smaller `nrow(df)` and new is faster than old from `nrow(df) == 5` onwards.

Created on 2024-06-07 with reprex v2.1.0