tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Can ggplot2 have a Stat that simply summarises data by group? #3501

Open yutannihilation opened 5 years ago

yutannihilation commented 5 years ago

Every time I encounter a question like #3497, I wonder why ggplot2 doesn't have a Stat that simply applies a function by group. In terms of computational efficiency, it's generally better to summarise the data before passing it to ggplot2, but it would be handy to be able to summarise inside ggplot2, especially when generating plots one after another with different groupings.
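For comparison, summarising before the data reaches ggplot2 might look roughly like the sketch below. The data frame `d`, the column names, and the choice of summary (the x range and mean y per group) are assumptions made for illustration only. The summary step has to be repeated by hand whenever the grouping changes, which is the inconvenience described above.

```r
library(ggplot2)

# Hypothetical example data: two groups with overlapping x ranges
d <- data.frame(
  x = c(1:5, 3:7),
  y = 1:10,
  g = rep(c("a", "b"), each = 5),
  stringsAsFactors = FALSE
)

# Summarise outside ggplot2: one row per value of g
summ <- do.call(rbind, lapply(split(d, d$g), function(df) {
  data.frame(
    g    = df$g[1],
    x    = min(df$x),
    xend = max(df$x),
    y    = mean(df$y),
    yend = mean(df$y)
  )
}))

# The summarised data is passed to the layer explicitly
p <- ggplot(d) +
  geom_point(aes(x, y, colour = g)) +
  geom_segment(data = summ, aes(x, y, xend = xend, yend = yend)) +
  facet_grid(cols = vars(g))
```

Printing `p` draws the points with one horizontal segment per facet.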

I believe StatSummary could have been implemented so that it can summarise data with groupings other than c("group", "x"), because the following code looks quite general:

https://github.com/tidyverse/ggplot2/blob/b8420241309c8eea00d7086002c01cdf38a50eac/R/stat-summary.r#L163-L169

However, the current make_summary_fun() expects a function that takes a vector, not a data.frame, so it would be difficult to extend StatSummary to accept a function that summarises both x and y. So, to satisfy this need, I feel it might be nice to have a simple Stat like the one below.

I don't see reasons why we shouldn't implement such a Stat. Am I missing something...?

library(ggplot2)

stat_summary_by_group <- function(mapping = NULL, data = NULL,
                                  geom = "pointrange", position = "identity",
                                  ...,
                                  fun.data = NULL,
                                  na.rm = FALSE,
                                  show.legend = NA,
                                  inherit.aes = TRUE) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatSummaryByGroup,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      fun.data = fun.data,
      na.rm = na.rm,
      ...
    )
  )
}

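StatSummaryByGroup <- ggproto("StatSummaryByGroup", Stat,
  # Called once per group; `data` holds only that group's rows
  compute_group = function(data, scales, fun.data = NULL, na.rm = FALSE) {
    # fun.data receives the whole data.frame for the group and returns the summary
    summary <- fun.data(data)
    # Keep the columns that are constant within the group (unexported ggplot2 helpers)
    out <- ggplot2:::dapply(data, c("group"), ggplot2:::uniquecols)
    # Add the summarised columns, overwriting any of the same name
    out[names(summary)] <- summary
    out
  }
)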

d <- data.frame(x = c(1:5, 3:7), y = 1:10, g = rep(c("a", "b"), each = 5), stringsAsFactors = FALSE)
f <- function(d) {
  data.frame(x = min(d$x), xend = max(d$x), y = mean(d$y), yend = mean(d$y))
}

ggplot(d) +
  geom_point(aes(x, y, colour = g)) +
  stat_summary_by_group(fun.data = f, aes(x, y, xend = stat(xend), yend = stat(yend)), geom = "segment") +
  facet_grid(cols = vars(g))

Created on 2019-08-24 by the reprex package (v0.3.0)
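To make the contrast with make_summary_fun() concrete, here is a minimal sketch of the two function contracts discussed above. The names `summarise_y` and `summarise_xy` are hypothetical. The first takes a vector of y values, which is the kind of function the issue says make_summary_fun() expects. The second takes the group's data.frame, as the proposed Stat would.

```r
# Hypothetical vector-based summary: sees only the y values
summarise_y <- function(y) {
  data.frame(y = mean(y), ymin = min(y), ymax = max(y))
}

# Hypothetical data.frame-based summary: sees every column of the group,
# so it can summarise x and y together
summarise_xy <- function(df) {
  data.frame(x = min(df$x), xend = max(df$x), y = mean(df$y), yend = mean(df$y))
}

d <- data.frame(x = c(1:5, 3:7), y = 1:10, g = rep(c("a", "b"), each = 5))
group_a <- d[d$g == "a", ]

res_y  <- summarise_y(group_a$y)   # columns: y, ymin, ymax
res_xy <- summarise_xy(group_a)    # columns: x, xend, y, yend
```

Only the second form can produce columns such as `xend`, because the vector-based form never sees x.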

thomasp85 commented 5 years ago

I agree this makes some sense, and it would be a good fallback for situations where the provided stats don't do exactly what the user needs. One thing that complicates it is documenting what kind of columns should be returned. This is quite dependent on the geom the stat gets coupled with, and will require some knowledge of how ggplot2 works.
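As a rough illustration of that dependency, each Geom object exposes a `required_aes` field that can be inspected. The helper `missing_aes()` below is hypothetical and not part of ggplot2. It assumes that alternatives in `required_aes` are written with a `|` separator, as in recent ggplot2 versions, and treats any one alternative as sufficient.

```r
library(ggplot2)

# Each Geom lists the aesthetics it cannot draw without;
# the exact contents vary between ggplot2 versions
GeomSegment$required_aes
GeomPointrange$required_aes

# Hypothetical helper: which required aesthetics are not covered
# by the columns a summary function returns?
missing_aes <- function(geom, summary_cols) {
  # Split entries such as "xend|yend" into their alternatives
  required <- strsplit(geom$required_aes, "|", fixed = TRUE)
  satisfied <- vapply(
    required,
    function(alts) any(alts %in% summary_cols),
    logical(1)
  )
  geom$required_aes[!satisfied]
}

# A summary returning only x and y is not enough for a pointrange
missing_aes(GeomPointrange, c("x", "y"))

# A summary returning x, y, xend and yend covers a segment
missing_aes(GeomSegment, c("x", "y", "xend", "yend"))
```

A check like this only covers required aesthetics. Optional ones, such as colour or size, still depend on the geom's defaults.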