tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Way to request `stat_bin()` to inherit breaks from the scale #6159

Open arcresu opened 1 month ago

arcresu commented 1 month ago

I often need to produce histograms where the x axis uses a date scale, typically binned by day, week, or month. The only sensible end result is where the scale's breaks align with the bins, but the existing methods I'm aware of for getting there are a bit fragile:

library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

Use geom_bar() and a binned scale

ggplot(df, aes(date)) +
  geom_bar() +
  scale_x_binned(
    transform = scales::transform_date(),
    breaks = scales::breaks_width("1 week")
  )
#> Warning in scale_x_binned(transform = scales::transform_date(), breaks = scales::breaks_width("1 week")): Ignoring `n.breaks`. Use a breaks function that supports setting number of
#> breaks.

Nice because the binning is specified only once, but now the whole scale is binned, so I can't for example add a geom_vline() to mark a specific date on the axis, since the vertical line would then be snapped into a bin by the scale transform.

Use stat_bin()


ggplot(df, aes(date)) +
  geom_histogram(binwidth = 7, closed = "right") +
  scale_x_date(date_breaks = "1 week")

The naive approach leaves the scale breaks and the bins unaligned (offset by 0.5 days here). Of course this can be improved by specifying a bin boundary or manually passing breaks, but this gets a bit fiddly and fragile.
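As a minimal sketch of the "bin boundary" variant mentioned above: bin edges are computed on the numeric representation of the dates (days since 1970-01-01), so `boundary` can be anchored on a known Monday. The choice of 2024-01-01 as the anchor is an assumption made for illustration, and whether the weekly scale breaks also land on Mondays should be checked on the rendered plot rather than taken for granted.

```r
library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

# Numeric position of a known Monday; used only to anchor the bin edges.
anchor <- as.numeric(as.Date("2024-01-01"))

p <- ggplot(df, aes(date)) +
  geom_histogram(binwidth = 7, boundary = anchor, closed = "right") +
  scale_x_date(date_breaks = "1 week")

# Left bin edges as computed by the stat (numeric days since 1970-01-01).
edges <- layer_data(p)$xmin

# Distance of every edge from the anchor, in whole weeks.
weeks_from_anchor <- (edges - anchor) / 7
```

The fragility is visible here: the anchor date and the 7-day width have to be kept in sync by hand with whatever the scale is asked to display.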

Since #5963 there's a better workaround:


ggplot(df, aes(date)) +
  geom_histogram(breaks = function(x) { scales::breaks_width("1 week")(as.Date(range(x))) }) +
  scale_x_date(date_breaks = "1 week")

Created on 2024-10-25 with reprex v2.1.1

which is the result I want. However, there's duplication of the breaks and transforms between the scale and the stat. Ideally I'd like a way to request stat_bin() to just use the scale's breaks.

It's technically possible, since StatBin::compute_group (where the bins are computed) already has access to the scale object, but I'm not sure if it violates any sort of ggplot API encapsulation principles to have the scale directly affecting the stat's output in the way I'm proposing.

The same situation applies for stat_bin_2d() and stat_summary_bin(). I'd be happy to open a PR if there's agreement about the idea. I'm imagining either new value/s for breaks or a new param mutually exclusive with breaks that lets users choose to use the corresponding scale's major or minor breaks for the stat's binning breaks.

teunbrand commented 1 month ago

On the one hand, I like the idea. On the other hand, I don't think it can be implemented cleanly.

The issue is that scales recompute their ranges, which form the basis for the breaks, in between when the stats are calculated and when the graphics are drawn. It means that another layer can invalidate the breaks that are used for binning, and the scale ends up displaying different breaks. However, this should not be an issue if fixed breaks are used.

To demonstrate the principle, we can make a quick and dirty extension that takes breaks from the scale. We see that it doesn't really work well: the computed bins span less than the full range of the data, and the final breaks end up different from the intermediate breaks used for binning.

library(ggplot2)
set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

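# Quick-and-dirty stat: if no breaks are supplied, borrow the x scale's breaks
# as they stand at the moment the stat is computed.
StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    # get_breaks() returns breaks in transformed space; invert them so that
    # they are in data space, which is what the `breaks` parameter expects.
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)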

p <- ggplot(df, aes(date)) +
  geom_histogram(stat = StatBin2)
p
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

However this works fine when breaks are fixed.

p + scale_x_date(breaks = "5 days")
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-10-25 with reprex v2.1.1

> I'm not sure if it violates any sort of ggplot API encapsulation principles to have the scale directly affecting the stat's output in the way I'm proposing.

I think the principle ggplot2 tries to adhere to is that scales and layers only communicate through the data and not directly with one another. On a personal level, I think it is fine to read out scale settings at the Stat$compute_group() stage, but not fine to write scale settings. Pre-computing breaks and setting these as the scale's breaks should not happen.

arcresu commented 4 weeks ago

Thanks for the detailed reply. After posting I experimented in an extension and discovered the hitch about the ranges being recomputed that you mentioned. It seems that it kind of works without fixed breaks for the scale as long as you're careful that additional layers are within the range of the data for the "main" layer and you disable scale expansion...

I might have some details wrong, but my understanding is that first the scales are trained on the mapped data for all layers to compute the initial range. Then the stats are computed, then finally the scales are reset and trained again via the facet. If the scale doesn't have fixed breaks then the breaks are recomputed whenever the range changes. For fixed breaks, the only thing that's done is to discard any breaks outside the final range. The main reason for the scale re-training is to give the facets control over whether the range should be synchronised or free between panels.
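The re-training described above can be observed from the outside. The following is a small sketch on made-up numeric data (not the date example from this thread): it compares the range of the raw data, which is what the scale knows before the stats run, with the limits the scale reports once the plot has been built.

```r
library(ggplot2)

# Toy data, chosen only to make the ranges easy to read.
df <- data.frame(x = c(1, 2, 3, 4, 5))

p <- ggplot(df, aes(x)) +
  geom_histogram(binwidth = 1)

# Range of the mapped data, i.e. what the scale sees before stats are computed.
raw_range <- range(df$x)

# Limits of the x scale after the full build; layer_scales() builds the plot.
final_range <- layer_scales(p)$x$get_limits()

raw_range
final_range
```

The final limits are wider than the raw range because the re-trained scale also covers the bin edges produced by the stat, which is why breaks computed from the first range may no longer match what is drawn.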

Given all that, it seems what we'd need is a type of breaks on the scale that's in between computed and fixed. The scale could compute the breaks at the first step (before stats), and from then on act as though those breaks were passed as fixed breaks, including preserving them across re-training. I think that wouldn't break any core principles, because the retraining is primarily about the range (which would still be reset) rather than the breaks. The scale expansion etc. could continue to work as intended but just wouldn't cause the breaks to be changed, which is already how fixed breaks work.

So in summary, the binning stats would get an "inherit breaks from scale" option and the (continuous?) position scales would get a "compute breaks once before stats then freeze them" option. I'll continue experimenting.
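For comparison, the "compute once, then freeze" behaviour can be approximated in user code today by computing the breaks from the raw data range up front and handing the same fixed vector to both the stat and the scale. This is only a sketch of that workaround, not the proposed feature; `shared_week_layers()` is a hypothetical helper name and not part of ggplot2.

```r
library(ggplot2)

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

# Compute weekly breaks once, from the raw data range.
week_breaks <- scales::breaks_width("1 week")(range(df$date))

# Hypothetical helper: returns a histogram layer and a date scale that
# share one fixed vector of breaks.
shared_week_layers <- function(breaks) {
  list(
    geom_histogram(breaks = breaks),
    scale_x_date(breaks = breaks)
  )
}

p <- ggplot(df, aes(date)) +
  shared_week_layers(week_breaks)
```

Because the breaks are fixed before the plot is built, later re-training of the scale can change its range but not the break positions. The cost is that the data has to be inspected outside of ggplot2, which is the duplication this issue is trying to avoid.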

teunbrand commented 3 weeks ago

I understand why this is useful and agree that it would be nice to have, but the idea doesn't mesh particularly well with the way ggplot2 is implemented. I'm not sure that upending the break calculations, which are already not straightforward due to plenty of heuristics, is worth saving the effort of having to provide breaks twice. However, if you find a way in your experiments to achieve this in a minimally invasive way, we'd be happy to reconsider.