tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

scale_x_binned() doesn't work with geom_tile() #5294

Open hughjonesd opened 1 year ago

hughjonesd commented 1 year ago

The following code works, producing output with xmin and xmax aligned to bins:

library(ggplot2)

ggplot(mtcars, aes(mpg, hp)) +
  geom_rect(aes(xmin = mpg - 5, xmax = mpg + 5, ymin = hp - 5, ymax = hp + 5)) +
  scale_x_binned(breaks = seq(0, 60, 10))

But this code, which ought to do the same thing, produces an empty plot:

ggplot(mtcars, aes(mpg, hp)) +
  geom_tile(aes(x = mpg, y = hp, height = 10, width = 10)) +
  scale_x_binned(breaks = seq(0, 60, 10))

The underlying reason is in the second call to layout$map_position() in ggplot_build().

  1. There, the binned scale tries to remap x variables back from a factor to (the binned version of) their original values. For GeomRect, which has xmin and xmax from the start, this works.
  2. But GeomTile calculates xmin and xmax from x and width. By the time it gets to layer$compute_geom_1, x has been transformed to a "factor"-style numeric of bins. The geom doesn't realise this and happily adds the original width to the bin.
  3. Then the second call to layout$map_position() takes this wonky data and turns it back, typically to NA.
  4. Finally when the geom displays, the NA values for xmin and xmax are removed.

In other words, GeomTile$setup_data() is being called after the first map_position(), but in this case at least, it needs to be called before it.
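One way to check the steps above is to look at the data each layer ends up with after the build, before anything is drawn. The following is a minimal sketch using layer_data() on the two plots from this report; the object names are only for illustration, and the exact values (and whether any NA appear) may depend on the ggplot2 version installed.

```r
library(ggplot2)

breaks <- seq(0, 60, 10)

# geom_rect(): xmin and xmax are supplied directly as aesthetics.
rect_plot <- ggplot(mtcars, aes(mpg, hp)) +
  geom_rect(aes(xmin = mpg - 5, xmax = mpg + 5, ymin = hp - 5, ymax = hp + 5)) +
  scale_x_binned(breaks = breaks)

# geom_tile(): xmin and xmax are derived from x and width during the build.
tile_plot <- ggplot(mtcars, aes(mpg, hp)) +
  geom_tile(aes(height = 10, width = 10)) +
  scale_x_binned(breaks = breaks)

# Built layer data, i.e. what the geom receives before drawing.
rect_data <- layer_data(rect_plot, 1)
tile_data <- layer_data(tile_plot, 1)

head(rect_data[, c("xmin", "xmax")])
head(tile_data[, c("xmin", "xmax")])

# Count missing edges in the tile layer.
colSums(is.na(tile_data[, c("xmin", "xmax")]))
```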

This bug exists in ggplot2 3.4.2, and also on github main as of today.

teunbrand commented 1 year ago

Thanks for the report. geom_tile() and geom_rect() indeed aren't equivalent under scale transformations. The binned scale is equivalent to a scale transformation. I agree that the example is undesirable, and we've recently added this bit to the documentation to make the difference more clear:

https://github.com/tidyverse/ggplot2/blob/f7246d4ad9aeee46890f60e3821d762f390378ed/R/geom-tile.R#L16-L21

hughjonesd commented 1 year ago

Sure, but that doesn't quite cover it. The size of the tiles isn't being determined after transformation... it's being determined wrongly, and then the tiles aren't being displayed.

I think this is a real bug. Here's an example where the tiles are actually displayed in the wrong place:

ggp <- ggplot(data.frame(x = 2:4 + 0.5, y = 2:4), aes(x, y)) + geom_tile(width = .8, height = .25)

ggp # These should bin to 2, 3 and 4...

# but in fact...
ggp + scale_x_binned(breaks = 2:4)

teunbrand commented 1 year ago

I'm sorry I don't quite understand. How are they displayed wrongly? I've rendered an example below.

library(ggplot2)

tiled <- ggplot(data.frame(x = 2:4 + 0.5, y = 2:4), aes(x, y)) + 
  geom_tile(width = .8, height = .25)

tiled

tiled + scale_x_binned(breaks = 2:4)

To me, it seems that geom_rect() is doing the wrong thing with equivalent parametrisation:

rects <- ggplot(data.frame(xmin = 2:4 + 0.1, xmax = 2:4 + 0.9,
                           ymin = 2:4 - 0.125, ymax = 2:4 + 0.125)) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax))

rects

rects + scale_x_binned(breaks = 2:4)

Created on 2023-05-04 with reprex v2.0.2

hughjonesd commented 1 year ago

So my thought was: "2.5 - 0.4 = 2.1, should bin to 2; 2.5 + 0.4 = 2.9, should bin to 3". Actually I think the bins ought to be 2.5, 3.5 etc. i.e. midpoints of the breaks. But neither of those things is happening. Instead, x and y are scaled and width is not.

Here's a more extreme example:

ggp <- ggplot(data = NULL, aes(x, y)) + 
  geom_tile(data = data.frame(x = c(0, 5, 10), y = 1:3), width = 2, height = .25) + 
  geom_point(data = data.frame(x = c(-1, 1, 4, 6, 9, 11), y = rep(1:3, each = 2)), color = "red")
ggp 


ggp + scale_x_binned(breaks = c(0, 5, 10))


My expectation would be that the rectangle limits would be (-1, +1); (4,6) and (9, 11). The first and last ones have edges which are out of the limits of the binned scale, so maybe they are dropped, or maybe like the points they are just left alone. The second one would bin to (2.5, 7.5).

In fact: the first rectangle disappears. The second one goes to (-0.5, 7.5). The third one goes to (2.5, 10.5).

I don't think anyone would expect that - why would a rectangle (4,6) be mapped to (-0.5, 7.5) by binning to two bins from 0 to 5 and 5 to 10?

The real reason is that the first call to map_position has mapped x to c(1,2,3), representing the levels. Then the width gets calculated from this, creating xmin of c(0,1,2) and xmax of c(2,3,4). The second call to map_position then translates these back to their corresponding bin centres, creating xmin = c(NA, -0.5, 2.5) and xmax = c(2.5, 7.5, 10.5).
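The arithmetic in the previous paragraph can be reproduced in base R. This is only an illustrative sketch, not ggplot2's internal code: to_bin_index() and to_bin_centre() are hypothetical stand-ins for the two map_position() passes, and the limits of (-1, 11) are an assumption taken from the range of the point data, which turns the breaks c(0, 5, 10) into four bins.

```r
# Illustrative base-R arithmetic only; this is not ggplot2's internal code.
# Assumption: the scale limits are the data range (-1, 11), so the breaks
# c(0, 5, 10) produce four bins.
edges   <- c(-1, 0, 5, 10, 11)
centres <- (head(edges, -1) + tail(edges, -1)) / 2   # -0.5, 2.5, 7.5, 10.5

# Hypothetical helpers standing in for the two mapping passes.
to_bin_index  <- function(x) cut(x, edges, labels = FALSE, include.lowest = TRUE)
to_bin_centre <- function(i) centres[match(i, seq_along(centres))]

x     <- c(0, 5, 10)
width <- 2

# Order described above: bin first, then add and subtract half the width.
idx         <- to_bin_index(x)                  # 1, 2, 3
actual_xmin <- to_bin_centre(idx - width / 2)   # NA, -0.5, 2.5
actual_xmax <- to_bin_centre(idx + width / 2)   # 2.5, 7.5, 10.5

# Alternative order: compute the edges in data units, then bin them.
wanted_xmin <- to_bin_centre(to_bin_index(x - width / 2))   # -0.5, 2.5, 7.5
wanted_xmax <- to_bin_centre(to_bin_index(x + width / 2))   # 2.5, 7.5, 10.5
```

Under these assumptions the middle tile comes out as (2.5, 7.5) when the edges are computed first, and as (-0.5, 7.5) when the width is applied to the bin index.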

I don't think anyone who hasn't read the source code will understand this, or be able to use it for any practical purpose.

So yeah, the disclaimer in the documentation is better than nothing, but I think it would be simpler to just put "geom_tile doesn't work with binned scales".

Similar concerns apply with a logged scale:

ggp <- ggplot(data = NULL, aes(x, y)) +
  ylim(0, 2) +
  geom_tile(data = data.frame(x = 10, y = 1), width = 2, height = .25) +
  geom_point(data = data.frame(x = c(9, 11), y = 1), color = "green")
ggp


ggp + scale_x_log10()


This makes it look as if 9 is 1 and 11 is 100. Again, you can say that it is working according to the documentation, but the point is, how is it meant to represent data?
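The numbers behind that statement, as a small base-R sketch (plain arithmetic, not ggplot2 code; the variable names are only for illustration):

```r
x     <- 10
width <- 2

# What happens: x is transformed first, then the untransformed width is applied
# on the log10 axis, and the result is read back in data units.
x_log <- log10(x)                                     # 1
shown <- 10^c(x_log - width / 2, x_log + width / 2)   # 1, 100

# Applying the width in data units instead gives the edges the points mark.
in_data <- c(x - width / 2, x + width / 2)            # 9, 11
```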

My expectation as a user would be that I can use geom_tile to represent some data. Then if I choose to put that data on a log scale, or bin it or whatever, geom_tile keeps displaying the same answers using the new scale.

jfmusso commented 1 year ago

Does this also mean that geom_tile does not work with discrete scales (scale_x_discrete)? I'm struggling to get my plotted data into the correct categories on the X axis.

teunbrand commented 1 year ago

> So my thought was: "2.5 - 0.4 = 2.1, should bin to 2; 2.5 + 0.4 = 2.9, should bin to 3". Actually I think the bins ought to be 2.5, 3.5 etc. i.e. midpoints of the breaks.

I think that binning works slightly differently than you're expecting here. It is more of a findInterval() situation than 'snap to nearest break'.
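The difference between the two readings can be seen in base R with the tile edges from the earlier example. This is only a sketch of the two ideas; it does not claim to be how the binned scale is implemented internally.

```r
breaks <- 2:4
edges  <- c(2.1, 2.9)   # tile edges from the example: 2.5 - 0.4 and 2.5 + 0.4

# Interval lookup: both edges fall in the same interval, [2, 3).
interval_index <- findInterval(edges, breaks)   # 1, 1

# Snap-to-nearest, for contrast: the edges land on different breaks.
nearest_break <- breaks[vapply(
  edges,
  function(e) which.min(abs(e - breaks)),
  integer(1)
)]                                              # 2, 3
```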

The underlying reason that geom_tile() doesn't behave like geom_rect() is that width and height are not position aesthetics, and thus aren't transformed by scales. So a width = 2 on a log10 scale spans 2 orders of magnitude. While admittedly not great for scale transforms, this parametrisation does allow it to work seamlessly with many stats.

@jfmusso It works for discrete scales because you can combine continuous values on a discrete scale (but not the other way around). Discrete position scales are essentially seq_along(limits), so there is 1 axis unit between each level and a width = 2 spans 2 levels' worth of axis.
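As a rough base-R illustration of that point (the levels "a", "b", "c" are made up for the example, and this is arithmetic only, not ggplot2 code):

```r
limits    <- c("a", "b", "c")      # hypothetical discrete levels
positions <- seq_along(limits)     # 1, 2, 3: one axis unit between levels

# A tile centred on "b" with width = 2 reaches from "a" to "c".
centre     <- match("b", limits)   # 2
width      <- 2
tile_range <- c(centre - width / 2, centre + width / 2)   # 1, 3
```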

hughjonesd commented 1 year ago

Perhaps one issue is that there are different potential users for GeomTile? I get that it might be useful for developers who want to, e.g., place something at x, y with a "real" onscreen width. But this makes it hard to understand for end users, who have to think in terms of two different sets of coordinates.

Perhaps it might be helpful to separate the two functionalities, and provide a public-facing version of geom_tile that indeed works in data coordinates.
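Until something like that exists, one possible workaround is to compute the edges in data coordinates up front and hand them to geom_rect(), which is reported at the top of this thread to work with scale_x_binned(). A minimal sketch follows; tile_to_rect() is a hypothetical helper, not part of ggplot2.

```r
# Hypothetical helper, not part of ggplot2: turn centre/width/height into
# explicit edges, all expressed in data coordinates.
tile_to_rect <- function(data, width, height) {
  data$xmin <- data$x - width / 2
  data$xmax <- data$x + width / 2
  data$ymin <- data$y - height / 2
  data$ymax <- data$y + height / 2
  data
}

tiles <- tile_to_rect(
  data.frame(x = c(0, 5, 10), y = 1:3),
  width = 2, height = 0.25
)

# Usage, assuming ggplot2 is attached:
# ggplot(tiles) +
#   geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
#   scale_x_binned(breaks = c(0, 5, 10))
```

Because xmin and xmax are position aesthetics, they go through the scale like any other x value, so the rectangle edges follow the data rather than the bin index.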