tidyverse / ggplot2

An implementation of the Grammar of Graphics in R
https://ggplot2.tidyverse.org

Feature request: support for tibble aesthetics #4189

Open davidchall opened 4 years ago

davidchall commented 4 years ago

I'm developing a ggplot2 extension where it would be helpful to pass a tibble column as an aesthetic.

As a simple motivating example, you could imagine a geom_box() layer with a bounds aesthetic that expects a tibble with columns xmin, ymin, xmax and ymax. Then I could simply do geom_box(bounds = bbox) and the layer will use the nested columns to draw the box.

To do this, we need to make use of nested tibbles (i.e. a tibble column within a tibble).

library(tidyverse)

dat <- tibble(a = tibble(x = 1:5, y = 6:10))

# works fine
ggplot(dat) +
  geom_point(aes(x = a$x, y = a$y))

So far, so good. But these aesthetics are still just vectors. Let's try using the tibble column a.

(Obviously, geom_point() doesn't expect a tibble column for its x aesthetic, but it does generate the relevant error.)

ggplot(dat) +
  geom_point(aes(x = a, y = a$y))
#> Don't know how to automatically pick scale for object of type tbl_df/tbl/data.frame. Defaulting to continuous.
#> Error: Aesthetics must be either length 1 or the same as the data (5): x

The warning can be removed by defining a new scale_type() method. But the error is generated by this line: https://github.com/tidyverse/ggplot2/blob/813d0bd8c7d4df3b3cba170de54233076b346c1c/R/geom-.r#L211

The length of a data frame is the number of columns (2) instead of the number of rows (5), so we get this error. This is the same problem that vctrs::vec_size() addresses.

Would it be possible to simply avoid this error by replacing length() with nrow(), NROW() or vec_size()? Or would there be other repercussions?
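The size mismatch described above can be reproduced in base R. This is a minimal sketch that uses a plain data.frame in place of a tibble (an assumption made only to avoid extra dependencies); the column names are illustrative.

```r
# A data frame is a list of columns, so length() counts columns, not rows.
bounds <- data.frame(xmin = 1:5, ymin = 6:10)

n_cols <- length(bounds)  # number of columns
n_rows <- NROW(bounds)    # number of observations

# NROW() also accepts plain vectors, where it agrees with length().
n_vec <- NROW(1:5)

# A data frame stored as a column of another data frame behaves the same way.
dat <- data.frame(id = 1:5)
dat$a <- bounds
sizes <- c(length(dat$a), NROW(dat$a))  # columns first, then rows
```

Here `length()` reports 2 and `NROW()` reports 5, which is why a check based on `length()` rejects a data frame column that has the right number of rows.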

yutannihilation commented 4 years ago

Thanks. I'm not sure if this is really doable, but I think data frame columns ("nested tibble," or "packed column"? I don't know the proper term for this...) should be supported. This might be a nice reason to start importing vctrs.

davidchall commented 4 years ago

To understand the motivation with a more fleshed out example, please take a look at this vignette. This package solves the problem using a vctrs-based class that behaves effectively like a data frame whose length() returns vec_size().

teunbrand commented 4 years ago

Disclaimer: the following is mostly out of selfish reasons, so feel free to dismiss entirely.

I think replacing the length() with NROW() is more appropriate than replacing it with vec_size(). Consider the following line from the vec_size() documentation:

vec_size() is equivalent to NROW() but has a name that is easier to pronounce, and throws an error when passed non-vector inputs.

The "throws an error when passed non-vector inputs" part will cause problems for things that are effectively vectors but aren't implemented as typical vectors (example: https://github.com/tidyverse/ggplot2/issues/3835).
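A rough illustration of the difference between the two functions, using a bare function as a stand-in for a non-vector object (an assumption for the example; it is not the object type from the linked issue). The vctrs call is guarded so the sketch still runs where vctrs is not installed.

```r
# A function is not a vector in the vctrs sense.
f <- function(x) x

# Base R: NROW() falls back to length() when there is no dim attribute.
nrow_result <- NROW(f)

# vctrs: vec_size() is expected to refuse non-vector input.
size_result <- if (requireNamespace("vctrs", quietly = TRUE)) {
  tryCatch(vctrs::vec_size(f), error = function(e) "error")
} else {
  "vctrs not installed"
}
```

`NROW()` quietly returns a size, whereas `vec_size()` signals an error, which the `tryCatch()` converts to the string "error".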

clauswilke commented 4 years ago

I hadn't really processed the existence of NROW(). Is there any reason not to use it right now?

@teunbrand Why don't you create a PR so we can see if any tests fail.

teunbrand commented 4 years ago

Alright I created the PR and experimented a bit. Here are some problems I ran into while attempting to make tibble aesthetics work.

data.frame constructor

In the code below: https://github.com/tidyverse/ggplot2/blob/ac2b5a7bb460179e0cc8c4b3204795317dcfb9b8/R/performance.R#L7 we would run into the same problem, so if tibble aesthetics were to be supported, this would have to change too.

scales

The next issue I ran into was that the scales didn't accept the tibble, as was to be expected. As I imagined an extension supporting tibble aesthetics, I made a quick and dirty scale_type.tbl_df(), scale_x_tibble_continuous() and ScaleContinuousTibble ggproto to test with (based on scale_x_continuous() and ScaleContinuousPosition).

transformation checking

Next, I ran into a problem with transformation checking. In particular, in the lines below: https://github.com/tidyverse/ggplot2/blob/ac2b5a7bb460179e0cc8c4b3204795317dcfb9b8/R/scale-.r#L1155 we get the same error as the following code produces:

is.finite(tibble::tibble(x = 1))
#> Error in is.finite(tibble::tibble(x = 1)): default method not implemented for type 'list'

I commented out the check_transformation() line in the ScaleContinuousTibble$transform() method.

scale_apply

Lastly, I got stuck at the scale training part in the scale_apply() function, in the following piece of code: https://github.com/tidyverse/ggplot2/blob/ac2b5a7bb460179e0cc8c4b3204795317dcfb9b8/R/layout.R#L302-L304 If you can parse this with all the brackets and parentheses (or use the RStudio debugger), the data[[var]] object is still an intact tibble. However, by double-bracket subsetting a tibble we are subsetting by column, whereas for scale training you'd want to subset by observation/row. At this point I decided to stop exploring.
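The column-versus-row distinction can be shown in base R. This is a sketch with a plain data.frame and made-up column names, not the ggplot2 code itself.

```r
bounds <- data.frame(xmin = 1:5, xmax = 6:10)
idx <- c(1L, 2L)

# List-style indexing treats the data frame as a list of columns.
one_col <- bounds[[2]]  # the xmax column as a plain vector
by_col  <- bounds[idx]  # columns 1 and 2, all five rows

# Matrix-style indexing selects observations, which is what training a
# scale on a subset of rows would need.
by_row <- bounds[idx, , drop = FALSE]  # rows 1 and 2, both columns

dims_col <- dim(by_col)
dims_row <- dim(by_row)
```

The same index vector therefore yields five rows when applied list-style and two rows when applied to the row dimension.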

teunbrand commented 4 years ago

Giving this another thought, the vctrs rcrd S3 class is essentially a data.frame in that it is a collection of fields with vectors of equal length, just as the data.frame/tibble is practically a list with vectors of equal length. I can imagine it being easier to implement a rcrd subclass than to change the internals of ggplot to comply with rectangular data structures.

Potential downside is that you'd probably have to mirror quite a few of the scales package's functions that aren't S3 generics.
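As a minimal sketch of the rcrd idea, assuming vctrs is installed: the class name "bbox" and its fields are hypothetical and chosen only to mirror the bounds example from the original post.

```r
if (requireNamespace("vctrs", quietly = TRUE)) {
  # Hypothetical "bbox" class: one record per box, one field per coordinate.
  bbox <- vctrs::new_rcrd(
    list(xmin = 1:5, ymin = 6:10, xmax = 2:6, ymax = 7:11),
    class = "bbox"
  )

  n_records <- length(bbox)                # counts records, not fields
  xmin_vals <- vctrs::field(bbox, "xmin")  # one field as a plain vector
  first_two <- bbox[1:2]                   # single-bracket subsetting selects records
}
```

Because `length()` and `[` operate on records rather than fields, such an object passes a length-based check and subsets by observation. A full class would also want a `format()` method for printing.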

davidchall commented 4 years ago

Hi @teunbrand - thank you for digging into this problem so deeply! It turned out to be quite complex, so I can understand if you wouldn't want to merge this.

I found it reassuring that you came to the vctrs_rcrd solution too - this is exactly what I've been doing up until now 👍

yutannihilation commented 4 years ago

Thanks for your efforts! Let me leave some quick comments.

transformation checking

We can define another generic function like is_finite() to handle data frames transparently. So, this doesn't seem a serious problem to me.
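One way such a generic might look, as a sketch only: the name is_finite() and its row-wise semantics for data frames are assumptions for illustration, not an existing ggplot2 API.

```r
# Hypothetical S3 generic; falls back to base is.finite() for ordinary vectors.
is_finite <- function(x) UseMethod("is_finite")

is_finite.default <- function(x) is.finite(x)

# For a data frame, treat a row as finite only if every column is finite there.
# Data frame columns dispatch recursively to this same method.
is_finite.data.frame <- function(x) {
  Reduce(`&`, lapply(x, is_finite), rep(TRUE, nrow(x)))
}

packed <- data.frame(x = c(1, NA, 3), y = c(1, 2, Inf))
finite_rows <- is_finite(packed)
```

With this definition the check returns one logical value per observation, so it lines up with how is.finite() behaves on an ordinary vector.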

scale_apply

This made me think ggplot2 needs a proper integration with vctrs (so that we can use vec_slice()?) to achieve this feature request.

One more thing I want to emphasize is that it would be a "data.frame" column, not a "tibble" column, if we support this. I once experimented with using tibble inside ggplot2, but it seems impossible because many functions convert a tibble to a data.frame.

c.f. https://github.com/tidyverse/ggplot2/pull/3048

teunbrand commented 4 years ago

Yes an (exported) is_finite() generic for which people can write is_finite.myclass() methods seems like a good idea to me; I had to work around this before as well.

... ggplot2 needs a proper integration with vctrs ...

The scales package would then also need to support vctrs, or convert some of its functions to (S3) generics (out-of-bounds handling, range expansion, scale training, scale transformations, etc.).

clauswilke commented 3 years ago

I'm not sure that exporting a function named is_finite() is a good idea. It seems likely to clash with a function in some other package, if not now then at some point in the future. Some other name for the function would be fine.

hadley commented 3 years ago

I suspect it isn’t worth doing this piecemeal but would be better left until we attack a vctrs integration.

thomasp85 commented 2 years ago

As the vctrs integration is happening now, I'm reading through these older issues. While vctrs would solve some of the issues described herein, this is obviously deeper than that and touches on the basis of the API itself.

There is no concept of multivalue scales in ggplot2 so training a scale on a tibble column makes zero sense. In the example with a bounds aesthetic you'd also want to use it to train the x and y scale rather than a bounds scale so this is even more out-of-line with how aesthetics work currently.

We will be facing similar issues when thinking about how to e.g. support grid gradients, because those are made up of several different values: some relate to position, others to colour mapping, etc.

All of this is to say that the impending vctrs integration will do very little to move this issue forward and what is really needed is not coding but deep thoughts about how this should conceptually work.

One small idea we could discuss is to have something like "aesthetic unpacking" in layers, where you can unpack a tibble column into separate aesthetics automatically. This could be done in a single step, after which everything would proceed as normal.
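A rough sketch of what such an unpacking step could do to the layer data, in base R. The helper unpack_column() is hypothetical and only illustrates the idea; it is not a ggplot2 function.

```r
# Hypothetical helper: replace one data frame column with its inner columns.
unpack_column <- function(data, col) {
  packed <- data[[col]]
  stopifnot(is.data.frame(packed))
  rest <- data[setdiff(names(data), col)]
  cbind(rest, packed)
}

layer_df <- data.frame(id = 1:3)
layer_df$bounds <- data.frame(xmin = 1:3, xmax = 4:6)

unpacked <- unpack_column(layer_df, "bounds")
unpacked_names <- names(unpacked)
```

After unpacking, xmin and xmax are ordinary top-level columns, so each could be handled as a regular aesthetic with its usual scale.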

clauswilke commented 2 years ago

@thomasp85 It's definitely possible to write scales that train on multi-dimensional input. I experimented with this a long time ago. At the time, the biggest challenge was actually to pass the data through.

https://github.com/clauswilke/multiscales

If you think about it, the geometry column in sf objects is also multi-dimensional input.

In general, I think there are two distinct categories of cases:

  1. Multi-dimensional input gets mapped onto a single output. E.g., two dimensions of input determine a specific color. This can be handled simply by implementing a scale function that can perform the mapping. This doesn't require any other changes in ggplot, and it's the scenario I played around with in my linked demo.

  2. Multi-dimensional input gets mapped onto multiple outputs, and possibly x/y coordinates. This requires more work, and typically an appropriate geom and/or coord. Example is the entire geom_sf() infrastructure.
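The first category can be sketched as a plain mapping function from a two-column data frame to one colour per row. The function name map_bivariate() and the red/blue encoding are assumptions for illustration and are not taken from the linked multiscales package.

```r
# Hypothetical mapping: first column drives red, second column drives blue.
# Assumes each column has at least two distinct finite values.
map_bivariate <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) == 2)
  rescale01 <- function(v) {
    rng <- range(v, na.rm = TRUE)
    (v - rng[1]) / (rng[2] - rng[1])
  }
  grDevices::rgb(
    red   = rescale01(df[[1]]),
    green = rep(0, nrow(df)),
    blue  = rescale01(df[[2]])
  )
}

input <- data.frame(a = c(0, 1), b = c(1, 0))
colours <- map_bivariate(input)
```

Two input dimensions go in and a single colour vector comes out, which is the shape of mapping a scale for this category would need to perform.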

teunbrand commented 2 months ago

Let's leave the scale-side of the problem to extension devs and just move any barriers on the ggplot2 side out of the way. I feel that this should just return the 2-column data.frame that we put in. (label was chosen as I think no computation occurs on that column)

library(ggplot2)

data <- mtcars
data$df <- data.frame(x = mtcars$cyl, y = mtcars$carb)

p <- ggplot(data, aes(disp, mpg, label = df)) +
  geom_text()

layer_data(p)$label
#> Don't know how to automatically pick scale for object of type <data.frame>.
#> Defaulting to continuous.
#> Error in `geom_text()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error in `check_aesthetics()` at ggplot2/R/layer.R:334:5:
#> ! Aesthetics must be either length 1 or the same as the data (32).
#> ✖ Fix the following mappings: `label`.

Created on 2024-09-04 with reprex v2.1.1