Warn about use of $ in aes()

hadley commented 6 years ago

Quite a few people seem to do this, and it introduces potentially subtle bad behaviour when you start facetting. I think we now have sufficient understanding to detect this correctly, although we’ll need to carefully consider how to override $ so ok usage is not caught.

clauswilke commented 6 years ago

I think what needs to be caught are vectors of constants in aes(). The same problem exists for parameters, but it may be more difficult to check systematically. (Maybe there is a geom somewhere that uses vectors of constants as parameters for valid reasons.)

Here is a slightly modified reprex from the off-topic discussion in #2660, to highlight the issues:

``` r library(ggplot2) df <- data.frame(x = 1:4, y = 1:4, label = c("a", "b", "a", "b"), colour = c("magenta", "green", "magenta", "green")) # using a data column inside aes causes problems, the colors don't match within facets ggplot(df, aes(x, y, colour = df$colour)) + geom_point(size = 3) + facet_wrap(~label) + scale_colour_identity() ``` ![](https://i.imgur.com/pWm7FH0.png) ``` r # directly handing the colors to `geom_point()` as a parameter is also wrong ggplot(df, aes(x, y)) + geom_point(size = 3, colour = df$colour) + facet_wrap(~label) ``` ![](https://i.imgur.com/BbV1hyz.png) ``` r # this is valid code, however, if a bit weird df2 <- data.frame(point = "point", label = "label") ggplot(df, aes(x, y)) + geom_point(aes(colour = df2$point), size = 3) + geom_text(aes(x = x + 0.2, label = label, colour = df2$label)) ``` ![](https://i.imgur.com/vUaLHS9.png) Created on 2018-06-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).

baptiste commented 6 years ago

On StackOverflow I used to refer to this example

hadley commented 6 years ago

Possible implementations:

In aes(), inspect the AST for uses of $. This is simple, but there are legitimate uses of $ not related to retrieving a column from a data frame. However, we'd probably catch 90% of bad usage only by looking for top-level usage of $, i.e. aes(df$x) not aes(df$x + 2). Hopefully that would be enough of a hint to steer people in the right direction generally.
Evaluate aesthetics in an environment where $.data.frame throws an error. This is probably going to be more challenging as R moves away from lexical scoping for S3 methods and towards pre-registration, but I'm sure we could make it work with sufficient ingenuity. The downside is that you wouldn't be able to refer to columns within a dataframe-column (which I think over the next few years the tidyverse is going to move towards supporting)

Neither of those techniques would catch this related problem:

df <- data.frame(x = 1:3)
y <- 3:1

ggplot(df, aes(x, y)) + geom_point()

Or code that's handy, but potentially problematic like aes(x, resid(mod)).

I wonder if we should rethink how the facetting code works to ensure that we don't mess up vectors that don't come from the data? It might be possible to fix by evaluating the aesthetics in a different way (although that might

Then identifying $ usage would just be a style hint, not something that is critical for correct ggplot2 code.

clauswilke commented 6 years ago

The problem arises with the reordering of the data frame on this line: https://github.com/tidyverse/ggplot2/blob/5e9b6889ef7f7d0ec87c7a22708d9f2f3c03b343/R/facet-wrap.r#L202

It's not immediately clear to me that the reordering does anything other than make the created data frame look nice. When I replace that line with just data things seem to work. Only two regression tests fail, and those explicitly assume a particular order in the calculated data frame.

library(ggplot2) # patched line 202 in facet-wrap.r

df <- data.frame(x = 1:4,
                 y = 1:4,
                 label = c("a", "b", "a", "b"),
                 colour =  c("magenta", "green", "magenta", "green"))

# works
ggplot(df, aes(x, y, colour = df$colour)) +
  geom_point(size = 3) +
  facet_wrap(~label) +
  scale_colour_identity()


# works
ggplot(df, aes(x, y)) +
  geom_point(size = 3, colour = df$colour) +
  facet_wrap(~label) +
  scale_colour_identity()

Created on 2018-06-10 by the reprex package (v0.2.0).

thomasp85 commented 6 years ago

FWIW I think it is a simple design mistake (can't remember if I introduced it during the rewrite, but it doesn't matter anyway). map_data should just add a PANEL column to the data - nothing else

hadley commented 6 years ago

It's been there a long time, at least since https://github.com/tidyverse/ggplot2/commit/687ba6ef8a42b136fdb73cb5690b93cf272568ec.

Let's remove in the next version, and then think about how to make a warning message steering people away from $ usage.

clauswilke commented 6 years ago

I just made a pull request, so the required changes are recorded. I didn't encounter any problems with geom_point(), but more extensive testing would be good.

paleolimbot commented 5 years ago

If this is still something worth fixing, I think there are some possibilities. First, I think it is possible to inspect the AST and look for calls to $ where the target is the data:

library(rlang)
aes <- ggplot2::aes

is_dollar_sign_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`$`))
}

is_double_bracket_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`[[`))
}

is_extract_call <- function(call) {
  is_dollar_sign_call(call) || is_double_bracket_call(call)
}

check_extract_usage_expr <- function(x, data, env = emptyenv()) {
  if (is_extract_call(x)) {
    data_eval <- eval_tidy(x[[2]], data, env)
    if(identical(tracemem(data_eval), tracemem(data))) {
      bad_call <- x
      good_call <- x
      good_call[[2]] <- quote(.data)
      warning(
        "Use of `", format(bad_call), "` is discouraged. ",
        "Use `", format(good_call),  "` instead.",
        call. = FALSE
      )
    }
  } else if (is.call(x)) {
    lapply(x, check_extract_usage_expr, data, env)
  } else if (is.pairlist(x)) {
    lapply(x, check_extract_usage_expr, data, env)
  }

  invisible()
}

check_extract_usage_quo <- function(quosure, data) {
  check_extract_usage_expr(get_expr(quosure), data, get_env(quosure))
}

check_extract_usage <- function(mapping, data) {
  lapply(mapping, check_extract_usage_quo, data)
  invisible(mapping)
}

# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))

# good
check_extract_usage(aes(x), df)
check_extract_usage(aes(.data$x), df)
check_extract_usage(aes(.data[["x"]]), df)
check_extract_usage(aes(.data[[!!quo("x")]]), df)
check_extract_usage(aes(.data[[returns_x()]]), df)
check_extract_usage(aes(!!sym("x")), df)
check_extract_usage(aes(x * 10), df)
check_extract_usage(aes(nested_df$x), df)
check_extract_usage(aes(nested_df[["x"]]), df)
check_extract_usage(aes(.data[[c("nested_df", "x")]]), df)
check_extract_usage(aes(.data[[c(2, 1)]]), df)
check_extract_usage(aes(.data[[1]]), df)

# bad: use of extraction
check_extract_usage(aes(df$x), df)
#> Warning: Use of `df$x` is discouraged. Use `.data$x` instead.
check_extract_usage(aes(df[["x"]]), df)
#> Warning: Use of `df[["x"]]` is discouraged. Use `.data[["x"]]` instead.

^{Created on 2019-05-28 by the reprex package (v0.2.1)}

Similarly, it is possible to look for places where no columns from data were mapped at all. This may help with #2689 (in that it gives some measure of which names were referred to), although I'm less sure of the corner cases on this one.

library(rlang)
aes <- ggplot2::aes

is_dollar_sign_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`$`))
}

is_double_bracket_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`[[`))
}

is_extract_call <- function(call) {
  is_dollar_sign_call(call) || is_double_bracket_call(call)
}

is_pronoun_operation <- function(call) {
  is.call(call) && identical(call[[2]], quote(.data))
}

index_value_to_name <- function(index, data) {
  if (is.character(index)) {
    index[1]
  } else if (is.numeric(index)) {
    names(data)[index[1]]
  } else {
    NULL
  }
}

find_data_references <- function(x, data, env = emptyenv()) {
  if (is.name(x) && (as.character(x) %in% names(data))) {
    as.character(x)
  } else if (is_double_bracket_call(x) && is_pronoun_operation(x)) {
    index_value_to_name(eval_tidy(x[[3]], data, env), data)
  } else if (is_dollar_sign_call(x) && is_pronoun_operation(x)) {
    as.character(x[[3]])
  } else if(is_dollar_sign_call(x)) {
    find_data_references(x[[2]], data, env)
  } else if (is.call(x)) {
    new_names <- lapply(x, find_data_references, data, env)
    unlist(new_names)
  } else if (is.pairlist(x)) {
    new_names <- lapply(x, find_data_references, data, env)
    unlist(new_names)
  }
}

find_data_references_quo <- function(quosure, data) {
  find_data_references(get_expr(quosure), data, get_env(quosure))
}

find_mapped_cols <- function(mapping, data) {
  intersect(unlist(lapply(mapping, find_data_references_quo, data)), names(data))
}

check_mapping <- function(mapping, data) {
  data_name <- as_label(enquo(data))
  cols <- find_mapped_cols(mapping, data)
  if (length(cols) == 0) {
    stop("At least one column in `", data_name, "` must be mapped", call. = FALSE)
  }

  invisible(mapping)
}

# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))

# good
check_mapping(aes(x), df)
check_mapping(aes(.data$x), df)
check_mapping(aes(.data[["x"]]), df)
check_mapping(aes(.data[[!!quo("x")]]), df)
check_mapping(aes(.data[[returns_x()]]), df)
check_mapping(aes(!!sym("x")), df)
check_mapping(aes(x * 10), df)
check_mapping(aes(nested_df$x), df)
check_mapping(aes(nested_df[["x"]]), df)
check_mapping(aes(.data[[c("nested_df", "x")]]), df)
check_mapping(aes(.data[[c(2, 1)]]), df)
check_mapping(aes(.data[[1]]), df)

# bad: use of extraction
check_mapping(aes(df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(df[["x"]]), df)
#> Error: At least one column in `df` must be mapped

# bad: no columns were mapped
check_mapping(aes(not_a_column_in_df), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(not_a_column_in_df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(), df) 
#> Error: At least one column in `df` must be mapped

# error during evaluation
check_mapping(aes(.data[[not_a_function()]]), df)
#> Error in not_a_function(): could not find function "not_a_function"

^{Created on 2019-05-28 by the reprex package (v0.2.1)}

hadley commented 5 years ago

I think that overall approach has legs, so I think it's worth turning into a PR so I can give it full review. A few quick thoughts before you do that:

You can collapse (e.g.) is_dollar_sign_call() to is_call(x, "$") which I think means you can eliminate the named functions
Use is_reference() instead of identical() + tracemem()
I don't think we should be routinely warning people to use .data. I think in most cases we need to tell them to just use the variable name
I'm worried that check_mapping() is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)

clauswilke commented 5 years ago

I'm worried that check_mapping() is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)

Yes. It's important that examples such as the following continue to work. Not sure whether this would work with the current implementation of check_mapping().

library(ggplot2)

ggplot(mtcars, aes(disp, mpg)) +
  geom_point(aes(color = "data")) +
  geom_smooth(aes(color = "loess smooth"), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

^{Created on 2019-05-30 by the reprex package (v0.2.1)}

lock[bot] commented 4 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

tidyverse / ggplot2

Warn about use of $ in aes() #2693