Closed hadley closed 5 years ago
I think what needs to be caught are vectors of constants in aes()
. The same problem exists for parameters, but it may be more difficult to check systematically. (Maybe there is a geom somewhere that uses vectors of constants as parameters for valid reasons.)
Here is a slightly modified reprex from the off-topic discussion in #2660, to highlight the issues:
On StackOverflow I used to refer to this example
Possible implementations:
In aes()
, inspect the AST for uses of $
. This is simple, but there are legitimate uses of $
not related to retrieving a column from a data frame. However, we'd probably catch 90% of bad usage only by looking for top-level usage of $
, i.e. aes(df$x)
not aes(df$x + 2)
. Hopefully that would be enough of a hint to steer people in the right direction generally.
Evaluate aesthetics in an environment where $.data.frame
throws an error. This is probably going to be more challenging as R moves away from lexical scoping for S3 methods and towards pre-registration, but I'm sure we could make it work with sufficient ingenuity. The downside is that you wouldn't be able to refer to columns within a dataframe-column (which I think over the next few years the tidyverse is going to move towards supporting)
Neither of those techniques would catch this related problem:
df <- data.frame(x = 1:3)
y <- 3:1
ggplot(df, aes(x, y)) + geom_point()
Or code that's handy, but potentially problematic like aes(x, resid(mod))
.
I wonder if we should rethink how the facetting code works to ensure that we don't mess up vectors that don't come from the data? It might be possible to fix by evaluating the aesthetics in a different way (although that might
Then identifying $
usage would just be a style hint, not something that is critical for correct ggplot2 code.
The problem arises with the reordering of the data frame on this line: https://github.com/tidyverse/ggplot2/blob/5e9b6889ef7f7d0ec87c7a22708d9f2f3c03b343/R/facet-wrap.r#L202
It's not immediately clear to me that the reordering does anything other than make the created data frame look nice. When I replace that line with just data
things seem to work. Only two regression tests fail, and those explicitly assume a particular order in the calculated data frame.
library(ggplot2) # patched line 202 in facet-wrap.r
df <- data.frame(x = 1:4,
y = 1:4,
label = c("a", "b", "a", "b"),
colour = c("magenta", "green", "magenta", "green"))
# works
ggplot(df, aes(x, y, colour = df$colour)) +
geom_point(size = 3) +
facet_wrap(~label) +
scale_colour_identity()
# works
ggplot(df, aes(x, y)) +
geom_point(size = 3, colour = df$colour) +
facet_wrap(~label) +
scale_colour_identity()
Created on 2018-06-10 by the reprex package (v0.2.0).
FWIW I think it is a simple design mistake (can't remember if I introduced it during the rewrite, but it doesn't matter anyway). map_data
should just add a PANEL
column to the data - nothing else
It's been there a long time, at least since https://github.com/tidyverse/ggplot2/commit/687ba6ef8a42b136fdb73cb5690b93cf272568ec.
Let's remove in the next version, and then think about how to make a warning message steering people away from $
usage.
I just made a pull request, so the required changes are recorded. I didn't encounter any problems with geom_point()
, but more extensive testing would be good.
If this is still something worth fixing, I think there are some possibilities. First, I think it is possible to inspect the AST and look for calls to $
where the target is the data:
library(rlang)
aes <- ggplot2::aes
is_dollar_sign_call <- function(call) {
is.call(call) && identical(call[[1]], quote(`$`))
}
is_double_bracket_call <- function(call) {
is.call(call) && identical(call[[1]], quote(`[[`))
}
is_extract_call <- function(call) {
is_dollar_sign_call(call) || is_double_bracket_call(call)
}
check_extract_usage_expr <- function(x, data, env = emptyenv()) {
if (is_extract_call(x)) {
data_eval <- eval_tidy(x[[2]], data, env)
if(identical(tracemem(data_eval), tracemem(data))) {
bad_call <- x
good_call <- x
good_call[[2]] <- quote(.data)
warning(
"Use of `", format(bad_call), "` is discouraged. ",
"Use `", format(good_call), "` instead.",
call. = FALSE
)
}
} else if (is.call(x)) {
lapply(x, check_extract_usage_expr, data, env)
} else if (is.pairlist(x)) {
lapply(x, check_extract_usage_expr, data, env)
}
invisible()
}
check_extract_usage_quo <- function(quosure, data) {
check_extract_usage_expr(get_expr(quosure), data, get_env(quosure))
}
check_extract_usage <- function(mapping, data) {
lapply(mapping, check_extract_usage_quo, data)
invisible(mapping)
}
# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))
# good
check_extract_usage(aes(x), df)
check_extract_usage(aes(.data$x), df)
check_extract_usage(aes(.data[["x"]]), df)
check_extract_usage(aes(.data[[!!quo("x")]]), df)
check_extract_usage(aes(.data[[returns_x()]]), df)
check_extract_usage(aes(!!sym("x")), df)
check_extract_usage(aes(x * 10), df)
check_extract_usage(aes(nested_df$x), df)
check_extract_usage(aes(nested_df[["x"]]), df)
check_extract_usage(aes(.data[[c("nested_df", "x")]]), df)
check_extract_usage(aes(.data[[c(2, 1)]]), df)
check_extract_usage(aes(.data[[1]]), df)
# bad: use of extraction
check_extract_usage(aes(df$x), df)
#> Warning: Use of `df$x` is discouraged. Use `.data$x` instead.
check_extract_usage(aes(df[["x"]]), df)
#> Warning: Use of `df[["x"]]` is discouraged. Use `.data[["x"]]` instead.
Created on 2019-05-28 by the reprex package (v0.2.1)
Similarly, it is possible to look for places where no columns from data
were mapped at all. This may help with #2689 (in that it gives some measure of which names were referred to), although I'm less sure of the corner cases on this one.
library(rlang)
aes <- ggplot2::aes
is_dollar_sign_call <- function(call) {
is.call(call) && identical(call[[1]], quote(`$`))
}
is_double_bracket_call <- function(call) {
is.call(call) && identical(call[[1]], quote(`[[`))
}
is_extract_call <- function(call) {
is_dollar_sign_call(call) || is_double_bracket_call(call)
}
is_pronoun_operation <- function(call) {
is.call(call) && identical(call[[2]], quote(.data))
}
index_value_to_name <- function(index, data) {
if (is.character(index)) {
index[1]
} else if (is.numeric(index)) {
names(data)[index[1]]
} else {
NULL
}
}
find_data_references <- function(x, data, env = emptyenv()) {
if (is.name(x) && (as.character(x) %in% names(data))) {
as.character(x)
} else if (is_double_bracket_call(x) && is_pronoun_operation(x)) {
index_value_to_name(eval_tidy(x[[3]], data, env), data)
} else if (is_dollar_sign_call(x) && is_pronoun_operation(x)) {
as.character(x[[3]])
} else if(is_dollar_sign_call(x)) {
find_data_references(x[[2]], data, env)
} else if (is.call(x)) {
new_names <- lapply(x, find_data_references, data, env)
unlist(new_names)
} else if (is.pairlist(x)) {
new_names <- lapply(x, find_data_references, data, env)
unlist(new_names)
}
}
find_data_references_quo <- function(quosure, data) {
find_data_references(get_expr(quosure), data, get_env(quosure))
}
find_mapped_cols <- function(mapping, data) {
intersect(unlist(lapply(mapping, find_data_references_quo, data)), names(data))
}
check_mapping <- function(mapping, data) {
data_name <- as_label(enquo(data))
cols <- find_mapped_cols(mapping, data)
if (length(cols) == 0) {
stop("At least one column in `", data_name, "` must be mapped", call. = FALSE)
}
invisible(mapping)
}
# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))
# good
check_mapping(aes(x), df)
check_mapping(aes(.data$x), df)
check_mapping(aes(.data[["x"]]), df)
check_mapping(aes(.data[[!!quo("x")]]), df)
check_mapping(aes(.data[[returns_x()]]), df)
check_mapping(aes(!!sym("x")), df)
check_mapping(aes(x * 10), df)
check_mapping(aes(nested_df$x), df)
check_mapping(aes(nested_df[["x"]]), df)
check_mapping(aes(.data[[c("nested_df", "x")]]), df)
check_mapping(aes(.data[[c(2, 1)]]), df)
check_mapping(aes(.data[[1]]), df)
# bad: use of extraction
check_mapping(aes(df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(df[["x"]]), df)
#> Error: At least one column in `df` must be mapped
# bad: no columns were mapped
check_mapping(aes(not_a_column_in_df), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(not_a_column_in_df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(), df)
#> Error: At least one column in `df` must be mapped
# error during evaluation
check_mapping(aes(.data[[not_a_function()]]), df)
#> Error in not_a_function(): could not find function "not_a_function"
Created on 2019-05-28 by the reprex package (v0.2.1)
I think that overall approach has legs, so I think it's worth turning into a PR so I can give it full review. A few quick thoughts before you do that:
You can collapse (e.g.) is_dollar_sign_call()
to is_call(x, "$")
which I think means you can eliminate the named functions
Use is_reference()
instead of identical()
+ tracemem()
I don't think we should be routinely warning people to use .data
. I think in most cases we need to tell them to just use the variable name
I'm worried that check_mapping()
is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)
I'm worried that
check_mapping()
is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)
Yes. It's important that examples such as the following continue to work. Not sure whether this would work with the current implementation of check_mapping()
.
library(ggplot2)
ggplot(mtcars, aes(disp, mpg)) +
geom_point(aes(color = "data")) +
geom_smooth(aes(color = "loess smooth"), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2019-05-30 by the reprex package (v0.2.1)
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
Quite a few people seem to do this, and it introduces potentially subtle bad behaviour when you start facetting. I think we now have sufficient understanding to detect this correctly, although we’ll need to carefully consider how to override
$
so ok usage is not caught.