tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.75k stars 2.12k forks source link

`select(df, colname)` sometimes impersonates `select_(df, .dots=colname)` #2904

Closed huftis closed 6 years ago

huftis commented 7 years ago

select(df, colname) should issue an error message when colname is not a column in df. However, if there exists a (global) colname variable which is a character vector, the columns in df corresponding to the elements in colname are instead returned. Basically, select(df, colname) works like select_(df, .dots=colname) iff there is no column named colname in df.

# Load package and example data
library(dplyr)
df = as_data_frame(iris)

# Create two identical variables string variables
Species = c("Sepal.Width", "Petal.Length")
myvar = Species

df %>% select(Species) # OK, returns Species column
df %>% select(myvar)   # Not OK, returns Sepal.Width & Petal.Length columns
df %>% select(myvar2)  # OK, returns error message

The last command gives the error message #> Error in overscope_eval_next(overscope, expr): object 'myvar2' not found. The next to last line should have also given a similar error message, since there is no myvar colmn in df. But instead, it returns a tibble with the columns Sepal.Width and Petal.Length. This is unexpected and dangerous behaviour.

I observe this bug with dplyr 0.7.1 and the latest GitHub version (as of 2017-06-24).

lionel- commented 7 years ago

This is just consistent semantics of evaluation, if a symbol is not found in the data frame, it is looked up in the context.

However it might make sense to make an exception for select() and pull() as the semantics are already a bit different than other verbs (e.g. function calls are not evaluated within the data frame). Then we can always unquote with !! if we really want a variable defined in the environment. It's easy to document as well: symbols are evaluated in the data frame alone, expressions are evaluated in the environment alone.

MZLABS commented 7 years ago

(John Mount here) In my opinion the "Species" name should not be given two chances to match the frame. I get that if you match the frame you go to the frame, and then if not go to the context. But what is happening is if the name "Species" or the contents of "Species" match the frame the frame wins, and only then the system looks out to the context. Under this interpretation a user can not read a code fragment and know what it does without also knowing what value (if any) "Species" may be carrying.

lionel- commented 7 years ago

That's dplyr semantics since the beginning so we're not going to change them. Scoping has always been data first, context second. With tidyeval we offer several ways of being more explicit.

That said, select() is special enough that we might make an exception.

lionel- commented 7 years ago

That's dplyr semantics since the beginning

It's worth noting that model formulas from base R and even S work the same way.

MZLABS commented 7 years ago

It's worth noting that model formulas from base R and even S work the same way.

Formulas in base R do not have the double lookup in the frame property. I agree looking two places (frame then environment) is good. But trying the frame twice is not good (new or not).

In a base R formula model.matrix(~x, data.frame(x='y', y=1)) does not evaluate to the same as model.matrix(~y, data.frame(x='y', y=1)) just because the x column happens to contain a column name. Similarly x='y'; model.matrix(~x, data.frame(y=1)) also (rightly) does not evaluate to the same value as model.matrix(~y, data.frame(y=1)).

For dplyr 0.7.1 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) selecting the cyl column. Meaning before even going to the environment both the variable name and the value in the variable are checked against the data.frame. This also seems to be a special case for select(). Observe that mpg <- 'cyl' ; transmute(mtcars, mpg) and v <- 'cyl' ; transmute(mtcars, v) don't snoop inside the variables mpg and v to get column names.

That's dplyr semantics since the beginning

For dplyr 0.5.0 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) throwing. Though mpg <- 2; select(mtcars, mpg) does select the mpg column while v <- 2; select(mtcars, v) selects the cyl column. So the behavior was there for integers, but not for strings.

lionel- commented 7 years ago

For dplyr we have

No, this is only for verbs with selection semantics. For those verbs, the columns in the data environment represent column positions, not column values. Then we select based on those positions. This helps simplifying the implementation of selecting helpers. This also explains the behaviour we're seeing here.

Again: we will consider making an exception for the scoping of symbols in selection verbs because the semantics are special.

hadley commented 6 years ago

The semantics were reconsidered as part of tidyselect and will be incorporated into dplyr once we use tidyselect.