huftis closed this issue 6 years ago
This is just the consistent semantics of evaluation: if a symbol is not found in the data frame, it is looked up in the context.
However, it might make sense to make an exception for select() and pull(), as their semantics are already a bit different from those of other verbs (e.g. function calls are not evaluated within the data frame). Then we can always unquote with !! if we really want a variable defined in the environment. It's easy to document as well: symbols are evaluated in the data frame alone, expressions are evaluated in the environment alone.
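As an illustration of the unquoting escape hatch mentioned above, here is a minimal sketch, assuming dplyr 0.7 or later is installed. The data set (iris) and the variable name vars are chosen purely for illustration and are not taken from the original report.

```r
library(dplyr)

# Hypothetical environment variable holding column names (illustrative only).
vars <- c("Sepal.Width", "Petal.Length")

# `!!` unquotes `vars`, so its value is taken from the environment
# explicitly instead of going through the data-first symbol lookup.
picked <- select(iris, !!vars)

names(picked)
```

Because the unquoting is explicit, a reader can tell from the call alone that vars refers to an environment variable rather than to a column.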
(John Mount here) In my opinion the "Species" name should not be given two chances to match the frame. I get that if you match the frame you go to the frame, and if not you go to the context. But what is happening is that if either the name "Species" or the contents of "Species" match the frame, the frame wins, and only then does the system look out to the context. Under this interpretation a user cannot read a code fragment and know what it does without also knowing what value (if any) "Species" may be carrying.
That's dplyr semantics since the beginning so we're not going to change them. Scoping has always been data first, context second. With tidyeval we offer several ways of being more explicit.
That said, select() is special enough that we might make an exception.
> That's dplyr semantics since the beginning
It's worth noting that model formulas from base R and even S work the same way.
> It's worth noting that model formulas from base R and even S work the same way.
Formulas in base R do not have the double-lookup-in-the-frame property. I agree that looking in two places (frame, then environment) is good. But trying the frame twice is not good (new or not).
In a base R formula, model.matrix(~x, data.frame(x='y', y=1)) does not evaluate to the same as model.matrix(~y, data.frame(x='y', y=1)) just because the x column happens to contain a column name. Similarly, x='y'; model.matrix(~x, data.frame(y=1)) also (rightly) does not evaluate to the same value as model.matrix(~y, data.frame(y=1)).
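The base R lookup order described here can be checked with a small sketch. It uses model.frame() rather than model.matrix() so that the looked-up variable can be inspected directly; the objects d, x, from_data, and from_env are illustrative names, not taken from the thread.

```r
# A data frame with a numeric column named `y`.
d <- data.frame(y = c(1, 2, 3))

# An environment variable `x` whose *value* happens to be a column name.
x <- c("y", "y", "y")

# `y` is found in the data frame, so the data column is used.
from_data <- model.frame(~ y, data = d)

# `x` is not a column of `d`, so it is looked up in the formula's
# environment. Its contents are used as-is; they are not matched
# against the column names of `d` a second time.
from_env <- model.frame(~ x, data = d)

names(from_env)
as.character(from_env$x)
```

The second model frame has a single column named x holding the strings from the environment, not the numeric y column of d.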
For dplyr 0.7.1 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) selecting the cyl column. This means that, before even going to the environment, both the variable name and the value in the variable are checked against the data.frame. This also seems to be a special case for select(). Observe that mpg <- 'cyl' ; transmute(mtcars, mpg) and v <- 'cyl' ; transmute(mtcars, v) don't snoop inside the variables mpg and v to get column names.
> That's dplyr semantics since the beginning
For dplyr 0.5.0 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) throwing an error. However, mpg <- 2; select(mtcars, mpg) does select the mpg column, while v <- 2; select(mtcars, v) selects the cyl column. So the behavior was there for integers, but not for strings.
> For dplyr we have
No, this is only for verbs with selection semantics. For those verbs, the columns in the data environment represent column positions, not column values. Then we select based on those positions. This helps simplify the implementation of selection helpers. This also explains the behaviour we're seeing here.
Again: we will consider making an exception for the scoping of symbols in selection verbs because the semantics are special.
The semantics were reconsidered as part of tidyselect and will be incorporated into dplyr once we use tidyselect.
select(df, colname) should issue an error message when colname is not a column in df. However, if there exists a (global) colname variable which is a character vector, the columns in df corresponding to the elements in colname are instead returned. Basically, select(df, colname) works like select_(df, .dots=colname) iff there is no column named colname in df.
The last command gives the error message
#> Error in overscope_eval_next(overscope, expr): object 'myvar2' not found
The next to last line should also have given a similar error message, since there is no myvar column in df. But instead, it returns a tibble with the columns Sepal.Width and Petal.Length. This is unexpected and dangerous behaviour.
I observe this bug with dplyr 0.7.1 and the latest GitHub version (as of 2017-06-24).