filter does not use variables in global environment when variable name is the same as a column name

lmullen commented 7 years ago

dplyr::filter() does not use variables in the global environment in logical comparisons when the variable has the same name as a column. For instance, in the example below I would expect all three filtering calls to be equivalent, i.e., to return only the rows where mtcars$cyl == 6. But the second call returns every row where the cyl column is equal to itself, which is of course all of the rows.

library(dplyr)
cyl <- 6
cyl_to_filter <- 6
mtcars %>% filter(cyl == 6) %>% nrow()
#> [1] 7
mtcars %>% filter(cyl == cyl) %>% nrow()
#> [1] 32
mtcars %>% filter(cyl == cyl_to_filter) %>% nrow()
#> [1] 7

I could have sworn that earlier versions of dplyr would recognize a variable on the RHS of a == being the one in a global environment.

JohnMount commented 7 years ago

I guess you want the opposite of "dplyr pronouns" (which force references to the data frame): an "environment pronoun", and I am not sure dplyr has one. I can think of a workaround such as this (but I doubt it helps with any actual application you have):

suppressPackageStartupMessages(library("dplyr"))

# somebody in the environment quotes the 6 into cyl so we can use !!
cyl <- rlang::quo(6)

mtcars %>% filter(cyl == !!cyl) %>% nrow()
#> [1] 7

Closest I can think of is:

suppressPackageStartupMessages(library("dplyr"))

# value of cyl just happens to be in the environment,
# not specialized for our use (as we may not be the 
# only user of it).
cyl <- 6

env <- environment()
mtcars %>% filter(cyl == get('cyl', env)) %>% nrow()
#> [1] 7
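
For readers on later releases, an "environment pronoun" of the kind described above is available. The following is a minimal sketch, assuming dplyr 1.0 or later (newer than the version discussed in this thread), where the data mask provides .env alongside .data. The result names n_env and n_both are illustrative only.

```r
library(dplyr)

cyl <- 6

# .env$cyl looks up `cyl` in the calling environment, skipping the column.
n_env <- mtcars %>% filter(cyl == .env$cyl) %>% nrow()

# .data$cyl is the complementary pronoun: it always refers to the column.
n_both <- mtcars %>% filter(.data$cyl == .env$cyl) %>% nrow()
```

Both calls count the same rows as filter(cyl == 6); spelling out both pronouns simply makes the source of each name explicit.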

foo-bar-baz-qux commented 7 years ago

To add to @JohnMount's suggestion, enquo() might also be a useful tidyeval function to consider, depending on your use case. Suppose cyl is passed as an argument to a function:

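filter_test <- function(cyl) {
  # capture the caller's expression (and its environment) as a quosure
  cyl <- enquo(cyl)
  # unquote it on the RHS; the LHS `cyl` still refers to the column
  mtcars %>% filter(cyl == (!!cyl)) %>% nrow()
}

target_val <- 6
filter_test(target_val)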
#> [1] 7
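
A related sketch, under the assumption that the function only needs the argument's value and not the caller's expression: unquoting the argument directly with !! inlines the value before the data mask is consulted. The helper name filter_by_value is hypothetical and does not appear in the thread.

```r
library(dplyr)

# Hypothetical helper: the argument is treated as a plain value.
filter_by_value <- function(cyl) {
  # `!!cyl` is evaluated in the function's environment while filter()
  # captures its arguments, so the RHS becomes the value itself.
  mtcars %>% filter(cyl == !!cyl) %>% nrow()
}

# The caller's variable deliberately shares the column's name.
cyl <- 6
n_rows <- filter_by_value(cyl)
```

This matters mainly when the caller's variable has the same name as a column: with enquo(), the captured expression is still a name, and it may resolve to the column again once it is evaluated inside the data mask.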

lionel- commented 7 years ago

> I could have sworn that earlier versions of dplyr would recognize a variable on the RHS of a == being the one in a global environment.

If that's true then it was a bug. The precedence of data frame columns over contextual objects is standard R semantics and that's how it works everywhere: lm() formulas, subset(), ggplot2, dplyr, etc.
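
The same precedence can be seen without dplyr. A minimal base R sketch, reusing the variable names from the original report (the result names n_masked, n_env and n_with are illustrative only):

```r
# Base R only; `cyl` and `cyl_to_filter` mirror the names used earlier in the thread.
cyl <- 6
cyl_to_filter <- 6

# Inside subset(), `cyl` on both sides resolves to the column,
# so the condition holds for every row.
n_masked <- nrow(subset(mtcars, cyl == cyl))

# A name that is not a column falls through to the calling environment.
n_env <- nrow(subset(mtcars, cyl == cyl_to_filter))

# with() follows the same rule: columns are searched before the
# enclosing environment.
n_with <- sum(with(mtcars, cyl == cyl_to_filter))
```

Here n_masked counts all 32 rows of mtcars, while n_env and n_with count only the 7 six-cylinder rows, matching the filter() results in the original report.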

If you'd like to be explicit about where the variable comes from, you can now use the tool of quasiquotation:

mtcars %>% filter(cyl == (!! cyl)) %>% nrow()

!! always evaluates in the context and bypasses the data frame.

JohnMount commented 7 years ago

> I could have sworn that earlier versions of dplyr would recognize a variable on the RHS of a == being the one in a global environment.

Even the current dplyr recognizes the RHS as being in the environment in some situations (but not in others). It appears to depend (in very confusing detail) on the nature of the verb and the data supplier (in-memory versus back-end).

The following example shows how the handling of RHS terms changes with the situation.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
#> [1] '0.7.4'

genderTarget = 'male'

starwars %>% 
  summarize(fracMale = mean(ifelse(gender == genderTarget, 1, 0), 
                            na.rm = TRUE))
#> # A tibble: 1 x 1
#>    fracMale
#>       <dbl>
#> 1 0.7380952

# matches simple calculation 0.7380952
mean(starwars$gender==genderTarget, na.rm = TRUE)
#> [1] 0.7380952

my_db <- dplyr::src_sqlite(":memory:", create = TRUE)
# or can connect with:
# my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# RSQLite::initExtension(my_db) # needed for many summary fns.
starwars_db <- copy_to(my_db, 
                       select(starwars, -vehicles, -starships, -films), 
                       'starwars_db')

# works
starwars_db %>% transmute(genderTarget = genderTarget)
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>    genderTarget
#>           <chr>
#>  1         male
#>  2         male
#>  3         male
#>  4         male
#>  5         male
#>  6         male
#>  7         male
#>  8         male
#>  9         male
#> 10         male
#> # ... with more rows

# fails
starwars_db %>% 
  summarize(fracMale = mean(ifelse(gender == genderTarget, 1, 0), 
                            na.rm = TRUE))
#> na.rm not needed in SQL: NULL are always droppedFALSE
#> Error in rsqlite_send_query(conn@ptr, statement): no such column: genderTarget

# fails
starwars_db %>% 
  summarize(fracMale = mean(if_else(gender == genderTarget, 1, 0), 
                            na.rm = TRUE))
#> na.rm not needed in SQL: NULL are always droppedFALSE
#> Error in rsqlite_send_query(conn@ptr, statement): no such column: genderTarget

# attempted fix: use "!!" to force-bind the target to its environment value
# but notice the result 0.7126437 does not match the in-memory result 0.7380952
starwars_db %>% 
  summarize(fracMale = mean(ifelse(gender == !!genderTarget, 1, 0), 
                            na.rm = TRUE))
#> na.rm not needed in SQL: NULL are always droppedFALSE
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>    fracMale
#>       <dbl>
#> 1 0.7126437

# attempted fix: use "!!" to force-bind the target to its environment value
# but notice the result 0.7126437 does not match the in-memory result 0.7380952
starwars_db %>% 
  summarize(fracMale = mean(if_else(gender == !!genderTarget, 1, 0), 
                            na.rm = TRUE))
#> na.rm not needed in SQL: NULL are always droppedFALSE
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>    fracMale
#>       <dbl>
#> 1 0.7126437

# workaround
starwars_db %>% 
  summarize(fracMale = sum(ifelse(gender == !!genderTarget, 1, 0))/
              sum(!is.na(gender)))
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>    fracMale
#>       <dbl>
#> 1 0.7380952
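
The thread does not say why the "!!" versions return 0.7126437 instead of 0.7380952. One plausible reading, offered here as an assumption rather than something the thread confirms: in SQL the comparison against a NULL gender is itself NULL, so a translated CASE expression would fall through to its ELSE branch and yield 0, leaving those rows in the denominator of the average. In memory, ifelse() returns NA for those rows and na.rm = TRUE drops them. The sketch below imitates both behaviours in plain R on a small made-up vector (the data and the names in_memory, sql_like and workaround are hypothetical).

```r
# Made-up data: six rows, two with a missing gender.
gender <- c("male", "male", "female", NA, "male", NA)
target <- "male"

# In-memory behaviour: NA == "male" is NA, ifelse() keeps the NA,
# and na.rm = TRUE removes it from both numerator and denominator.
in_memory <- mean(ifelse(gender == target, 1, 0), na.rm = TRUE)

# Imitation of the assumed SQL behaviour: a missing comparison counts
# as "not true", so the row contributes 0 and stays in the denominator.
sql_like <- mean(ifelse(!is.na(gender) & gender == target, 1, 0))

# Same shape as the workaround above: divide by the non-missing count.
workaround <- sum(ifelse(!is.na(gender) & gender == target, 1, 0)) /
  sum(!is.na(gender))

# The two figures reported in the thread are consistent with
# 62 matches over 87 rows versus 62 matches over 84 non-missing rows.
ratio_all_rows <- 62 / 87
ratio_non_missing <- 62 / 84
```

On this toy vector in_memory and workaround both come to 0.75, while sql_like comes to 0.5, which is the same kind of gap as the one between the in-memory and database results above.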

tidyverse/dplyr issue #3139