tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
478 stars 173 forks source link

filter_at() translation issue #296

Closed jakeybob closed 5 years ago

jakeybob commented 5 years ago

dbplyr 1.4.0 appears to translate filter_at() calls incorrectly, translating the variable name themselves, rather than their values. See following example:

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

mtcars2 <- tbl(con, "mtcars")
cols <- c("vs", "am", "gear", "carb")
nums <- c(4, 5)
num <- 1

test <- mtcars2 %>%
  filter(gear %in% nums, 
         gear == num) %>%
  filter_at(cols, any_vars(. %in% nums)) %>%
  filter_at(cols, any_vars(. == num))

show_query(test)

In dbplyr 1.3.0 this gives the following SQL query:

SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
WHERE (`vs` = 1.0 OR `am` = 1.0 OR `gear` = 1.0 OR `carb` = 1.0)

but in dbplyr 1.4.0 the variables are translated literally as "num" and "nums" in the SQL translation of the filter_at() function, although they are translated as expected from the filter() function.

SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN `nums` OR `am` IN `nums` OR `gear` IN `nums` OR `carb` IN `nums`))
WHERE (`vs` = `num` OR `am` = `num` OR `gear` = `num` OR `carb` = `num`)

This then leads to an error when querying the database via collect() as there are no num or nums variables on the database end.

lianos commented 5 years ago

We just ran into perhaps a related problem in parsing code into SQL queries, so I'll add some information here.

One workaround my colleague @tomsing1 found which also works here was to inject !! on the right hand side of the = in the filter_at calls.

I'm no expert in NSE and tideval, but I don't think it's typically expected to work that way(?).

For instance, using dbplyr 1.4.0, using this approach to fix to @jakeybob's problem looks like:

test <- mtcars2 %>%
  filter(gear %in% nums, gear == num) %>%
  filter_at(cols, any_vars(. %in% !!nums)) %>%
  filter_at(cols, any_vars(. == !!num))

Which will give this SQL

SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
WHERE (`vs` = 1.0 OR `am` = 1.0 OR `gear` = 1.0 OR `carb` = 1.0)

A Related (?) Problem

I think our problem might be related, because after upgrading to dbplyr_1.4.0, something changed in the translation of dplyr syntax to SQL.

To cast our problem into one that uses the same example: we had some parameters stored in a list (or a 1-row tibble), and we are accessing those values "inline" of our dplyr pipeline.

For instance, like this:

params <- list(num = 1, nums = 4:5)
filter(mtcars2, gear %in% params$nums, gear == params$num)

Generates this SQL:

SELECT *
FROM `mtcars`
WHERE ((`gear` IN (1.0 AS `num`, (4, 5) AS `nums`).(4.0, 5.0)) AND (`gear` = (1.0 AS `num`, (4, 5) AS `nums`).1.0))

Which will fail on collect()

But this works:

filter(mtcars2, gear %in% !!params$nums, gear == !!params$num)

Because it will generate this SQL

SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4, 5)) AND (`gear` = 1.0))

Session Info

─ Session info ────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.4        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2019-05-08                  

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
 callr         3.2.0   2019-03-15 [1] CRAN (R 3.6.0)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 DBI           1.0.0   2018-05-02 [1] CRAN (R 3.6.0)
 dbplyr      * 1.4.0   2019-04-23 [1] CRAN (R 3.6.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.0.2   2019-04-08 [1] CRAN (R 3.6.0)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.6.0)
 dplyr       * 0.8.0.1 2019-02-15 [1] CRAN (R 3.6.0)
 fs            1.3.0   2019-05-02 [1] CRAN (R 3.6.0)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 pillar        1.3.1   2018-12-15 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.3   2019-03-20 [1] CRAN (R 3.6.0)
 pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.6.0)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
 processx      3.3.0   2019-03-10 [1] CRAN (R 3.6.0)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
 purrr         0.3.2   2019-03-15 [1] CRAN (R 3.6.0)
 R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
 Rcpp          1.0.1   2019-03-17 [1] CRAN (R 3.6.0)
 remotes       2.0.4   2019-04-10 [1] CRAN (R 3.6.0)
 rlang         0.3.4   2019-04-07 [1] CRAN (R 3.6.0)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 testthat      2.1.1   2019-04-23 [1] CRAN (R 3.6.0)
 tibble        2.1.1   2019-03-16 [1] CRAN (R 3.6.0)
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.6.0)
 usethis       1.5.0   2019-04-07 [1] CRAN (R 3.6.0)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
cderv commented 5 years ago

A Related (?) Problem

@lianos, regarding this, I think this is linked to https://github.com/tidyverse/dplyr/issues/4347 also.

In last dbplyr version, there was a big change regarding sql translation for [[ and $ (https://dbplyr.tidyverse.org/news/index.html#breaking-changes) that allow new ways (https://www.tidyverse.org/articles/2019/04/dbplyr-1-4-0/#sql-translation)

I think this is related to the need of !! to unquote value before dbplyr translate to SQL.

@jakeybob, the !! can be useful here to. See this reprex:

library(dplyr)
library(dbplyr)

cols <- c("vs", "am", "gear", "carb")
nums <- c(4, 5)
num <- 4

mtcars2 <- tbl_lazy(mtcars, con = simulate_sqlite())

mtcars2 %>%
  filter(gear %in% nums, 
         gear == num) %>%
  filter_at(cols, any_vars(. %in% !!nums)) %>%
  filter_at(cols, any_vars(. == !!num))
#> <SQL>
#> SELECT *
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `df`
#> WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 4.0)))
#> WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
#> WHERE (`vs` = 4.0 OR `am` = 4.0 OR `gear` = 4.0 OR `carb` = 4.0)

Created on 2019-05-09 by the reprex package (v0.2.1.9000)

nums being outside of the data.frame, it could make sense that it requires !! to unquote before dbplyr translation comes into play (and otherwise consider that nums and num are DB variable). I see !! as a way to tell dbplyr to look for this variable in R. But here, we would need to put !! everywhere, and also it worked before. So, just a thought...

As it works for filter without any_vars and with cols, any_vars behavior may be at play here. The documentation of any_vars(expr) tells us

expr A predicate expression. This variable supports unquoting and will be evaluated in the context of the data frame. It should return a logical vector. This argument is automatically quoted and later evaluated in the context of the data frame. It supports unquoting. See vignette("programming") for an introduction to these concepts.

it may be related to any_vars quoting mechanism with dbplyr. nums and num in any_vars must be evaluated from outside the data.frame. !! tells to evaluate before.

All this does not conclude if it is a bug or breaking change that is intended. It is just hints toward better understanding of what is going here, with a workaround (!!) to make it work.

Hope it helps.

jakeybob commented 5 years ago

Thanks @lianos and @cderv, I should've thought about un/quoting! 😜

If it's not a bug it would feel like an odd change to me, as the unquoting is not required when referencing objects in the non-scoped filter function, or when using filter_at and any_vars on a regular non-database tibble. It's not listed in the breaking changes for 1.4.0, unless it's somehow related to the changes in subsetting(?)

Anyway, thanks very much for pointing this out. I think I'll stick with 1.3.0 until this is either fixed or confirmed as an intended change, but if it turns out to be the latter at least I now know how to fix my code. 😆

hadley commented 5 years ago

I’d say that the filter_at behaviour is a bug. Unfortunately @lianos confused the issue by discussing an unrelated change in behaviour.

lianos commented 5 years ago

@cderv thanks for pointing out the relationship of what I thought was a related problem to the breaking change note in dbplyr, that was quite helpful.

@hadley ouch, that stings a bit. I did first offer a fix to the problem, then tried to help tie this together with something else that looked similar to my naive eye in the hopes it would help the devs out. My mistake. Fortunately I like to pride myself on having thick skin, so ... "I'll be back" (to try to help again), but perhaps consider how that comment might make someone feel if this was their first attempt dipping their toe into the world of helping out with open source projects.

cderv commented 5 years ago

The issue appeared after efd2c178ee3170bdb2779632d77023f1dc2b3b6a when partial_eval was refactored. The refactoring revealed some issues with any_vars / all_vars and dbplyr

Minimal reprex to help debug shows it concerns any_vars and all_vars

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

cols <- c("vs", "am")

mtcars2 <- tbl_lazy(mtcars, con = simulate_sqlite())

num <- 1
mtcars2 %>%
  filter_at(cols, any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` = `num` OR `am` = `num`)

minimum <- 150
mtcars2 %>%
  filter_at(cols, all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` > `minimum` AND `am` > `minimum`)

mtcars2 %>%
  filter_all(all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` > `minimum` AND `cyl` > `minimum` AND `disp` > `minimum` AND `hp` > `minimum` AND `drat` > `minimum` AND `wt` > `minimum` AND `qsec` > `minimum` AND `vs` > `minimum` AND `am` > `minimum` AND `gear` > `minimum` AND `carb` > `minimum`)

mtcars2 %>%
  filter_all(any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num` OR `disp` = `num` OR `hp` = `num` OR `drat` = `num` OR `wt` = `num` OR `qsec` = `num` OR `vs` = `num` OR `am` = `num` OR `gear` = `num` OR `carb` = `num`)

Created on 2019-05-10 by the reprex package (v0.2.1.9000)

The issue may be with quo_reduce in dplyr that powers filter_* when using any_vars and all_vars. The quosure is built with an environment (base_env()) that is not parent of caller_env() so the num symbol is not found, hence not replaced in the new quosure created by partial evaluation https://github.com/tidyverse/dbplyr/blob/56afe4045dd6b31230248561fd714b9a760f928f/R/partial-eval.R#L82-L91

I share here some details about what I found

mtcars2 <- dbplyr::tbl_lazy(mtcars, con = dbplyr::simulate_sqlite())
cols <- c("vs", "am")
num <- 4

if we go into detail, the part of filter_at that deals with any_vars creates a quosure where then env is base_env()

syms <- dplyr:::tbl_at_syms(mtcars2, cols)
pred <- dplyr:::apply_filter_syms(dplyr::any_vars(. == num), syms, mtcars)
pred
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env:  package:base

from filter.tbl_lazy, this env is kept after patial_eval_dots is applied

dots <- rlang::quos(!!pred)
dbplyr:::partial_eval_dots(dots, vars = dbplyr:::op_vars(mtcars2))
#> [[1]]
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env:  package:base

from partial_eval_dots, the env of the quosure (base_env()) is passed in partial_eval

dbplyr:::partial_eval(rlang::get_expr(dots[[1]]), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> (~vs == num) | ~am == num
dbplyr:::partial_eval(rlang::sym("num"), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> num
#`  `num` is in `global_env()` and won't be found in parent of `base_env()`
exists("num", envir = rlang::base_env(), inherits = TRUE)
#> [1] FALSE
rlang::env_has(rlang::base_env(), "num", inherit = TRUE)
#>   num 
#> FALSE

There is an environment issue with the quosure built by dplyr:::apply_filter_syms() and how dbplyr resolves what is from R and what is in the DB using partial_eval() I am not sure the issue is in dbplyr. It could also be in dplyr when building the quosure using a joiner function that sets the env to base_env() through dplyr:::quo_reduce().

I hope this little investigation about the source of the bug helps fix it. I am not currently totally at ease with all the tidyeval interaction to make a PR directly. First try broke some tests 🙄

keithmcnulty commented 5 years ago

I am seeing what seems like a related issue. Using a function we use for easy Oracle date translation:

ora_to_date <- function(x) {
  if (length(x) != 1L) stop("`x` must be of length 1")

  if (!any(class(x) %in% c("Date", "POSIXct", "POSIXlt"))) {
    stop("`x` must be of class \"Date\", \"POSIXct\", or \"POSIXlt\"")
  }

  dbplyr::build_sql("to_date(", format(x, "%Y%m%d"), ", 'YYYYMMDD')", con = simulate_dbi())
}

and with

start <- as.Date("2010-01-01")

we get ora_to_date(start) returned as:

to_date('20100101', 'YYYYMMDD')

But when we pipe dplyr::filter(DATE > ora_to_date(start)) . we get this in the SQL translation: "DATE" > ORA_TO_DATE('2010-01-01')

Strangely, if we don't use the ora_to_date() function and just pipe the underlying code:

dplyr::filter(
  DATE > dbplyr::build_sql("to_date(",  format(start, "%Y%m%d"), ", 'YYYYMMDD')",  con = simulate_dbi())
)

it seems to work fine:

"DATE" > to_date('20100101', 'YYYYMMDD')

cderv commented 5 years ago

@keithmcnulty I don't think what you describe is related. It is by design that dbplyr passes to the database the functions that it does not know a translation for. In this case, ora_to_date has no known translation in dbplyr so it is passed as-is in the SQL. See the documentation about translation and its vignette

All other functions will be preserved as is.

So you either need to provide a new translation for your custom function, or evaluate the R code before letting dbplyr translate . For the later, see how it behave when evaluating using !!

ora_to_date <- function(x) {
  if (length(x) != 1L) stop("`x` must be of length 1")

  if (!any(class(x) %in% c("Date", "POSIXct", "POSIXlt"))) {
    stop("`x` must be of class \"Date\", \"POSIXct\", or \"POSIXlt\"")
  }

  dbplyr::build_sql("to_date(", format(x, "%Y%m%d"), ", 'YYYYMMDD')", con = dbplyr::simulate_dbi())
}

start <- as.Date("2010-01-01")

library(dplyr, warn.conflicts = FALSE)
dbplyr::lazy_frame(date = as.Date("2010-02-01"), con = dbplyr::simulate_oracle()) %>%
  filter(date > ora_to_date(start))
#> <SQL>
#> SELECT *
#> FROM (`df`) 
#> WHERE (`date` > ora_to_date('2010-01-01'))
dbplyr::lazy_frame(date = as.Date("2010-02-01"), con = dbplyr::simulate_oracle()) %>%
  filter(date > !! ora_to_date(start))
#> <SQL>
#> SELECT *
#> FROM (`df`) 
#> WHERE (`date` > to_date('20100101', 'YYYYMMDD'))

Created on 2019-05-28 by the reprex package (v0.3.0.9000)

Hope it helps

keithmcnulty commented 5 years ago

@cderv yes you are correct. I now see that there needs to be some sort of namespacing available. Originally my ora_to_date() function was in a package we built that worked with 1.3.0 when called and namespaced, but we then adjusted it to deal with the new default con = NULL but tested it out as a local function without any namespacing. Apologies for the distraction and thanks for helping me spot this.

hadley commented 5 years ago

Thanks for the exploration @cderv! I'll take a look next week.

hadley commented 5 years ago

@lionel- do you remember why dplyr:::apply_filter_syms() creates a quosure with the base environment, rather than the environment of the original quosure?

(@cderv I think your analysis is correct and this is actually a dplyr bug)

lionel- commented 5 years ago

The base environment seems correct to me, as it is meant to resolve & or |. On the other hand, the operands should be quosures from the user environment, as captured by all_vars() and any_vars().

hadley commented 5 years ago

Oooh so it's probably partial_eval() not unpeeling the nested quosures correctly.

hadley commented 5 years ago

Hmmm, looks like partial_eval() already handles this correctly:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

num <- 4
mtcars2 <- tbl_lazy(mtcars)
mtcars2 %>% 
  filter_at(c("mpg", "cyl"), any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)

# Construct quosure by hand
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))
quo
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env:  package:base
mtcars2 %>% 
  filter(!!quo)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)

# _but_ explicitly calling partial_eval works 
partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env:  package:base

mtcars2 %>% 
  filter(!!partial_eval(quo, names(mtcars)))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = 4.0 OR `cyl` = 4.0)

Created on 2019-05-31 by the reprex package (v0.2.1.9000)

hadley commented 5 years ago

Oh the problem is specifically with partial_eval_dots()

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

num <- 4
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))

partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env:  package:base

dbplyr:::partial_eval_dots(quos(!!quo), names(mtcars))
#> [[1]]
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env:  package:base

Created on 2019-05-31 by the reprex package (v0.2.1.9000)

hadley commented 5 years ago

Looks like a 2 line change to partial_eval() fixes this 😄