Closed jakeybob closed 5 years ago
We just ran into perhaps a related problem in parsing code into SQL queries, so I'll add some information here.
One workaround my colleague @tomsing1 found which also works here was to inject !!
on the right hand side of the =
in the filter_at
calls.
I'm no expert in NSE and tideval, but I don't think it's typically expected to work that way(?).
For instance, using dbplyr 1.4.0, using this approach to fix to @jakeybob's problem looks like:
test <- mtcars2 %>%
filter(gear %in% nums, gear == num) %>%
filter_at(cols, any_vars(. %in% !!nums)) %>%
filter_at(cols, any_vars(. == !!num))
Which will give this SQL
SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
WHERE (`vs` = 1.0 OR `am` = 1.0 OR `gear` = 1.0 OR `carb` = 1.0)
I think our problem might be related, because after upgrading to dbplyr_1.4.0, something changed in the translation of dplyr syntax to SQL.
To cast our problem into one that uses the same example: we had some parameters stored in a list (or a 1-row tibble), and we are accessing those values "inline" of our dplyr pipeline.
For instance, like this:
params <- list(num = 1, nums = 4:5)
filter(mtcars2, gear %in% params$nums, gear == params$num)
Generates this SQL:
SELECT *
FROM `mtcars`
WHERE ((`gear` IN (1.0 AS `num`, (4, 5) AS `nums`).(4.0, 5.0)) AND (`gear` = (1.0 AS `num`, (4, 5) AS `nums`).1.0))
Which will fail on collect()
But this works:
filter(mtcars2, gear %in% !!params$nums, gear == !!params$num)
Because it will generate this SQL
SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4, 5)) AND (`gear` = 1.0))
─ Session info ────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.6.0 (2019-04-26)
os macOS Mojave 10.14.4
system x86_64, darwin15.6.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2019-05-08
─ Packages ────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.0)
callr 3.2.0 2019-03-15 [1] CRAN (R 3.6.0)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
dbplyr * 1.4.0 2019-04-23 [1] CRAN (R 3.6.0)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.0.2 2019-04-08 [1] CRAN (R 3.6.0)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.6.0)
dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.6.0)
fs 1.3.0 2019-05-02 [1] CRAN (R 3.6.0)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.6.0)
pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.6.0)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.6.0)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
processx 3.3.0 2019-03-10 [1] CRAN (R 3.6.0)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.6.0)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.6.0)
remotes 2.0.4 2019-04-10 [1] CRAN (R 3.6.0)
rlang 0.3.4 2019-04-07 [1] CRAN (R 3.6.0)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
testthat 2.1.1 2019-04-23 [1] CRAN (R 3.6.0)
tibble 2.1.1 2019-03-16 [1] CRAN (R 3.6.0)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
usethis 1.5.0 2019-04-07 [1] CRAN (R 3.6.0)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
A Related (?) Problem
@lianos, regarding this, I think this is linked to https://github.com/tidyverse/dplyr/issues/4347 also.
In last dbplyr version, there was a big change regarding sql translation for [[
and $
(https://dbplyr.tidyverse.org/news/index.html#breaking-changes) that allow new ways (https://www.tidyverse.org/articles/2019/04/dbplyr-1-4-0/#sql-translation)
I think this is related to the need of !!
to unquote value before dbplyr translate to SQL.
@jakeybob, the !!
can be useful here to. See this reprex:
library(dplyr)
library(dbplyr)
cols <- c("vs", "am", "gear", "carb")
nums <- c(4, 5)
num <- 4
mtcars2 <- tbl_lazy(mtcars, con = simulate_sqlite())
mtcars2 %>%
filter(gear %in% nums,
gear == num) %>%
filter_at(cols, any_vars(. %in% !!nums)) %>%
filter_at(cols, any_vars(. == !!num))
#> <SQL>
#> SELECT *
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `df`
#> WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 4.0)))
#> WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
#> WHERE (`vs` = 4.0 OR `am` = 4.0 OR `gear` = 4.0 OR `carb` = 4.0)
Created on 2019-05-09 by the reprex package (v0.2.1.9000)
nums
being outside of the data.frame, it could make sense that it requires !!
to unquote before dbplyr translation comes into play (and otherwise consider that nums
and num
are DB variable). I see !!
as a way to tell dbplyr to look for this variable in R. But here, we would need to put !!
everywhere, and also it worked before. So, just a thought...
As it works for filter
without any_vars
and with cols
, any_vars
behavior may be at play here.
The documentation of any_vars(expr)
tells us
expr A predicate expression. This variable supports unquoting and will be evaluated in the context of the data frame. It should return a logical vector. This argument is automatically quoted and later evaluated in the context of the data frame. It supports unquoting. See vignette("programming") for an introduction to these concepts.
it may be related to any_vars
quoting mechanism with dbplyr. nums
and num
in any_vars
must be evaluated from outside the data.frame. !!
tells to evaluate before.
All this does not conclude if it is a bug or breaking change that is intended. It is just hints toward better understanding of what is going here, with a workaround (!!
) to make it work.
Hope it helps.
Thanks @lianos and @cderv, I should've thought about un/quoting! 😜
If it's not a bug it would feel like an odd change to me, as the unquoting is not required when referencing objects in the non-scoped filter
function, or when using filter_at
and any_vars
on a regular non-database tibble. It's not listed in the breaking changes for 1.4.0, unless it's somehow related to the changes in subsetting(?)
Anyway, thanks very much for pointing this out. I think I'll stick with 1.3.0 until this is either fixed or confirmed as an intended change, but if it turns out to be the latter at least I now know how to fix my code. 😆
I’d say that the filter_at
behaviour is a bug. Unfortunately @lianos confused the issue by discussing an unrelated change in behaviour.
@cderv thanks for pointing out the relationship of what I thought was a related problem to the breaking change note in dbplyr, that was quite helpful.
@hadley ouch, that stings a bit. I did first offer a fix to the problem, then tried to help tie this together with something else that looked similar to my naive eye in the hopes it would help the devs out. My mistake. Fortunately I like to pride myself on having thick skin, so ... "I'll be back" (to try to help again), but perhaps consider how that comment might make someone feel if this was their first attempt dipping their toe into the world of helping out with open source projects.
The issue appeared after efd2c178ee3170bdb2779632d77023f1dc2b3b6a when partial_eval
was refactored. The refactoring revealed some issues with any_vars
/ all_vars
and dbplyr
Minimal reprex to help debug shows it concerns any_vars
and all_vars
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
cols <- c("vs", "am")
mtcars2 <- tbl_lazy(mtcars, con = simulate_sqlite())
num <- 1
mtcars2 %>%
filter_at(cols, any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` = `num` OR `am` = `num`)
minimum <- 150
mtcars2 %>%
filter_at(cols, all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` > `minimum` AND `am` > `minimum`)
mtcars2 %>%
filter_all(all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` > `minimum` AND `cyl` > `minimum` AND `disp` > `minimum` AND `hp` > `minimum` AND `drat` > `minimum` AND `wt` > `minimum` AND `qsec` > `minimum` AND `vs` > `minimum` AND `am` > `minimum` AND `gear` > `minimum` AND `carb` > `minimum`)
mtcars2 %>%
filter_all(any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num` OR `disp` = `num` OR `hp` = `num` OR `drat` = `num` OR `wt` = `num` OR `qsec` = `num` OR `vs` = `num` OR `am` = `num` OR `gear` = `num` OR `carb` = `num`)
Created on 2019-05-10 by the reprex package (v0.2.1.9000)
The issue may be with quo_reduce
in dplyr
that powers filter_*
when using any_vars
and all_vars
. The quosure is built with an environment (base_env()
) that is not parent of caller_env()
so the num
symbol is not found, hence not replaced in the new quosure created by partial evaluation
https://github.com/tidyverse/dbplyr/blob/56afe4045dd6b31230248561fd714b9a760f928f/R/partial-eval.R#L82-L91
I share here some details about what I found
mtcars2 <- dbplyr::tbl_lazy(mtcars, con = dbplyr::simulate_sqlite())
cols <- c("vs", "am")
num <- 4
if we go into detail, the part of filter_at
that deals with any_vars
creates a quosure
where then env is base_env()
syms <- dplyr:::tbl_at_syms(mtcars2, cols)
pred <- dplyr:::apply_filter_syms(dplyr::any_vars(. == num), syms, mtcars)
pred
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env: package:base
from filter.tbl_lazy
, this env is kept after patial_eval_dots
is applied
dots <- rlang::quos(!!pred)
dbplyr:::partial_eval_dots(dots, vars = dbplyr:::op_vars(mtcars2))
#> [[1]]
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env: package:base
from partial_eval_dots
, the env of the quosure (base_env()
) is passed in partial_eval
dbplyr:::partial_eval(rlang::get_expr(dots[[1]]), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> (~vs == num) | ~am == num
dbplyr:::partial_eval(rlang::sym("num"), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> num
#` `num` is in `global_env()` and won't be found in parent of `base_env()`
exists("num", envir = rlang::base_env(), inherits = TRUE)
#> [1] FALSE
rlang::env_has(rlang::base_env(), "num", inherit = TRUE)
#> num
#> FALSE
There is an environment issue with the quosure built by
dplyr:::apply_filter_syms()
and how dbplyr
resolves what is from R and
what is in the DB using partial_eval()
I am not sure the issue is in dbplyr. It could also be in dplyr when
building the quosure using a joiner function that sets the env to base_env()
through dplyr:::quo_reduce()
.
I hope this little investigation about the source of the bug helps fix it. I am not currently totally at ease with all the tidyeval interaction to make a PR directly. First try broke some tests 🙄
I am seeing what seems like a related issue. Using a function we use for easy Oracle date translation:
ora_to_date <- function(x) {
if (length(x) != 1L) stop("`x` must be of length 1")
if (!any(class(x) %in% c("Date", "POSIXct", "POSIXlt"))) {
stop("`x` must be of class \"Date\", \"POSIXct\", or \"POSIXlt\"")
}
dbplyr::build_sql("to_date(", format(x, "%Y%m%d"), ", 'YYYYMMDD')", con = simulate_dbi())
}
and with
start <- as.Date("2010-01-01")
we get ora_to_date(start)
returned as:
to_date('20100101', 'YYYYMMDD')
But when we pipe dplyr::filter(DATE > ora_to_date(start))
. we get this in the SQL translation:
"DATE" > ORA_TO_DATE('2010-01-01')
Strangely, if we don't use the ora_to_date()
function and just pipe the underlying code:
dplyr::filter(
DATE > dbplyr::build_sql("to_date(", format(start, "%Y%m%d"), ", 'YYYYMMDD')", con = simulate_dbi())
)
it seems to work fine:
"DATE" > to_date('20100101', 'YYYYMMDD')
@keithmcnulty I don't think what you describe is related. It is by design that dbplyr
passes to the database the functions that it does not know a translation for. In this case, ora_to_date
has no known translation in dbplyr
so it is passed as-is in the SQL. See the documentation about translation and its vignette
All other functions will be preserved as is.
So you either need to provide a new translation for your custom function, or evaluate the R code before letting dbplyr translate .
For the later, see how it behave when evaluating using !!
ora_to_date <- function(x) {
if (length(x) != 1L) stop("`x` must be of length 1")
if (!any(class(x) %in% c("Date", "POSIXct", "POSIXlt"))) {
stop("`x` must be of class \"Date\", \"POSIXct\", or \"POSIXlt\"")
}
dbplyr::build_sql("to_date(", format(x, "%Y%m%d"), ", 'YYYYMMDD')", con = dbplyr::simulate_dbi())
}
start <- as.Date("2010-01-01")
library(dplyr, warn.conflicts = FALSE)
dbplyr::lazy_frame(date = as.Date("2010-02-01"), con = dbplyr::simulate_oracle()) %>%
filter(date > ora_to_date(start))
#> <SQL>
#> SELECT *
#> FROM (`df`)
#> WHERE (`date` > ora_to_date('2010-01-01'))
dbplyr::lazy_frame(date = as.Date("2010-02-01"), con = dbplyr::simulate_oracle()) %>%
filter(date > !! ora_to_date(start))
#> <SQL>
#> SELECT *
#> FROM (`df`)
#> WHERE (`date` > to_date('20100101', 'YYYYMMDD'))
Created on 2019-05-28 by the reprex package (v0.3.0.9000)
Hope it helps
@cderv yes you are correct. I now see that there needs to be some sort of namespacing available. Originally my ora_to_date()
function was in a package we built that worked with 1.3.0 when called and namespaced, but we then adjusted it to deal with the new default con = NULL
but tested it out as a local function without any namespacing. Apologies for the distraction and thanks for helping me spot this.
Thanks for the exploration @cderv! I'll take a look next week.
@lionel- do you remember why dplyr:::apply_filter_syms()
creates a quosure with the base environment, rather than the environment of the original quosure?
(@cderv I think your analysis is correct and this is actually a dplyr bug)
The base environment seems correct to me, as it is meant to resolve &
or |
. On the other hand, the operands should be quosures from the user environment, as captured by all_vars()
and any_vars()
.
Oooh so it's probably partial_eval()
not unpeeling the nested quosures correctly.
Hmmm, looks like partial_eval()
already handles this correctly:
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
num <- 4
mtcars2 <- tbl_lazy(mtcars)
mtcars2 %>%
filter_at(c("mpg", "cyl"), any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)
# Construct quosure by hand
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))
quo
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env: package:base
mtcars2 %>%
filter(!!quo)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)
# _but_ explicitly calling partial_eval works
partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env: package:base
mtcars2 %>%
filter(!!partial_eval(quo, names(mtcars)))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = 4.0 OR `cyl` = 4.0)
Created on 2019-05-31 by the reprex package (v0.2.1.9000)
Oh the problem is specifically with partial_eval_dots()
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
num <- 4
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))
partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env: package:base
dbplyr:::partial_eval_dots(quos(!!quo), names(mtcars))
#> [[1]]
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env: package:base
Created on 2019-05-31 by the reprex package (v0.2.1.9000)
Looks like a 2 line change to partial_eval()
fixes this 😄
dbplyr 1.4.0 appears to translate
filter_at()
calls incorrectly, translating the variable name themselves, rather than their values. See following example:In dbplyr 1.3.0 this gives the following SQL query:
but in dbplyr 1.4.0 the variables are translated literally as "num" and "nums" in the SQL translation of the
filter_at()
function, although they are translated as expected from thefilter()
function.This then leads to an error when querying the database via
collect()
as there are nonum
ornums
variables on the database end.