Ooooooh, the problem is that dtplyr always creates a unique name for the intermediate table it now needs for the grouped filter:
_DT10 <- `_DT2`[between(distance, 200, 300) & !is.na(air_time)]
head(`_DT10`[`_DT10`[, .I[sum(!is.na(flight)) > 3000], by = .(origin,
dest)]$V1, .(num_flts = sum(!is.na(flight)), dist = round(mean(distance,
na.rm = TRUE)), avg_delay = round(mean(arr_delay, na.rm = TRUE))),
keyby = .(origin, dest)][order(desc(num_flts), desc(avg_delay))],
n = 100L)
And since the name is session-unique, it's incremented between the two tests. Maybe I can make the name unique within a pipeline. I'm having difficulty figuring out how to do this better, because independent pipelines might get combined into one with a left_join(), so the locals have to be globally unique, not just locally unique. Let me know if you have any ideas.
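To illustrate the trade-off (a sketch only, not dtplyr's actual implementation): a session-level counter roughly like this hands out globally unique names, which is also why the name shifts between test runs:
# hypothetical sketch: a counter held in package state that hands out
# session-unique names such as "_DT1", "_DT2", ...
new_dt_name <- local({
  i <- 0L
  function() {
    i <<- i + 1L
    paste0("_DT", i)
  }
})
new_dt_name()  # "_DT1"
new_dt_name()  # "_DT2"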
@hadley Thanks for looking into this. In these tests, I suppose I should have used as.data.frame() to test for equality of the result sets, rather than for equality of the returned dtplyr_step objects.
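Concretely, I mean something like this (a sketch; query_result and dplyr_result stand in for the two pipelines the real tests build):
test_that("query matches the equivalent dplyr pipeline", {
  # compare the materialised results, not the dtplyr_step objects
  expect_equal(
    as.data.frame(query_result),
    as.data.frame(dplyr_result)
  )
})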
I just committed that change (7aa7dbb), but now I'm getting some strange test errors. I'll try to debug. https://travis-ci.org/github/ianmcook/tidyquery/builds/757640217#L3025
Here's a minimal reprex of the type of failure I'm seeing after adding as.data.frame() in 7aa7dbb. If I set up a testthat file like this:
library(dtplyr)
library(dplyr)
iris_dt <- lazy_dt(iris)
test_that("testthat works with dtplyr", {
  iris_dt %>% select(Sepal.Length) %>% as.data.frame()
  succeed()
})
and then run devtools::test(), I get this error:
Error: could not find function "."
If I just run the code outside of test(), it runs with no error. But after test() runs, something changes in the environment that causes the code to fail even when run outside of test().
The traceback is:
9: `[.data.frame`(x, i, j)
8: `[.data.table`(`_DT12`, , .(Sepal.Length))
7: `_DT12`[, .(Sepal.Length)]
6: eval_tidy(quo) at tidyeval.R#11
5: dt_eval(x) at step.R#144
4: as.data.frame(dt_eval(x)) at step.R#144
3: as.data.frame.dtplyr_step(.)
2: as.data.frame(.)
1: iris_dt %>% select(Sepal.Length) %>% as.data.frame()
So [.data.table is dispatching to [.data.frame, which obviously doesn't know how to treat . in this context.
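The same dispatch difference shows up in isolation (run from the global environment, which data.table always treats as data.table-aware):
dt <- data.table::as.data.table(iris)
# `[.data.table` treats .() as list(), so this works
dt[, .(Sepal.Length)]
# `[.data.frame` evaluates .() as an ordinary call in the calling
# environment, where no function named . exists
as.data.frame(dt)[, .(Sepal.Length)]
#> Error: could not find function "."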
I don't know why this behaviour would have changed, but I suspect it's caused by this commit: https://github.com/tidyverse/dtplyr/commit/5f8d51be4121cdfa645285d809c9ae79b68d6fd7.
Yeah, with options(datatable.verbose = TRUE), I see:
cedta decided 'tidyquery' wasn't data.table aware. Here is call stack with [[1L]] applied:
[[1]]
`%>%`
[[2]]
as.data.frame
[[3]]
as.data.frame.dtplyr_step
[[4]]
as.data.frame
[[5]]
dt_eval
[[6]]
eval_tidy
[[7]]
`[`
[[8]]
`[.data.table`
[[9]]
cedta
I can fix this in dtplyr. The problem is that lazy_dt() captures the calling environment, and that's what data.table uses to figure out whether it's enabled or not. It works in dtplyr itself because dtplyr imports data.table, but tidyquery does not. (And it shouldn't need to; I just need to figure out how to tell data.table that we're good.)
Looking at the code in https://github.com/Rdatatable/data.table/blob/master/R/cedta.R, it's not clear how to force this to be true; any approach is going to be hacky. The easiest fix would be for you to include .datatable.aware <- TRUE somewhere in your package while I think about how to handle this.
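.datatable.aware is the documented opt-in for packages that generate data.table code without importing data.table, so on your side it's a one-line, top-level assignment in any R/ file (the file name below is just a convention):
# R/tidyquery-package.R (or any other R/ source file)
# tells data.table's cedta() check that this package emits
# data.table-aware code even though it doesn't import data.table
.datatable.aware <- TRUE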
Fixed in 7aa7dbbb36d784e89e6ec309de7441e0ebc72e1d and 7cc5c71ec93c4784a6ec3eaa821bfde179a0e345. Version 0.2.2 on its way to CRAN.
This test now fails:
Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think the key difference is that one side of the comparison generates one additional intermediate data table name; this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000, the test passes.
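For concreteness, the two sides being compared look roughly like this (reconstructed from the translated data.table code above, not the exact test source); dropping the HAVING clause from the SQL and the grouped filter() from the dplyr pipeline brings them back into agreement:
library(dplyr)
library(dtplyr)
library(tidyquery)
library(nycflights13)

flights_dt <- lazy_dt(flights)

# SQL side: the HAVING clause is translated into a grouped filter
sql_result <- query(
  "SELECT origin, dest, COUNT(flight) AS num_flts
     FROM flights_dt
    WHERE distance BETWEEN 200 AND 300 AND air_time IS NOT NULL
    GROUP BY origin, dest
   HAVING num_flts > 3000"
) %>% as.data.frame()

# dplyr side: the equivalent grouped filter written by hand
dplyr_result <- flights_dt %>%
  filter(between(distance, 200, 300), !is.na(air_time)) %>%
  group_by(origin, dest) %>%
  filter(sum(!is.na(flight)) > 3000) %>%
  summarise(num_flts = sum(!is.na(flight))) %>%
  as.data.frame()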