ianmcook / tidyquery

Query R data frames with SQL
Apache License 2.0
167 stars 12 forks source link

Failing test with next version of dtplyr #17

Closed hadley closed 3 years ago

hadley commented 3 years ago

This test now fails:

  expect_equal(
    query(
      "SELECT origin, dest,
          COUNT(flight) AS num_flts,
          round(AVG(distance)) AS dist,
          round(AVG(arr_delay)) AS avg_delay
          FROM flights_dt
        WHERE distance BETWEEN 200 AND 300
          AND air_time IS NOT NULL
        GROUP BY origin, dest
        HAVING num_flts > 3000
        ORDER BY num_flts DESC, avg_delay DESC
        LIMIT 100;"
    ),
    flights_dt %>%
      filter(between(distance,200,300) & !is.na(air_time)) %>%
      group_by(origin, dest) %>%
      filter(sum(!is.na(flight)) > 3000) %>%
      summarise(
        num_flts = sum(!is.na(flight)),
        dist = round(mean(distance, na.rm = TRUE)),
        avg_delay = round(mean(arr_delay, na.rm = TRUE))
      ) %>%
      ungroup() %>%
      arrange(desc(num_flts), desc(avg_delay)) %>%
      head(100L)
  )

Unfortunately you don't get a particularly informative error (even with local_edition(3)) because the pipeline is rather deep. However, I think this is the key difference:

actual$parent$parent$parent$parent$parent$i vs expected$parent$parent$parent$parent$parent$i
- `\`_DT3\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`
+ `\`_DT4\`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, dest)]$V1`

i.e. expected is generating one additional intermediate data table name than expected — this is probably due to the new grouped filter behaviour. Indeed, if I remove filter(sum(!is.na(flight)) > 3000) and HAVING num_flts > 3000 the test passes

hadley commented 3 years ago

Ooooooh, the problem is that dtplyr always creates a unique name for the intermediate table it now needs for the grouped filter:

_DT10 <- `_DT2`[between(distance, 200, 300) & !is.na(air_time)]
  head(`_DT10`[`_DT10`[, .I[sum(!is.na(flight)) > 3000], by = .(origin, 
    dest)]$V1, .(num_flts = sum(!is.na(flight)), dist = round(mean(distance, 
    na.rm = TRUE)), avg_delay = round(mean(arr_delay, na.rm = TRUE))), 
    keyby = .(origin, dest)][order(desc(num_flts), desc(avg_delay))], 
    n = 100L)

And since the name is session unique, it's incremented between the two tests.

Maybe I can make the name unique with in a pipeline.

hadley commented 3 years ago

I'm having difficulties figuring out how to do this better because independent pipelines might get combined into one with a left_join(), so the locals have to be globally unique, not just locally unique. Let me know if you have any ideas.

ianmcook commented 3 years ago

@hadley Thanks for looking into this. In these tests, I suppose I should have used as.data.frame() to test for equality of the result sets, not for equality of the returned dtplyr_step object.

I just committed that change (7aa7dbb), but now I'm getting some strange test errors. I'll try to debug. https://travis-ci.org/github/ianmcook/tidyquery/builds/757640217#L3025

ianmcook commented 3 years ago

Here's a minimal reprex of the type of failure I'm seeing after adding as.data.frame() in 7aa7dbb:

If set up a testthat file like this:

library(dtplyr)
library(dplyr)

iris_dt <- lazy_dt(iris)

test_that("testthat works with dtplyr", {
  iris_dt %>% select(Sepal.Length) %>% as.data.frame()
  succeed()
})

then I run devtools::test() and I get this error:

Error: could not find function "."

If I just run the code outside of test() it runs with no error. But then after test() runs, something changes in the environment that causes it to fail even running outside of test().

hadley commented 3 years ago

The traceback is:

9: `[.data.frame`(x, i, j)
8: `[.data.table`(`_DT12`, , .(Sepal.Length))
7: `_DT12`[, .(Sepal.Length)]
6: eval_tidy(quo) at tidyeval.R#11
5: dt_eval(x) at step.R#144
4: as.data.frame(dt_eval(x)) at step.R#144
3: as.data.frame.dtplyr_step(.)
2: as.data.frame(.)
1: iris_dt %>% select(Sepal.Length) %>% as.data.frame()

So [.data.table is dispatching to [.data.frame which obviously doesn't know how to treat . in this context.

I don't know why this behaviour would have changed, but I suspect it's caused by this commit: https://github.com/tidyverse/dtplyr/commit/5f8d51be4121cdfa645285d809c9ae79b68d6fd7.

hadley commented 3 years ago

Yeah, with options(datatable.verbose = TRUE), I see:

cedta decided 'tidyquery' wasn't data.table aware. Here is call stack with [[1L]] applied:
[[1]]
`%>%`

[[2]]
as.data.frame

[[3]]
as.data.frame.dtplyr_step

[[4]]
as.data.frame

[[5]]
dt_eval

[[6]]
eval_tidy

[[7]]
`[`

[[8]]
`[.data.table`

[[9]]
cedta

I can fix this in dtplyr.

hadley commented 3 years ago

The problem is that lazy_dt() captures the calling environment, and that's what data.table uses to figure out if it's enabled or not. It works in dtplyr itself because it imports data.table, but tidyquery does not. (And it shouldn't need to; I just need to figure out how to tell data.table that we're good.)

hadley commented 3 years ago

Looking at the code in https://github.com/Rdatatable/data.table/blob/master/R/cedta.R, it's not clear how to force this to be true; any approach is going to be hacky. The easiest fix would be for you to include .datatable.aware <- TRUE somewhere in your package while I think about how to handle this.

ianmcook commented 3 years ago

Fixed in 7aa7dbbb36d784e89e6ec309de7441e0ebc72e1d and 7cc5c71ec93c4784a6ec3eaa821bfde179a0e345. Version 0.2.2 on its way to CRAN.