tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

compute() should enforce computation #350

Closed mgacc0 closed 2 years ago

mgacc0 commented 2 years ago

I suspect dplyr::compute() is not enforcing computation.

For example:

library(dplyr)
library(dtplyr)

df1 <- tibble(
  letter = sample(letters, 1e8, replace = TRUE),
  n = sample(100, 1e8, replace = TRUE)
) %>%
  lazy_dt() %>%
  group_by(letter) %>%
  summarise(n = sum(n))

The call to compute() returns immediately (without doing any computation):

{
  tictoc::tic()
  df_m <- df1 %>%
    compute()
  tictoc::toc()
}
# 0 sec elapsed

To force execution, I'm provisionally using this workaround:

df_m_2 <- df1 %>%
  collect() %>%
  lazy_dt()

Both of these expectations should pass:

library(testthat)
expect_s3_class(df_m, "dtplyr_step_first")
# Error: `df_m` inherits from 'dtplyr_step_group'/'dtplyr_step' not 'dtplyr_step_first'.
expect_s3_class(df_m_2, "dtplyr_step_first")

markfairbanks commented 2 years ago

This is mentioned in the documentation here. The point of compute() in dtplyr is to "generate an intermediate assignment in the translation", so this is working as intended.

collect(), as_tibble(), as.data.frame(), and as.data.table() actually enforce the computation.
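
A minimal sketch of the difference described above, on a small stand-in data set (the toy data and variable names are assumptions for illustration, not from the report). compute() leaves the pipeline lazy, while collect() and as.data.table() run it and return a concrete table:

```r
library(dplyr)
library(dtplyr)
library(data.table)

# Small stand-in for the 1e8-row example (illustrative data only)
lazy <- tibble(letter = c("a", "b", "a"), n = c(1L, 2L, 3L)) %>%
  lazy_dt() %>%
  group_by(letter) %>%
  summarise(n = sum(n))

# compute() only adds an intermediate assignment to the translation;
# the result is still a lazy dtplyr step
still_lazy <- compute(lazy)
inherits(still_lazy, "dtplyr_step")

# show_query() prints the data.table code that would be run
show_query(still_lazy)

# These calls evaluate the pipeline
res_tbl <- collect(lazy)        # tibble
res_dt  <- as.data.table(lazy)  # data.table
```

Checking the class of each result is a quick way to tell whether a given call has actually evaluated the pipeline.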

mgacc0 commented 2 years ago

Then, I will continue using

df1 <- df1 %>%
  collect() %>%
  lazy_dt()

to enforce computation.

As you know, this is a performance optimization (when nrow(df1) > 10M): it caches df1 when it is going to be reused multiple times.
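
The caching pattern above could be wrapped in a small helper. This is only a sketch: materialize() is a hypothetical name, not part of dtplyr, and it uses as.data.table() (one of the functions listed earlier as forcing computation) instead of collect(), on the assumption that skipping the round trip through a tibble is acceptable:

```r
library(dplyr)
library(dtplyr)
library(data.table)

# Hypothetical helper (not a dtplyr function): run the pending pipeline
# once and wrap the concrete result in a fresh lazy_dt()
materialize <- function(x) {
  lazy_dt(as.data.table(x))
}

# Illustrative toy data
cached <- tibble(letter = c("a", "b", "a"), n = c(1L, 2L, 3L)) %>%
  lazy_dt() %>%
  group_by(letter) %>%
  summarise(n = sum(n)) %>%
  materialize()

# The cached object starts a new lazy pipeline on already-computed data
inherits(cached, "dtplyr_step_first")

# Later pipelines reuse the cached result instead of re-running the summarise
big   <- cached %>% filter(n > 2) %>% collect()
total <- cached %>% summarise(total = sum(n)) %>% collect()
```

Whether this is worth it depends on how often the intermediate result is reused, since the materialized table is held in memory.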