tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org
Other
670 stars 57 forks source link

`slice()` unnecessarily evaluates `...` twice for each group #377

Closed eutwt closed 2 years ago

eutwt commented 2 years ago

slice() could be made faster when the arguments in ... are expensive to compute by evaluating ... only once per group. In the example below where the argument to slice takes one second to compute and there are two groups, the code takes ~4 seconds to run instead of ~2.

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2
library(data.table, warn.conflicts = FALSE)

dt <- data.table(id = c(1, 2))

return_1 <- function() {Sys.sleep(1); 1}

library(tictoc)
tic()
dt %>% 
  lazy_dt() %>% 
  group_by(id) %>% 
  slice(return_1())
#> Source: local data table [2 x 1]
#> Groups: id
#> Call:   `_DT1`[`_DT1`[, .I[return_1()[between(return_1(), -.N, .N)]], 
#>     by = .(id)]$V1]
#> 
#>      id
#>   <dbl>
#> 1     1
#> 2     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
toc()
#> 4.074 sec elapsed

tic()
dt[dt[,
  {
    slice_ind <- return_1()
    .I[slice_ind[between(slice_ind, -.N, .N)]]
  },
  by = .(id)
]$V1]
#>    id
#> 1:  1
#> 2:  2
toc()
#> 2.014 sec elapsed

Created on 2022-07-19 by the reprex package (v2.0.1)

eutwt commented 2 years ago

Oops. I missed that, thanks to https://github.com/Rdatatable/data.table/pull/4353, this can be fixed more easily by using nomatch = NULL once data.table releases and we bump the required version. Probably not worth fixing before then

markfairbanks commented 2 years ago

I sort of still like the idea of doing this one 🤷‍♂️

data.table seems to be pretty slow getting this release out. If we release before they do it'd be nice to have.