tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

dtplyr works in local function, but errors with same function in a package #398

Closed eipi10 closed 2 years ago

eipi10 commented 2 years ago

I'm trying to use dtplyr to speed up some functions in a personal package. However, code that runs properly when defined in a local session throws an error when the same function is defined in a package. If I have the package import data.table (by adding @import data.table), the error goes away, but I thought data.table didn't need to be loaded for dtplyr to work.

Here's a reprex (you don't have access to the package, so I show below how the function is defined in it):

library(tidyverse)
library(dtplyr)
options(tibble.print_min=3)

# Package with a function that uses dtplyr
library(test.dtplyr)

# Here's the function definition in the package:
# #' @export
# my_fnc = function(x) {
#
#   x %>%
#     lazy_dt() %>%
#     group_by(id, year) %>%
#     summarise(across(matches("^v"), sum)) %>%
#     as_tibble()
# }

# Fake data
set.seed(4958)
d = crossing(
  id = str_pad(1:1e4, width=5, side="left", pad="0"),
  year = 2010:2020,
  term = c("Fall", "Spring", "Summer")
) %>% 
  mutate(term = paste(term, year)) %>% 
  mutate(
    v1 = sample(0:15, n(), replace=TRUE),
    v2 = sample(0:5, n(), replace=TRUE),
  )

# This function is exactly the same as in the package, but defined locally
my_fnc2 = function(x) {

  x %>%
    lazy_dt() %>%
    group_by(id, year) %>%
    summarise(across(matches("^v"), sum)) %>%
    as_tibble()
}

# dtplyr
d %>% 
  lazy_dt() %>% 
  group_by(id, year) %>% 
  summarise(across(matches("^v"), sum)) %>% 
  as_tibble()
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 110,000 x 4
#>   id     year    v1    v2
#>   <chr> <int> <int> <int>
#> 1 00001  2010    32     7
#> 2 00001  2011    14     3
#> 3 00001  2012    17     2
#> # ... with 109,997 more rows

# dtplyr with function created locally above
d %>% my_fnc2()
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 110,000 x 4
#>   id     year    v1    v2
#>   <chr> <int> <int> <int>
#> 1 00001  2010    32     7
#> 2 00001  2011    14     3
#> 3 00001  2012    17     2
#> # ... with 109,997 more rows

# dtplyr with exact same function, but called from package test.dtplyr
d %>% my_fnc()
#> Error in .(v1 = sum(v1), v2 = sum(v2)): could not find function "."

Created on 2022-11-10 with reprex v2.0.2

markfairbanks commented 2 years ago

This is because of a quirk with how data.table works. Even though dtplyr is using data.table, you also need to specify in your package that data.table is being used.

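To do this you need to define the variable .datatable.aware <- TRUE once, at the top level, somewhere in your package's R code. (It's an ordinary variable in the package namespace, not an OS environment variable.)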

#' @export
my_fnc = function(x) {

  x %>%
    lazy_dt() %>%
    group_by(id, year) %>%
    summarise(across(matches("^v"), sum)) %>%
    as_tibble()
}

.datatable.aware <- TRUE
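
As a rough illustration of what the flag does: data.table decides whether the calling code is data.table-aware partly by looking for .datatable.aware in the caller's namespace. The sketch below mimics only that lookup with base R. The names is_dt_aware() and pkg_env are hypothetical, a plain environment stands in for a package namespace, and data.table's real check considers more than this (for example, whether the package imports data.table).

```r
# Hypothetical helper (not part of data.table or dtplyr):
# TRUE only if `env` itself defines .datatable.aware as TRUE.
is_dt_aware <- function(env) {
  isTRUE(get0(".datatable.aware", envir = env, inherits = FALSE))
}

# Assumption: a plain environment stands in for a package namespace.
pkg_env <- new.env()
before <- is_dt_aware(pkg_env)

# Same effect as writing `.datatable.aware <- TRUE` at the top level
# of a file in the package's R/ directory.
assign(".datatable.aware", TRUE, envir = pkg_env)
after <- is_dt_aware(pkg_env)

print(c(before = before, after = after))
```

With the package installed, the same helper could presumably be pointed at the real namespace, e.g. is_dt_aware(asNamespace("test.dtplyr")), to confirm the flag was picked up after rebuilding.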

@hadley already submitted a fix for this to data.table, and we're just waiting for their next release to CRAN (hopefully sometime in the next few months).

You can track the issue from our end here.

I'm going to close this issue, but if you have any questions let me know.

eipi10 commented 2 years ago

I'm glad it was an easy fix. Thanks for your help!