markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

Use of tidytable functions before `dplyr::collect()` #730

Closed by dibbles21 1 year ago

dibbles21 commented 1 year ago

Hi Mark,

Perhaps this isn't a tidytable issue, but I find that tidytable functions don't work when querying databases/arrow files if they are used before dplyr::collect(). In those cases I have to explicitly call the dplyr versions. Is this something that could potentially work with tidytable?

Thanks,

Dan

markfairbanks commented 1 year ago

Unfortunately this isn't something that can work with tidytable. In these cases I would just recommend using dplyr/dbplyr/arrow.

One option is to run the query with dplyr/dbplyr first, then call unloadNamespace() on dbplyr and dplyr to unload those packages before loading tidytable.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Querying a fake database
db_table <- memdb_frame(x = 1:3, y = c("a", "a", "b"))

df <- db_table %>%
  select(x, y) %>%
  collect()

# Switch over to tidytable
unloadNamespace("dbplyr")
unloadNamespace("dplyr")
library(tidytable, warn.conflicts = FALSE)

df %>%
  mutate(double_x = x * 2)
#> # A tidytable: 3 × 3
#>       x y     double_x
#>   <int> <chr>    <dbl>
#> 1     1 a            2
#> 2     2 a            4
#> 3     3 b            6
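A variation that avoids unloading anything is to namespace-qualify the dplyr verbs that run before collect() and leave the unqualified verbs to tidytable. This is only a minimal sketch of the workaround described in the question, not a tidytable feature. It assumes dplyr, dbplyr, RSQLite and tidytable are all installed, and that tidytable re-exports the %>% pipe.

```r
library(tidytable, warn.conflicts = FALSE)

# Querying a fake in-memory database (memdb_frame() needs RSQLite installed)
db_table <- dbplyr::memdb_frame(x = 1:3, y = c("a", "a", "b"))

# Before collect(): call the dplyr generics explicitly so they dispatch
# to the dbplyr methods and are translated to SQL
df <- db_table %>%
  dplyr::select(x, y) %>%
  dplyr::collect()

# After collect(): df is a local data frame, so the unqualified verb
# resolves to tidytable's mutate()
result <- df %>%
  mutate(double_x = x * 2)
```

The trade-off is that every verb before collect() has to carry the dplyr:: prefix, so legacy pipelines need small edits rather than none.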

Another option would be to use dtplyr, which allows you to continue using the piping workflow. Though dtplyr has less functionality than tidytable, it integrates much better with packages like dbplyr/arrow.

FYI I am a co-author of dtplyr, so I am working on getting more functions into dtplyr. I doubt it will ever have as many features as tidytable, but hopefully we can get it somewhat close.

Also if you do use dtplyr, I would recommend installing the development version from GitHub - we are in the process of releasing a new version to CRAN.

# Install the latest version
# devtools::install_github("tidyverse/dtplyr")
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(dtplyr)

db_table <- memdb_frame(x = 1:3, y = c("a", "a", "b"))

# Using dbplyr & dtplyr together
df <- db_table %>%
  select(x, y) %>%
  collect() %>%
  lazy_dt() %>%
  mutate(double_x = x * 2)

df
#> Source: local data table [3 x 3]
#> Call:   copy(`_DT1`)[, `:=`(double_x = x * 2)]
#> 
#>       x y     double_x
#>   <int> <chr>    <dbl>
#> 1     1 a            2
#> 2     2 a            4
#> 3     3 b            6
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
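As the last line of that output notes, a lazy_dt pipeline is not evaluated until it is converted. A minimal sketch of that final step, using a local data frame in place of the collected database table (an assumption made to keep the example self-contained):

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

# Stand-in for the result of collect(); any local data frame works here
local_df <- data.frame(x = 1:3, y = c("a", "a", "b"))

lazy_result <- local_df %>%
  lazy_dt() %>%
  mutate(double_x = x * 2)

# Inspect the data.table call that dtplyr generated
show_query(lazy_result)

# Materialize the result; as.data.table() or as.data.frame() also work
result <- as_tibble(lazy_result)
```

Converting once at the end of the pipeline lets dtplyr translate the whole chain of verbs into a single data.table expression.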

Hope this helps - if you have any questions let me know.

dibbles21 commented 1 year ago

Thank you Mark for your response, it's great to know what's possible. For now I will continue using dplyr functions before collect. What I love about tidytable is that I don't have to change my legacy dplyr code, I just need to load tidytable after dplyr.

Do you have a rough idea when the next dtplyr version will go to CRAN? It's company policy to only use CRAN packages and not dev versions 😁.

markfairbanks commented 1 year ago

> Do you have a rough idea when the next dtplyr version will go to CRAN? It's company policy to only use CRAN packages and not dev versions

Sometime in the next week I would guess. It's been submitted, we're just waiting on CRAN approval.

dibbles21 commented 1 year ago

Thank you

markfairbanks commented 1 year ago

New CRAN version of dtplyr is out now 🥳

dibbles21 commented 1 year ago

Thanks!!