tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

select(where(...)) does not return what's expected #392

Closed lschneiderbauer closed 2 years ago

lschneiderbauer commented 2 years ago

Calling select() with where() on a lazy data table does not return what I expect:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dtplyr)
#> Warning: package 'dtplyr' was built under R version 4.0.5

data <- dtplyr::lazy_dt(tibble(x=c(1,2)))
data
#> Source: local data table [2 x 1]
#> Call:   `_DT1`
#> 
#>       x
#>   <dbl>
#> 1     1
#> 2     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

data %>% select(where(~all(is.numeric(.))))
#> Source: local data table [0 x 0]
#> Call:   `_DT1`[, 0L]
#> 
#> 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
data %>% select(where(~any(!is.na(.))))
#> Source: local data table [0 x 0]
#> Call:   `_DT1`[, 0L]
#> 
#> 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Created on 2022-09-08 by the reprex package (v2.0.1)

Both calls return an empty table, whereas I expected the data to be kept unchanged, since both conditions are satisfied (the only column is numeric and contains no NA values).

markfairbanks commented 2 years ago

Since dtplyr uses lazy evaluation, the use of where() is not supported. Unfortunately there is no way to know the type of a column in a lazy workflow.

If you install the development version of dtplyr, using where() now raises an error:

# devtools::install_github("tidyverse/dtplyr")

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

data <- lazy_dt(tibble(x=c(1,2)))

data %>% select(where(~all(is.numeric(.))))
#> Error in `select()`:
#> ! The use of `where()` is not supported by dtplyr.

Hope this helps - if you have any questions let me know.
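
Editor's note: one possible workaround, not proposed in the thread itself, is to inspect the column types eagerly and then select by name inside the lazy chain. The sketch below assumes the original data frame (here called `df`, a made-up example) is still available; `numeric_cols` is likewise just an illustrative name.

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

# Hypothetical example data with one numeric and one character column
df <- tibble(x = c(1, 2), y = c("a", "b"))
data <- lazy_dt(df)

# Assumption: `df` is still in scope, so column types can be checked
# eagerly, outside the lazy chain.
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]

# Selecting by name only needs the column names, not the column contents.
result <- data %>%
  select(all_of(numeric_cols)) %>%
  as_tibble()
```

This keeps the chain lazy, at the cost of computing the selection from the source data up front.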

sergiocorreia commented 2 years ago

@markfairbanks thanks for the quick response. I ran into the same issue today and was also surprised by it. This limitation, which is not mentioned anywhere in the documentation, makes it a bit harder to use dtplyr as a drop-in replacement for dplyr.

Perhaps an alternative would be to automatically call as.data.table() when encountering where()? This could even be off by default and enabled only through an option to lazy_dt().

Also, at this point are you aware of other things that can be done in e.g. tidytable that can't be achieved with dtplyr? Thanks!

markfairbanks commented 2 years ago

> Perhaps an alternative would be to automatically call as.data.table() when encountering where()?

Doing something like this would cause problems for users who expect a lazy chain to continue, only to have it evaluate unexpectedly. So unfortunately this won't be possible.

This might have been doable before https://github.com/tidyverse/dtplyr/pull/372, but we no longer automatically convert a data.table object to a lazy_dt() - it was causing too many problems (see https://github.com/tidyverse/dtplyr/issues/312).
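
Editor's note: the evaluation can still be done by hand, which keeps the point where the chain stops being lazy visible in the code. The sketch below is only an illustration with made-up data: it materializes the result with as_tibble(), applies where() in plain dplyr, and wraps the result in lazy_dt() again to continue lazily.

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

# Hypothetical example data with one numeric and one character column
data <- lazy_dt(tibble(x = c(1, 2), y = c("a", "b")))

result <- data %>%
  filter(x > 1) %>%               # still lazy
  as_tibble() %>%                 # explicit evaluation happens here
  select(where(is.numeric)) %>%   # plain dplyr, so where() is available
  lazy_dt() %>%                   # opt back in to lazy evaluation
  mutate(z = x * 2) %>%
  as_tibble()
```

Each round trip copies the data into a new data.table, so this is best reserved for places where selection by column type is really needed.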

> Also, at this point are you aware of other things that can be done in e.g. tidytable that can't be achieved with dtplyr?

Here are a few examples. I don't think the full list is that big though.

sergiocorreia commented 2 years ago

I see, thanks a lot for the detailed explanation, as well as the other pointers!