tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

Issue with arrange with desc and .data pronoun - Can't subset `.data` outside of a data mask context #346

Closed joaofgoncalves closed 2 years ago

joaofgoncalves commented 2 years ago

Hello, I came across some potentially strange behaviour in arrange() when desc() is combined with the .data pronoun. The line below throws the following error: "Error: Can't subset `.data` outside of a data mask context."

arrange(.data$SID, desc(.data$prop))

If desc() is not used within arrange(), the .data pronoun works fine. The issue also only occurs with dtplyr; the same code works fine with dplyr. Here is a reproducible example:

library(dtplyr)
library(dplyr)

x <- data.frame(SID = 1:100, 
                train = sample(c(0,1), 1000, replace = TRUE))

x <- dtplyr::lazy_dt(x, key_by = "SID")

# KO - does not work and generates:
# Error: Can't subset `.data` outside of a data mask context.

tb <- x %>% 
  group_by(.data$SID, .data$train) %>% 
  summarize(prop = n()) %>% 
  mutate(rs = sum(.data$prop)) %>% 
  ungroup() %>%
  mutate(prop = .data$prop / .data$rs) %>% 
  select(-.data$rs) %>% 
  filter(!is.na(.data$train)) %>% 
  filter(.data$prop >= 0.5) %>% 
  group_by(.data$SID) %>% 
  arrange(.data$SID, desc(.data$prop)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-.data$prop) %>% 
  as.data.frame()

# OK - runs fine if the .data pronoun is removed inside desc()

tb <- x %>% 
  group_by(.data$SID, .data$train) %>% 
  summarize(prop = n()) %>% 
  mutate(rs = sum(.data$prop)) %>% 
  ungroup() %>%
  mutate(prop = .data$prop / .data$rs) %>% 
  select(-.data$rs) %>% 
  filter(!is.na(.data$train)) %>% 
  filter(.data$prop >= 0.5) %>% 
  group_by(.data$SID) %>% 
  arrange(.data$SID, desc(prop)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-.data$prop) %>% 
  as.data.frame()

# Also OK - using the .data pronoun without the desc() function

tb <- x %>% 
  group_by(.data$SID, .data$train) %>% 
  summarize(prop = n()) %>% 
  mutate(rs = sum(.data$prop)) %>% 
  ungroup() %>%
  mutate(prop = .data$prop / .data$rs) %>% 
  select(-.data$rs) %>% 
  filter(!is.na(.data$train)) %>% 
  filter(.data$prop >= 0.5) %>% 
  group_by(.data$SID) %>% 
  arrange(.data$SID, .data$prop) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-.data$prop) %>% 
  as.data.frame()

# Also works fine when not using dtplyr

y <- data.frame(SID = 1:100, 
                train = sample(c(0,1), 1000, replace = TRUE))

tb <- y %>% 
  group_by(.data$SID, .data$train) %>% 
  summarize(prop = n()) %>% 
  mutate(rs = sum(.data$prop)) %>% 
  ungroup() %>%
  mutate(prop = .data$prop / .data$rs) %>% 
  select(-.data$rs) %>% 
  filter(!is.na(.data$train)) %>% 
  filter(.data$prop >= 0.5) %>% 
  group_by(.data$SID) %>% 
  arrange(.data$SID, desc(.data$prop)) %>% 
  slice(1) %>%
  ungroup() %>% 
  select(-.data$prop) %>% 
  as.data.frame()

Session info follows below:

print(sessionInfo())
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252  LC_CTYPE=Portuguese_Portugal.1252   
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C                        
[5] LC_TIME=Portuguese_Portugal.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rlang_1.0.1       dplyr_1.0.8       dtplyr_1.2.1.9000

loaded via a namespace (and not attached):
 [1] fansi_1.0.2           assertthat_0.2.1      utf8_1.2.2            crayon_1.5.0         
 [5] R6_2.5.1              DBI_1.1.2             lifecycle_1.0.1       magrittr_2.0.2       
 [9] pillar_1.7.0          cli_3.2.0             rstudioapi_0.13       data.table_1.14.2    
[13] vctrs_0.3.8.9001      generics_0.1.2        ellipsis_0.3.2        tools_4.0.2          
[17] glue_1.6.1            purrr_0.3.4           compiler_4.0.2        pkgconfig_2.0.3      
[21] tidyselect_1.1.1.9000 tibble_3.1.6     

Thanks in advance!

markfairbanks commented 2 years ago

Smaller reprex:

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- lazy_dt(tibble(x = 1:3))

df %>%
  arrange(desc(.data$x))
#> Error: Can't subset `.data` outside of a data mask context.

markfairbanks commented 2 years ago

@joaofgoncalves - all fixed, thanks for catching this.

# devtools::install_github("tidyverse/dtplyr")
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- lazy_dt(tibble(x = 1:3))

df %>%
  arrange(desc(.data$x))
#> Source: local data table [3 x 1]
#> Call:   `_DT1`[order(-x)]
#> 
#>       x
#>   <int>
#> 1     3
#> 2     2
#> 3     1
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
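
For anyone still on a dtplyr build that predates the fix, the workaround from the original report (a bare column name inside desc()) can be applied to the small reprex. This is only a sketch under that assumption, not part of the fix itself. show_query() is used here just to inspect the data.table call that dtplyr generates.

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- lazy_dt(tibble(x = 1:3))

# Bare column name inside desc(): reported above to work even before the fix
res <- df %>%
  arrange(desc(x)) %>%
  as_tibble()

# Print the generated data.table call without collecting the result
df %>%
  arrange(desc(x)) %>%
  show_query()
```

Once the fixed version is installed, desc(.data$x) can be used in the same place.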