tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org
Other
670 stars 57 forks source link

Implement `pick()` and use `across()` output as a data frame #341

Closed markfairbanks closed 2 years ago

markfairbanks commented 2 years ago
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a", "a", "b"), y = 1:3, z = 1:3)

df %>%
  mutate(row_sum = rowSums(across(c(y, z))))
#> # A tibble: 3 × 4
#>   x         y     z row_sum
#>   <chr> <int> <int>   <dbl>
#> 1 a         1     1       2
#> 2 a         2     2       4
#> 3 b         3     3       6
guersam commented 2 years ago

Using lazy_dt produces this error:

Error in rowSums(list(y = y, z = z)) : 
  'x' must be an array of at least two dimensions
markfairbanks commented 2 years ago

Not sure the best way to handle this one. At first I thought the way to solve this would be by checking if an input dot was an across call. If not, capture_dots() would convert the across() to a call to a data.table() call.

Simple example:

dots_fn <- function(...) {
  dots <- enquos(...)
  is_top_across <- map(dots, quo_is_call, "across")
  is_top_across
}

This seems reasonable and works well in the following situations:

df %>%
  mutate(
    # "top" across call treated normally
    across(c(x, y), ~ .x + 1),
    # nested within another call - treated as a `data.table()` call
    row_sum = rowSums(across(c(y, z)))
  )

However I'm finding this approach can fail when dealing with operations applied to list-columns:

library(tidyverse)

list_df <- tibble(a = 1:3, b = 1:3)

df <- tibble(x = c("a", "b"), list_col = list(list_df, list_df))

df %>%
  mutate(
    list_col = map(list_col, ~ .x %>% mutate(across(c(a, b), ~ .x + 1)))
  )

Here we would want the map(mutate()) to treat the across() as a "top" across(). But the outermost mutate() would have already converted it to a call to data.table().

I haven't figured out a good workaround yet.

markfairbanks commented 2 years ago

Actually all we would need to do is catch if arrange/filter/mutate/slice/summarise/summarize are called inside of the dot and then return that part of the call unaltered.

markfairbanks commented 2 years ago

Updates from dplyr

Implement pick(): https://github.com/tidyverse/dplyr/pull/6492 Deprecate across(.fns = NULL): https://github.com/tidyverse/dplyr/pull/6523

So we need to implement pick(). Should we just skip implementing the across() version since it's being deprecated? Or do we allow it for now since this will probably work in dplyr for a while?

eutwt commented 2 years ago

I think implementing pick() is a separate issue from this one, because across output can still be used as a data frame in dplyr even with a non-NULL .fns

library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a", "a", "b"), y = 1:3, z = 1:3)

df %>%
  mutate(row_sum = rowSums(across(c(y, z), sin)))
#> # A tibble: 3 × 4
#>   x         y     z row_sum
#>   <chr> <int> <int>   <dbl>
#> 1 a         1     1   1.68 
#> 2 a         2     2   1.82 
#> 3 b         3     3   0.282

Created on 2022-11-03 with reprex v2.0.2

But, I don't see a good way to get this behavior in dtplyr. I'd say this issue should be closed unless you feel differently (and open a new one for pick())

eutwt commented 2 years ago

Oh, didn't see your earlier comments. Seems you have an idea for how to implement, so disregard my comment :)