markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Add `.unpack` argument to `summarise()` #576

Closed psychelzh closed 2 years ago

psychelzh commented 2 years ago

See the following example:

library(tidytable)
#> 
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#> 
#>     dt
#> The following object is masked from 'package:base':
#> 
#>     %in%
dat <- tidytable(g = rep(c("A", "B"), 100)) |> 
    mutate.(val = rnorm(n.()))
fun <- function(x) {
    data.frame(mean = mean(x), max = max(x))
}
dat |> 
    summarise.(fun(val), .by = g)
#> Error in `[.data.table`(~.df, , .(fun(val)), keyby = "g"): All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
dat |> 
    dplyr::group_by(g) |> 
    dplyr::summarise(fun(val), .groups = "drop")
#> # A tibble: 2 × 3
#>   g         mean   max
#>   <chr>    <dbl> <dbl>
#> 1 A     -0.00809  2.73
#> 2 B      0.230    2.47

Created on 2022-08-29 with reprex v2.0.2

markfairbanks commented 2 years ago

I'm a bit surprised this works without .by to be honest. I'll take a look and see what I can figure out.

markfairbanks commented 2 years ago

@psychelzh unfortunately I won't be able to implement this in tidytable. I can get it to work pretty easily by utilizing vctrs::df_list() inside of data.table's summarize. However there is a pretty large performance loss when it is implemented.

markfairbanks commented 2 years ago

Here are the benchmarks:

pacman::p_load(tidytable, data.table, vctrs)
setDTthreads(4)
set.seed(123)

random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
                    'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')

data_size <- 10000000

df <- tidytable(a = as.double(sample(1:20, data_size, TRUE)),
                b = as.double(sample(1:20, data_size, TRUE)),
                c = sample(random_strings, data_size, TRUE),
                d = sample(random_strings, data_size, TRUE))

bench::mark(
  vctrs_df_list = df[, vctrs::df_list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
  normal_list = df[, list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
  check = FALSE, iterations = 11
) %>%
  select(expression, min, median)
#> # A tidytable: 2 × 3
#>   expression         min   median
#>   <bch:expr>    <bch:tm> <bch:tm>
#> 1 vctrs_df_list    675ms    681ms
#> 2 normal_list      534ms    550ms

As I dig into this more - it looks like data.table's GForce optimizations are turned off if you don't use list().

psychelzh commented 2 years ago

Got it! How about implement it anyway, but document its possible performance loss?

markfairbanks commented 2 years ago

Hmm I'm not sure I want to add a feature that could cause this type of performance loss. Especially since it's an edge case feature.

Maybe I could add an argument that allows you to opt-in to this? It would be different than dplyr, but at least the functionality would exist. Something like .unpack = TRUE?

dat |> 
  summarise.(fun(val), .by = g, .unpack = TRUE)
psychelzh commented 2 years ago

It is reasonable. Opt-in is what I meant by my last comment.

markfairbanks commented 2 years ago

All set.

library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.

df <- tidytable(group = c("a", "a", "b"), val = 1:3)

fun <- function(x) {
  data.frame(mean = mean(x), max = max(x))
}

df %>%
  summarise(fun(val), .by = group, .unpack = TRUE)
#> # A tidytable: 2 × 3
#>   group  mean   max
#>   <chr> <dbl> <int>
#> 1 a       1.5     2
#> 2 b       3       3