Closed psychelzh closed 2 years ago
I'm a bit surprised this works without .by
to be honest. I'll take a look and see what I can figure out.
@psychelzh unfortunately I won't be able to implement this in tidytable
. I can get it to work pretty easily by utilizing vctrs::df_list()
inside of data.table's summarize. However there is a pretty large performance loss when it is implemented.
Here are the benchmarks:
pacman::p_load(tidytable, data.table, vctrs)
setDTthreads(4)
set.seed(123)
random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')
data_size <- 10000000
df <- tidytable(a = as.double(sample(1:20, data_size, TRUE)),
b = as.double(sample(1:20, data_size, TRUE)),
c = sample(random_strings, data_size, TRUE),
d = sample(random_strings, data_size, TRUE))
bench::mark(
vctrs_df_list = df[, vctrs::df_list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
normal_list = df[, list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
check = FALSE, iterations = 11
) %>%
select(expression, min, median)
#> # A tidytable: 2 × 3
#> expression min median
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 vctrs_df_list 675ms 681ms
#> 2 normal_list 534ms 550ms
As I dig into this more - it looks like data.table
's GForce optimizations are turned off if you don't use list()
.
Got it! How about implement it anyway, but document its possible performance loss?
Hmm I'm not sure I want to add a feature that could cause this type of performance loss. Especially since it's an edge case feature.
Maybe I could add an argument that allows you to opt-in to this? It would be different than dplyr, but at least the functionality would exist. Something like .unpack = TRUE
?
dat |>
summarise.(fun(val), .by = g, .unpack = TRUE)
It is reasonable. Opt-in is what I meant by my last comment.
All set.
library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.
df <- tidytable(group = c("a", "a", "b"), val = 1:3)
fun <- function(x) {
data.frame(mean = mean(x), max = max(x))
}
df %>%
summarise(fun(val), .by = group, .unpack = TRUE)
#> # A tidytable: 2 × 3
#> group mean max
#> <chr> <dbl> <int>
#> 1 a 1.5 2
#> 2 b 3 3
See the following example:
Created on 2022-08-29 with reprex v2.0.2