Closed Darxor closed 1 year ago
Yep we can add this to the joins. expand_grid()
actually works on data frames so this functionality is already there for us to use internally. It also seems to perform well compared to the options from the comment you linked.
pacman::p_load(data.table, tidytable)
CJ.dt_2 <- function(...) {
DTls <- list(...)
rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
res <- DTls[[1L]][rows[[1L]]]
for (n in seq_along(DTls)[-1L])
res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
res
}
CJDT <- function(...) {
out <- Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
as_tidytable(out)
}
df1 <- tidytable(a = stringi::stri_rand_strings(10000, 4),
b = round(runif(10000), 2))
df2 <- tidytable(c = round(runif(1000), 2))
# Verify outputs are the same (given the same sorting)
all.equal(
CJDT(df1, df2) %>% arrange(across(everything())),
expand_grid(df1, df2) %>% arrange(across(everything()))
)
#> [1] TRUE
bench::mark(
CJ.dt_2 = CJ.dt_2(df1, df2),
CJDT = CJDT(df1, df2),
expand_grid = expand_grid(df1, df2),
check = FALSE, iterations = 30
) %>%
select(expression, median, mem_alloc)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tidytable: 3 × 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 CJ.dt_2 342ms 687MB
#> 2 CJDT 220ms 343MB
#> 3 expand_grid 123ms 305MB
@Darxor is there any functional difference between the different types of joins when doing a cross join? Or are they all always the same?
Oh, I didn't even notice expand_grid()
!
@Darxor is there any functional difference between the different types of joins when doing a cross join? Or are they all always the same?
As far as I've tested - left, right, full, inner all behave the same way down to column / row order.
anti produces a zero-length data.frame, semi just returns the value of x.
nest_join()
produces a data.frame with columns from x with list-column containing copies of y in each row
Gotcha, thanks for testing all of the options 😄
And now that you mention it, nest_join()
should probably be implemented. I'll open a separate issue for that.
Edit: nest_join()
is now implemented - #657
All set.
pacman::p_load(tidytable)
df1 <- data.frame(a = letters[1:2])
df2 <- data.frame(b = letters[3:4])
left_join(df1, df2, character())
#> # A tidytable: 4 × 2
#> a b
#> <chr> <chr>
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
And a speed test.
pacman::p_load(tidytable, stringi)
df1 <- tidytable(a = stri_rand_strings(10000, 4),
b = round(runif(10000), 2))
df2 <- tidytable(c = round(runif(1000), 2))
bench::mark(
dplyr = dplyr::left_join(df1, df2, character()),
tidytable = left_join(df1, df2, character()),
check = FALSE, iterations = 30
) %>%
select(expression, median, mem_alloc)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tidytable: 2 × 3
#> expression median mem_alloc
#> <bch:expr> <bch:tm> <bch:byt>
#> 1 dplyr 270ms 315MB
#> 2 tidytable 202ms 305MB
Currently
{dplyr}
supports cross-joins of data.frames through the use ofby = character()
, and its reflected in the manual as a feature.{tidytable}
does not have such feature yet, which could be problematic when using it as a drop-in replacement for{dplyr}
. See code below for details.The problem is
{data.table}
does not natively support cross-joins yet (https://github.com/Rdatatable/data.table/issues/1717, https://github.com/Rdatatable/data.table/pull/4544), so this would require full implementation. Probably https://github.com/Rdatatable/data.table/issues/1717#issuecomment-545758165 could be copied over if given permission.Alternatively, if its best to wait for
{data.table}
to merge this feature,{tidytable}
could raise a more meaningful error, when trying to use cross-join.Created on 2022-10-17 with reprex v2.0.2