markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
449 stars 33 forks source link

Support data.frame cross-join #655

Closed Darxor closed 1 year ago

Darxor commented 1 year ago

Currently {dplyr} supports cross-joins of data.frames through the use of by = character(), and its reflected in the manual as a feature. {tidytable} does not have such feature yet, which could be problematic when using it as a drop-in replacement for {dplyr}. See code below for details.

The problem is {data.table} does not natively support cross-joins yet (https://github.com/Rdatatable/data.table/issues/1717, https://github.com/Rdatatable/data.table/pull/4544), so this would require full implementation. Probably https://github.com/Rdatatable/data.table/issues/1717#issuecomment-545758165 could be copied over if given permission.

Alternatively, if its best to wait for {data.table} to merge this feature, {tidytable} could raise a more meaningful error, when trying to use cross-join.

df1 <- data.frame(a = letters[1:2])
df2 <- data.frame(b = letters[3:4])

# all result in a same df
dplyr::left_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::right_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::inner_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::full_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d

# produces an error
tidytable::left_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::right_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::inner_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
# produces a different error
tidytable::full_join(df1, df2, by = character())
#> Error in merge.data.table(x = x, y = y, by.x = by$x, by.y = by$y, suffixes = suffix, : A non-empty vector of column names is required for `by.x` and `by.y`.

# Special cases:
# zero-length in anti-join
dplyr::anti_join(df1, df2, by = character())
#> [1] a
#> <0 rows> (or 0-length row.names)
# original x in semi_join
dplyr::semi_join(df1, df2, by = character())
#>   a
#> 1 a
#> 2 b

# produces an error, same as left_join
tidytable::anti_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::semi_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.

Created on 2022-10-17 with reprex v2.0.2

markfairbanks commented 1 year ago

Yep we can add this to the joins. expand_grid() actually works on data frames so this functionality is already there for us to use internally. It also seems to perform well compared to the options from the comment you linked.

pacman::p_load(data.table, tidytable)

CJ.dt_2 <- function(...) {
    DTls <- list(...)
    rows <- do.call(CJ, lapply(DTls, function(x) x[, seq_len(.N)]))
    res <- DTls[[1L]][rows[[1L]]]
    for (n in seq_along(DTls)[-1L])
        res <- res[, c(.SD, DTls[[n]][rows[[n]]])]
    res
}

CJDT <- function(...) {
  out <- Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
  as_tidytable(out)
}

df1 <- tidytable(a = stringi::stri_rand_strings(10000, 4),
                 b = round(runif(10000), 2))
df2 <- tidytable(c = round(runif(1000), 2))

# Verify outputs are the same (given the same sorting)
all.equal(
  CJDT(df1, df2) %>% arrange(across(everything())),
  expand_grid(df1, df2) %>% arrange(across(everything()))
)
#> [1] TRUE

bench::mark(
  CJ.dt_2 = CJ.dt_2(df1, df2),
  CJDT = CJDT(df1, df2),
  expand_grid = expand_grid(df1, df2),
  check = FALSE, iterations = 30
) %>%
  select(expression, median, mem_alloc)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tidytable: 3 × 3
#>   expression    median mem_alloc
#>   <bch:expr>  <bch:tm> <bch:byt>
#> 1 CJ.dt_2        342ms     687MB
#> 2 CJDT           220ms     343MB
#> 3 expand_grid    123ms     305MB
markfairbanks commented 1 year ago

@Darxor is there any functional difference between the different types of joins when doing a cross join? Or are they all always the same?

Darxor commented 1 year ago

Oh, I didn't even notice expand_grid()!

@Darxor is there any functional difference between the different types of joins when doing a cross join? Or are they all always the same?

As far as I've tested - left, right, full, inner all behave the same way down to column / row order. anti produces a zero-length data.frame, semi just returns the value of x. nest_join() produces a data.frame with columns from x with list-column containing copies of y in each row

markfairbanks commented 1 year ago

Gotcha, thanks for testing all of the options 😄

And now that you mention it, nest_join() should probably be implemented. I'll open a separate issue for that. Edit: nest_join() is now implemented - #657

markfairbanks commented 1 year ago

All set.

pacman::p_load(tidytable)

df1 <- data.frame(a = letters[1:2])
df2 <- data.frame(b = letters[3:4])

left_join(df1, df2, character())
#> # A tidytable: 4 × 2
#>   a     b    
#>   <chr> <chr>
#> 1 a     c    
#> 2 a     d    
#> 3 b     c    
#> 4 b     d
markfairbanks commented 1 year ago

And a speed test.

pacman::p_load(tidytable, stringi)

df1 <- tidytable(a = stri_rand_strings(10000, 4),
                 b = round(runif(10000), 2))
df2 <- tidytable(c = round(runif(1000), 2))

bench::mark(
  dplyr = dplyr::left_join(df1, df2, character()),
  tidytable = left_join(df1, df2, character()),
  check = FALSE, iterations = 30
) %>%
  select(expression, median, mem_alloc)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tidytable: 2 × 3
#>   expression   median mem_alloc
#>   <bch:expr> <bch:tm> <bch:byt>
#> 1 dplyr         270ms     315MB
#> 2 tidytable     202ms     305MB