Support data.frame cross-join

Darxor commented 1 year ago

Currently {dplyr} supports cross-joins of data.frames through `by = character()`, and this is documented in the manual as a feature. {tidytable} does not have such a feature yet, which can be a problem when using it as a drop-in replacement for {dplyr}. See the code below for details.

The problem is that {data.table} does not natively support cross-joins yet (https://github.com/Rdatatable/data.table/issues/1717, https://github.com/Rdatatable/data.table/pull/4544), so this would require a full implementation in {tidytable}. The code in https://github.com/Rdatatable/data.table/issues/1717#issuecomment-545758165 could probably be copied over, if permission is given.
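
To illustrate what such an implementation might involve, here is a minimal sketch of a cross-join built from row indices. It is not the code from the linked comment and not anything {tidytable} provides; `cross_join_sketch()` is a hypothetical name, and the sketch assumes the two inputs have no column names in common (it does no suffix handling).

```r
library(data.table)

# Hypothetical helper: pair every row of `x` with every row of `y`.
# Assumes `x` and `y` share no column names (no suffixes are added).
cross_join_sketch <- function(x, y) {
  x <- as.data.table(x)
  y <- as.data.table(y)

  # Repeat each row of `x` once per row of `y`, and cycle through `y`
  # once per row of `x`, so the row order matches the dplyr output.
  ix <- rep(seq_len(nrow(x)), each = nrow(y))
  iy <- rep(seq_len(nrow(y)), times = nrow(x))

  cbind(x[ix], y[iy])
}

df1 <- data.frame(a = letters[1:2])
df2 <- data.frame(b = letters[3:4])
res <- cross_join_sketch(df1, df2)
```

With the example data this should give the same four rows as the {dplyr} calls, returned as a data.table.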

Alternatively, if it's best to wait for {data.table} to merge this feature, {tidytable} could raise a more meaningful error when a cross-join is attempted.
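
A rough idea of what that check could look like, as a sketch only: `check_by_sketch()` is a hypothetical helper written in base R, and the wording of the message is just a suggestion. Nothing here reflects how {tidytable} actually validates `by`.

```r
# Hypothetical helper: reject an empty `by` before it reaches data.table,
# so the user sees a message about cross-joins instead of `.parse_on()`.
check_by_sketch <- function(by) {
  # `by = NULL` is left alone, since it normally means "use common columns".
  if (!is.null(by) && length(by) == 0L) {
    stop(
      "Cross-joins via `by = character()` are not supported yet.",
      call. = FALSE
    )
  }
  invisible(by)
}

# Capture the message instead of stopping the script.
msg <- tryCatch(
  check_by_sketch(character()),
  error = function(e) conditionMessage(e)
)
```

Each join function would call a check like this first, which would also make the error consistent between `full_join()` and the other joins.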

df1 <- data.frame(a = letters[1:2])
df2 <- data.frame(b = letters[3:4])

# all of these produce the same data.frame
dplyr::left_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::right_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::inner_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d
dplyr::full_join(df1, df2, by = character())
#>   a b
#> 1 a c
#> 2 a d
#> 3 b c
#> 4 b d

# produces an error
tidytable::left_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::right_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::inner_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
# produces a different error
tidytable::full_join(df1, df2, by = character())
#> Error in merge.data.table(x = x, y = y, by.x = by$x, by.y = by$y, suffixes = suffix, : A non-empty vector of column names is required for `by.x` and `by.y`.

# Special cases:
# zero-length in anti-join
dplyr::anti_join(df1, df2, by = character())
#> [1] a
#> <0 rows> (or 0-length row.names)
# original x in semi_join
dplyr::semi_join(df1, df2, by = character())
#>   a
#> 1 a
#> 2 b

# produces an error, same as left_join
tidytable::anti_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.
tidytable::semi_join(df1, df2, by = character())
#> Error in .parse_on(substitute(on), isnull_inames): 'on' argument should be a named atomic vector of column names indicating which columns in 'i' should be joined with which columns in 'x'.

Created on 2022-10-17 with reprex v2.0.2