joins - Githubissues

moodymudskipper commented 4 years ago

as we don't allow unlabelled input in ... so far (unless spliced), these could be used for joins. However we need to be able to specify the joining columns, which we might to after a ~.

df1[,, df2]
df1[,, df2 ~ c(by_col1 = "by_col2")]

It works well for joins as they're usually performed, but if you want to join only certain columns you must repeat the by columns :

df1[,, df2[c("by_col2", "some_col")] ~ c(by_col1 = "by_col2")]

I think this is awkward. A way would be to use another ~ separator in that case :

df1[,, df2 ~ "some_col" ~ c(by_col1 = "by_col2")] # taking "some_col" from df2 
df1[,, df2 ~~ c(by_col1 = "by_col2")] # taking all from df2 
df1[,, df2 ~ "some_col"] # natural left join taking "some_col" from df2 
df1[,, df2] # natural left join taking all from df2

This join would give a nested result however if the join is not one on one, because our rule is that we don't change the number or rows if we mutate, so we need yet another ~ to expand , and in the worst case we get something like :

df1[,, ~df2 ~ "some_col" ~ c(by_col1 = "by_col2")]

which is the equivalent of :

#base
merge(df1, df2[c("by_col2", "some_col")], by.x = "by_col1", by.y = "by_col2", all.x = TRUE))
# dplyr
left_join(df1, df2[c("by_col2", "some_col")], by = c(by_col1 = "by_col2"))
# data.table
df2[df1, .(by_col2, some_col), on = .(by_col2 = "by_col1")]

Alternately we could have the by argument of the join in an "on" clause as in data.table but it means only one join per call, havingthe argument further from its use and adding an argument that is not really necessary.

moodymudskipper commented 4 years ago

Or just have a special j() function which creates a tb_join_specification object which basically will trigger a call tosafejoin::eat() so j() behaves like eat() without the first arg. This way we'll get all additional features and limit the weird syntax and reparsing, but we add a special function and add brackets.

The last example would become :

df1[,, j(df2, some_col, by = c(by_col1 = "by_col2"))]
# rather than
df1[,, ~df2 ~ "some_col" ~ c(by_col1 = "by_col2")]

Another thought :

Maybe trying to cram complex joins in tb is not the right approach as a complex join probably deserves its explicit specific call and we're probably not going to be better than dplyr, fuzzyjoin, data.table and safejoin combined.

What about being more limiting/ do something different instead ? The semi join in i would fail if subsetting several times the same row, the joins in ... would fail if duplicating entries of the lhs, and we could aggregate on the fly as done in safejoin.

df2 would be aggregated by by columns and an optional aggregation would be operated (which could be list if we want to keep the duplicates to expand after)

moodymudskipper commented 4 years ago

Another idea, a bit crazy but might be intuitive in practice :

we automatically give tb subsetting to data.frames given in ..., which means we don't evaluate right away
in practice it means we allow unnamed arguments if they are data frames or calls to [ with data frames as first args
we add some arguments to [ relative to the join

These arguments would be :

on (default to common columns by default)
conflict (as in safejoin, function to deal with conflicting columns)
agg (as in safejoin, a function to deal with duplicate matches, using list will be equivalent to a nest_join )

We'd have the benefit of doing fast reshaping with tb syntax AND we get to keep data frame at the front and arguments behind.

band_members %>% left_join(band_instruments)
band_members %tb>% .[,,band_instruments]

band_members %>% left_join(band_instruments, by = "name")
band_members %tb>% .[,,band_instruments[on = "name"]]

band_members %>% left_join(band_instruments2, by = c("name" = "artist"))
band_members %tb>% .[,,band_instruments[on = c("name" = "artist")]]
band_members %tb>% .[,,band_instruments[{artist} := "name"]] # equivalent

It's harder to spot back later that it's a join though, while it's clear enough with j .

We can have an hybrid approach :

if no j but data frame input : natural left join , we can wrap in j(df) if we just want to make it explicit
using j : we give [.tb to its first arg (with no additional args)
no need for eat's ... as we have better subsetting in [.tb
but we can include other params from eat

Could be lj rather than j to make it a bit clearer that its "left"

moodymudskipper / tb

joins #16