moodymudskipper / safejoin

Wrappers around dplyr functions to join safely using various checks
GNU General Public License v3.0
42 stars 7 forks source link

Rcpp fun ? #35

Closed moodymudskipper closed 5 years ago

moodymudskipper commented 5 years ago

fuzzyjoin::interval_join doesn't do a cartesian product.

Other operations that can conveniently be translated in C++ don't need it either.

And I believe it includes all of our shortcut match_dist* functions as they are well defined cases of fuzzy join with specific functions applied on specified variables. This includes the match_closest* functions.

We can have efficient embeded for loops and keep only relevant pairs as we go (could be good for match_closest* too). We could have match_superior or match_inferior as shortcuts but it's not very clear which side should be superior or inferior so we'd need better names.

Even the general match_fuzzy could have a cpp = FALSE argument, variables are still registered either explicitly or with X / Y and then the call will be used almost as is in C++

moodymudskipper commented 5 years ago

For the last point it's important to note that if we want to feed c++ code to our function we'll need to compile a function on the go.

So the types of registered variables will need to be given a type in R, unique sets of those will be extracted from those, then we can pass these DF to the c++ codes where variables will be assigned to vectors so something like match_fuzzy(X("a") + X("b") < Y("c"), cpp = TRUE) should be translated in a loop to someting like : temp_lgl = a[i] + b[i] < c[j] after a, b and c have been created from input. If it's TRUE a row with i and j will be added to the matrix that will be returned by the index_fun that will wrap the whole thing.

We'll name variables with a prefix to be safe regarding reserved words.

We're growing a matrix without knowing it's final size, not sure what's a good way to do this in C++, seems like we must give an allocation anyway to be efficient, so I guess we could do it by chunk of say 100k values, and use https://stackoverflow.com/questions/17991083/dynamically-increase-size-of-list-in-rcpp

Can we work with explicitly registered variables only ? I'm not sure, the reason why X and Y were created is because we can't detect .x$A AND .x["A"] AND .x[["A"] AND {temp <- "A"; .x[temp]} etc... X and Y force a way to register because they couldn't be detected.

In R when we register explicitly the expression .x$A (for instance) will be sorted out but we can't translate R to C++ automatically, so I think using X / Y is the simplest.

We could support a limited amount of syntaxes, including .x$A , .x[["A"]] and .x[[1]] We'd need to check if :

That might be used for the non cpp situation too but it's likely to be even more bug inducing, easier with X and Y !

moodymudskipper commented 5 years ago

This should be implemented, then we should pick an big example of interval_join and express it in term of regular fuzzy_join, using regular safe_join (should be the same but who knows), using interval_join (IRanges optimization), and using data.table. Then benchmark and if our results are not good enough, SO + fat bounty.

moodymudskipper commented 5 years ago

relevant: https://channel9.msdn.com/events/useR-international-R-User-conference/useR2016/Efficient-in-memory-non-equi-joins-using-datatable

moodymudskipper commented 5 years ago

for non equi join it actually makes sense to sort and use binary search as Arun does, but this is a special case, it seems data.table doesn't support complex fuzzy joins. IRanges is cited so it seems we're hanging out in the right spheres.

We might need a special case for non equi joins as they can be optimised further (maybe we can just forward them to data.table).

moodymudskipper commented 5 years ago

Will most likely never be tackled