Rcpp fun ? - Githubissues

moodymudskipper commented 5 years ago

fuzzyjoin::interval_join doesn't do a cartesian product.

Other operations that can conveniently be translated in C++ don't need it either.

And I believe it includes all of our shortcut match_dist* functions as they are well defined cases of fuzzy join with specific functions applied on specified variables. This includes the match_closest* functions.

We can have efficient embeded for loops and keep only relevant pairs as we go (could be good for match_closest* too). We could have match_superior or match_inferior as shortcuts but it's not very clear which side should be superior or inferior so we'd need better names.

Even the general match_fuzzy could have a cpp = FALSE argument, variables are still registered either explicitly or with X / Y and then the call will be used almost as is in C++

moodymudskipper commented 5 years ago

For the last point it's important to note that if we want to feed c++ code to our function we'll need to compile a function on the go.

So the types of registered variables will need to be given a type in R, unique sets of those will be extracted from those, then we can pass these DF to the c++ codes where variables will be assigned to vectors so something like match_fuzzy(X("a") + X("b") < Y("c"), cpp = TRUE) should be translated in a loop to someting like : temp_lgl = a[i] + b[i] < c[j] after a, b and c have been created from input. If it's TRUE a row with i and j will be added to the matrix that will be returned by the index_fun that will wrap the whole thing.

We'll name variables with a prefix to be safe regarding reserved words.

We're growing a matrix without knowing it's final size, not sure what's a good way to do this in C++, seems like we must give an allocation anyway to be efficient, so I guess we could do it by chunk of say 100k values, and use https://stackoverflow.com/questions/17991083/dynamically-increase-size-of-list-in-rcpp

Can we work with explicitly registered variables only ? I'm not sure, the reason why X and Y were created is because we can't detect .x$A AND .x["A"] AND .x[["A"] AND {temp <- "A"; .x[temp]} etc... X and Y force a way to register because they couldn't be detected.

In R when we register explicitly the expression .x$A (for instance) will be sorted out but we can't translate R to C++ automatically, so I think using X / Y is the simplest.

We could support a limited amount of syntaxes, including .x$A , .x[["A"]] and .x[[1]] We'd need to check if :

.x or .y are used other than as first arguments of [[ or $ and return error otherwise
second arguments of [[ or $ when the first is used should be either a raw string, a raw integer, or we evaluate this argument beforehand (needs to be constant for full operation) to see what variable should be registered and how the subcall should be substituted in cpp code.

That might be used for the non cpp situation too but it's likely to be even more bug inducing, easier with X and Y !

moodymudskipper commented 5 years ago

This should be implemented, then we should pick an big example of interval_join and express it in term of regular fuzzy_join, using regular safe_join (should be the same but who knows), using interval_join (IRanges optimization), and using data.table. Then benchmark and if our results are not good enough, SO + fat bounty.

moodymudskipper commented 5 years ago

relevant: https://channel9.msdn.com/events/useR-international-R-User-conference/useR2016/Efficient-in-memory-non-equi-joins-using-datatable

moodymudskipper commented 5 years ago

for non equi join it actually makes sense to sort and use binary search as Arun does, but this is a special case, it seems data.table doesn't support complex fuzzy joins. IRanges is cited so it seems we're hanging out in the right spheres.

We might need a special case for non equi joins as they can be optimised further (maybe we can just forward them to data.table).

moodymudskipper commented 5 years ago

Will most likely never be tackled

moodymudskipper / safejoin

Rcpp fun ? #35