Closed moodymudskipper closed 5 years ago
For the last point it's important to note that if we want to feed c++
code to our function we'll need to compile a function on the go.
So the types of registered variables will need to be given a type in R, unique sets of those will be extracted from those, then we can pass these DF to the c++ codes where variables will be assigned to vectors so something like match_fuzzy(X("a") + X("b") < Y("c"), cpp = TRUE)
should be translated in a loop to someting like :
temp_lgl = a[i] + b[i] < c[j]
after a
, b
and c
have been created from input.
If it's TRUE a row with i
and j
will be added to the matrix that will be returned by the index_fun
that will wrap the whole thing.
We'll name variables with a prefix to be safe regarding reserved words.
We're growing a matrix without knowing it's final size, not sure what's a good way to do this in C++, seems like we must give an allocation anyway to be efficient, so I guess we could do it by chunk of say 100k values, and use https://stackoverflow.com/questions/17991083/dynamically-increase-size-of-list-in-rcpp
Can we work with explicitly registered variables only ? I'm not sure, the reason why X
and Y
were created is because we can't detect .x$A
AND .x["A"]
AND .x[["A"]
AND {temp <- "A"; .x[temp]}
etc...
X
and Y
force a way to register because they couldn't be detected.
In R
when we register explicitly the expression .x$A
(for instance) will be sorted out but we can't translate R
to C++
automatically, so I think using X
/ Y
is the simplest.
We could support a limited amount of syntaxes, including .x$A
, .x[["A"]]
and .x[[1]]
We'd need to check if :
.x
or .y
are used other than as first arguments of [[
or $
and return error otherwise[[
or $
when the first is used should be either a raw string, a raw integer, or we evaluate this argument beforehand (needs to be constant for full operation) to see what variable should be registered and how the subcall should be substituted in cpp code.That might be used for the non cpp situation too but it's likely to be even more bug inducing, easier with X
and Y
!
This should be implemented, then we should pick an big example of interval_join
and express it in term of regular fuzzy_join
, using regular safe_join
(should be the same but who knows), using interval_join
(IRanges
optimization), and using data.table. Then benchmark and if our results are not good enough, SO + fat bounty.
for non equi join it actually makes sense to sort and use binary search as Arun does, but this is a special case, it seems data.table doesn't support complex fuzzy joins. IRanges
is cited so it seems we're hanging out in the right spheres.
We might need a special case for non equi joins as they can be optimised further (maybe we can just forward them to data.table).
Will most likely never be tackled
fuzzyjoin::interval_join
doesn't do a cartesian product.Other operations that can conveniently be translated in
C++
don't need it either.And I believe it includes all of our shortcut
match_dist*
functions as they are well defined cases of fuzzy join with specific functions applied on specified variables. This includes thematch_closest*
functions.We can have efficient embeded for loops and keep only relevant pairs as we go (could be good for
match_closest*
too). We could havematch_superior
ormatch_inferior
as shortcuts but it's not very clear which side should be superior or inferior so we'd need better names.Even the general
match_fuzzy
could have acpp = FALSE
argument, variables are still registered either explicitly or withX
/Y
and then the call will be used almost as is in C++