Closed moodymudskipper closed 2 years ago
Having 2nd thoughts on names, maybe all of them should start with `match` so they are verbs and easy to list with autocomplete. Still hesitating if `match_dist` / `match_closest` need a `_num` suffix.
- `~` to work in combination with standard matches
- `match_fuzzy` as an "almost alias" to "~" except for the `rowwise` and `ignore_case` params and that it supports `!!`
- `match_equal`: building temp columns removed after joins, with same `ignore_case` param, if computed columns are lists or data frames (and are of same type), use `digest` on them.
- `match_all_equal`: A match with `all_equal` (not sensitive to column or row order) achieved with `fuzzy_match(all_equal(a,b), rowwise = TRUE)`, so lists and data frames can be matched if different order of columns or rows
- `match_regex` as a wrapper around `match_fuzzy`
- `match_interval` without `IRanges` first, have parameters `x_min, x_max = x_min, y_min, y_max = y_min`
- `match_dist` as a wrapper around `match_fuzzy` with a `max_value` param
- `match_dist_str` similar
- `match_dist_geo` similar
- `match_closest` like `match_dist` but keeping minimum dist only
- `match_closest_str` similar
- `match_closest_geo` similar
- `match_dist_gen` similar
- `match_closest_gen` similar

We know what those do but not what they return, it could be just a formula for all the fuzzy matches, with optionally `rowwise` as a lhs, it could also have a class and a printing method.
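To make the difference between the `match_dist` and `match_closest` families listed above concrete, here is a minimal base R sketch on numeric vectors. The helper names `pairs_within()` and `pairs_closest()` are hypothetical, only `max_value` comes from the discussion, and the real functions would return match objects rather than index pairs.

```r
# Hypothetical helpers, not part of safejoin: they return index pairs
# instead of the match objects discussed in this thread.

# All (i, j) pairs whose absolute difference is at most max_value
pairs_within <- function(x, y, max_value) {
  d <- abs(outer(x, y, "-"))                    # pairwise distance matrix
  idx <- which(d <= max_value, arr.ind = TRUE)  # pairs within the threshold
  data.frame(i = idx[, 1], j = idx[, 2], dist = d[idx])
}

# Same, but keeping only the closest y for each x
pairs_closest <- function(x, y, max_value = Inf) {
  res <- pairs_within(x, y, max_value)
  # within each i, keep the row(s) holding the minimum distance
  keep <- res$dist == ave(res$dist, res$i, FUN = min)
  res[keep, , drop = FALSE]
}

x <- c(1, 10)
y <- c(2, 3, 11)
within2 <- pairs_within(x, y, max_value = 2)
closest <- pairs_closest(x, y, max_value = 2)
```

Both helpers still build the full distance matrix, so this only illustrates the semantics, not an efficient implementation.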
But what does `match_equal` return? It could be a "regular" exact match of the type `temp_x = "temp_y"` with the definitions of `temp_x` and `temp_y` as attributes, and it could have a class of its own. The challenge is that if we want these functions to have no side effects, their output needs to be dealt with downstream.
If we want to document them it's better to be able to define them outside of the function. Class names? `fuzzy_matching_expression` and `exact_matching_expression`?
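A minimal sketch of what such side-effect-free return values might look like. The class names are the ones suggested above; the constructor names, the attribute names `x_def` / `y_def` and the print method are assumptions made for illustration only.

```r
# Hypothetical constructors: only the class names come from the discussion.

# A fuzzy match is just a classed formula; no data is touched here
new_fuzzy_match <- function(formula) {
  stopifnot(inherits(formula, "formula"))
  class(formula) <- c("fuzzy_matching_expression", class(formula))
  formula
}

# An exact match names two temp columns and stores how to compute them
new_exact_match <- function(x, y) {
  x_def <- substitute(x)  # unevaluated definition of temp_x
  y_def <- substitute(y)  # unevaluated definition of temp_y
  structure(
    c(temp_x = "temp_y"),
    x_def = x_def,
    y_def = y_def,
    class = "exact_matching_expression"
  )
}

print.exact_matching_expression <- function(x, ...) {
  cat("<exact match> ", deparse(attr(x, "x_def")), " == ",
      deparse(attr(x, "y_def")), "\n", sep = "")
  invisible(x)
}

fm <- new_fuzzy_match(~ X("a") > Y("b"))
em <- new_exact_match(tolower(name), tolower(NAME))
print(em)
```

Nothing is evaluated against the data at construction time, so the join function downstream stays responsible for building and dropping the temp columns.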
`match_all_equal` using `all_equal` and a fuzzy join is not very efficient.
This feature should be supported by `match_equal`, just using additional parameters such as `ignore_element_order` or `ignore_row_order` (by analogy to `ignore_case`). Maybe `ignore_class` to be able to merge lists, data frames and tibbles together (would use `unclass`). Objects would be transformed by sorting, unclassing, etc., then `digest::digest` would be used to join.
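A rough sketch of that normalise-then-hash idea. The `ignore_*` parameter names come from the comment above; `hash_element()` itself and the way each option is implemented are assumptions, and a deparsed string is used as a fallback so the sketch runs even without digest installed.

```r
# Hypothetical helper: one possible reading of the ignore_* options.
hash_element <- function(el, ignore_element_order = FALSE,
                         ignore_row_order = FALSE, ignore_class = FALSE) {
  if (ignore_element_order && !is.null(names(el))) {
    el <- el[order(names(el))]      # sort columns / list elements by name
  }
  if (ignore_row_order && is.data.frame(el) && nrow(el) > 1) {
    el <- el[do.call(order, unname(as.list(el))), , drop = FALSE]
    rownames(el) <- NULL            # row names would otherwise differ
  }
  if (ignore_class) {
    el <- unclass(el)               # lists, data frames and tibbles align
    attr(el, "row.names") <- NULL
  }
  # digest::digest is what the thread proposes; the fallback is only
  # there to keep the sketch self-contained
  if (requireNamespace("digest", quietly = TRUE)) {
    digest::digest(el)
  } else {
    paste(deparse(el), collapse = "")
  }
}

a <- data.frame(x = c(5, 7), y = c("u", "v"))
b <- data.frame(y = c("v", "u"), x = c(7, 5))
same_when_normalised <- identical(
  hash_element(a, ignore_element_order = TRUE, ignore_row_order = TRUE),
  hash_element(b, ignore_element_order = TRUE, ignore_row_order = TRUE)
)
```

The resulting hashes are plain strings, so they can serve as temp columns for a regular equi-join.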
go back to this when it's done : https://github.com/moodymudskipper/safejoin/issues/31
To avoid being too magical, and to have readable code, these functions should return objects of given classes. They're invisible to the user so it's ok if names are changed.
They all should return an object of class `safe_join_match`; this object will have an additional class depending on the kind of magic we use.
- `safe_join_match_equal` for `match_equal`
- `safe_join_match_fuzzy` for `match_fuzzy` and wrappers around it
- `safe_join_match_intervals` for `match_interval` and wrappers around it (uses IRanges)
- etc...

I think we need to isolate the `by_exact` and `by_fuzzy` like in https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching/55300322#55300322 except here we start from the `.by` argument, and need to translate the quosures and `match_exact` objects to character, and handle the other types of matches.
In the end we need to remove temporary columns created to handle `match_exact`, which means we also need to name those.
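A simplified sketch of the isolation step. It assumes exact matches arrive as (optionally named) character strings and fuzzy matches as formulas; quosures and the match objects are left out, and `split_by()` is a hypothetical name.

```r
# Hypothetical helper: splits a `by` specification into exact and fuzzy parts.
split_by <- function(by) {
  by <- as.list(by)
  is_fuzzy <- vapply(by, inherits, logical(1), what = "formula")
  exact <- unlist(by[!is_fuzzy])
  # unnamed exact elements join on the same column name on both sides
  if (length(exact)) {
    nms <- names(exact)
    if (is.null(nms)) nms <- rep("", length(exact))
    names(exact) <- ifelse(nms == "", exact, nms)
  }
  list(by_exact = exact, by_fuzzy = by[is_fuzzy])
}

spec <- list("id", name_x = "name_y", ~ X("date") >= Y("start"))
res <- split_by(spec)
```

`res$by_exact` has the shape a standard join expects for its `by` argument, while `res$by_fuzzy` keeps the formulas for the later fuzzy step.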
If there are fuzzy joins we need to do first semi joins on the exact variables (here semi means the regular join but without eating any variable), then we have several options:
- `&`. It will however apply some join formula on more combinations than necessary, and it excludes the non standard fuzzy joins (using IRanges or Rcpp for instance).

The fun now happens at https://github.com/moodymudskipper/powerjoin
Sometimes we need to do a transformation before a join and drop it afterward; ignoring case is a special occurrence of this use case.
For instance:
It can technically be done by "abusing" the fuzzy join features :
but we're doing a very costly cartesian product here.
The following interface would be intuitive :
Temp columns would be created, be used for the match then dropped.
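The temp-column mechanism itself can be sketched in base R. This is not the proposed interface, only an illustration of what would happen under the hood; `join_on_transformed()` and the `.tmp_key` column name are hypothetical.

```r
# Hypothetical helper, base R only: join on a transformed key, then drop it.
join_on_transformed <- function(x, y, x_col, y_col, f = tolower) {
  tmp <- ".tmp_key"
  stopifnot(!tmp %in% c(names(x), names(y)))  # avoid clobbering real columns
  x[[tmp]] <- f(x[[x_col]])                   # temp column on the left
  y[[tmp]] <- f(y[[y_col]])                   # temp column on the right
  out <- merge(x, y, by = tmp)                # regular equi-join, no cartesian product
  out[[tmp]] <- NULL                          # drop the temp column afterwards
  out
}

left <- data.frame(name = c("Alice", "BOB"), age = c(30, 40))
right <- data.frame(NAME = c("alice", "bob", "carol"),
                    city = c("Paris", "Rome", "Oslo"))
joined <- join_on_transformed(left, right, "name", "NAME")
```

Both original columns survive the join and only the temp key disappears, which is the behaviour described above.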
Instead of a simple `~`, the fuzzy match could work with `match_fuzzy` so we have a consistent interface. This would allow additional parameters such as `ignore_case = FALSE`, `rowwise = FALSE`, `cartesian = FALSE` and `match_vars = NULL`.
- `match_vars` allows registering variables explicitly for the cartesian join, which can be done implicitly with the `X` and `Y` functions. It also makes those easier to document and understand.
- `ignore_case` does the matching on lower case versions of registered variables
- `rowwise` applies the matching by row, in case returning a boolean vector from vectors is painful (it either masks a call to `dplyr::rowwise()` or `purrr::pmap_lgl()`).

We could implement `max_value` and `dist_col` arguments to be explicit about cases where we measure a custom distance and want to add a column.
The formula we have now becomes a shortcut to `match_fuzzy` with default arguments.
This family of `match_*` functions can be extrapolated to support features from package fuzzyjoin which were offered as separate functions: `difference`, `distance`, `genome`, `geo`, `regex`, `interval`, `stringdist`. Our system is more general as we can mix all sorts of these special by arguments (not an obvious feature though, as we wouldn't want to make all cartesian products in one go). The fact that we have all arguments in `match_fuzzy` makes it convenient, so they're all wrappers around it.
Some features can be added, for example counterparts to distance match functions (`match_dist*`) could be `match_closest*`. The counterpart to `match_fuzzy` could be `match_smallest`.
What `stringr` does: instead of a string, users can use the function `fixed()` or `regex()`; these give a class to their output and then the function interprets it.
So far the features we can have are :
- `match_equal`: the only one from the list which is NOT a fuzzy join, supports matching on transformed values, and matching on unsupported types by using `digest::digest` on elements. (digest is imported by devtools, ggplot2, htmltools, roxygen2... so pretty much everyone has it)
- `match_all_equal`: a fuzzy join + use of `all_equal`
- `match_fuzzy`
- `match_smallest`
- `match_dist` / `match_closest`: includes features of both `fuzzyjoin::difference_join` and `fuzzyjoin::distance_join`: they should be together, the first computes distances column by column, the second with all by variables at the same time, so to have the first feature we can just call it twice.
- `match_dist_gen` / `match_closest_gen`: handles features of `fuzzyjoin::genome_join`: doc says it's a twist of `interval` join though it doesn't call it directly
- `match_dist_geo` / `match_closest_geo`: handles features of `fuzzyjoin::geo_join`
- `match_regex`: This one is a bit weird as it treats x and y differently, the columns in y should contain regex. We can give it a `regex_side = c("right","left")` argument to be general and make it easier to understand. pattern will be passed to `stringr::regex` so we can have more flexibility: `match_regex <- function(x, y , pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, regex_side = c("right","left"), ...){...}`.
- `match_interval`: start and end are given on each side and rows that have an intersection will be joined, this could be translated by `~ X("end") >= Y("start") & X("start") <= Y("end")`. fuzzyjoin uses package `IRanges` from Bioconductor, which avoids the cartesian product, so we should use it as well. This should work with computed values, including constant values (which would be recycled).
- `match_dist_str` / `match_closest_str`: using `stringdist` features.

Let's try to stick to consistent interfaces as much as possible, though it's not always possible. Some of the above must just translate their arguments to a formula, and `fuzzy` would indeed just be an alias for `~` if we don't implement rowwise.
Fuzzy matches will have an argument `cartesian = TRUE`; when `FALSE` we iterate over the side with the fewest `by` groups for fuzzy matching, slower but makes sure RAM usage doesn't get out of control. Could come with a progress bar, or a progress bar argument.