match_* functions to use in `by` arg for different fuzzy or exact matches

moodymudskipper commented 5 years ago

Sometimes we need to do a transformation before a join and drop it afterward; ignoring case is a special occurrence of this use case.

For instance:

df1 %>% 
  mutate(join_col_1 = fun(col)) %>% 
  left_join(df2, by = c(join_col_1 = "join_col_2")) %>%
  select(-join_col_1)

It can technically be done by "abusing" the fuzzy join features :

df1 %>% 
  safe_left_join(df2, by = ~ fun(X("col")) == Y("join_col_2"))

but we're doing a very costly cartesian product here.

The following interface would be intuitive :

df1 %>% 
  safe_left_join(df2, by = match_equal(fun(col), join_col_2))

Temp columns would be created, be used for the match then dropped.
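To make the "temp columns" idea concrete, here is a minimal sketch in base R, not the actual safejoin implementation. The helper name join_on_transformed and the temp column name .tmp_key are made up for illustration, and it assumes neither data frame already has a column with that name.

```r
# Hypothetical helper: left join x to y on fun(x[[x_col]]) == y[[y_col]],
# using a temporary key column that is dropped after the join.
join_on_transformed <- function(x, y, x_col, y_col, fun = identity) {
  tmp <- ".tmp_key"                 # assumed not to clash with existing columns
  x[[tmp]] <- fun(x[[x_col]])       # transformed key on the left
  y[[tmp]] <- y[[y_col]]            # plain key on the right
  out <- merge(x, y, by = tmp, all.x = TRUE, sort = FALSE)  # exact join, no cartesian product
  out[[tmp]] <- NULL                # drop the temp column
  out
}

df1 <- data.frame(col = c("A", "b", "C"), v1 = 1:3)
df2 <- data.frame(join_col_2 = c("a", "b"), v2 = c(10, 20))
res <- join_on_transformed(df1, df2, "col", "join_col_2", fun = tolower)
```

The point is that the join itself stays a regular exact join, so the cost is the same as joining on a precomputed column.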

Instead of a simple ~ the fuzzy match could work with match_fuzzy so we have a consistent interface. This would allow additional parameters such as ignore_case = FALSE, rowwise = FALSE, cartesian = FALSE and match_vars = NULL.

match_vars explicitly registers the variables needed for the cartesian join, which can otherwise be done implicitly with the X and Y functions. It also makes those easier to document and understand.
ignore_case does the matching on lower case versions of the registered variables.
rowwise applies the matching row by row, in case returning a boolean vector from vectors is painful (it masks a call to either dplyr::rowwise() or purrr::pmap_lgl()).
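A rough base R sketch of what rowwise would change, under the assumption that the predicate receives the cartesian product as a data frame. The function name fuzzy_pairs and its signature are hypothetical, and it returns matching pairs only (inner-join style) to keep things short.

```r
# Hypothetical helper: keep the pairs of rows from x and y that satisfy pred.
# With rowwise = TRUE, pred is called once per pair and must return a single
# TRUE/FALSE; otherwise it is called once on all pairs and must be vectorized.
fuzzy_pairs <- function(x, y, pred, rowwise = FALSE) {
  pairs <- merge(x, y, by = NULL)   # cartesian product of x and y
  keep <- if (rowwise) {
    vapply(seq_len(nrow(pairs)),
           function(i) isTRUE(pred(pairs[i, , drop = FALSE])),
           logical(1))
  } else {
    pred(pairs)
  }
  pairs[keep, , drop = FALSE]
}

x <- data.frame(a = c(1, 5))
y <- data.frame(b = c(2, 6, 10))

# vectorized predicate
close_vec <- function(p) abs(p$a - p$b) <= 1
# predicate that only works on one pair at a time
close_one <- function(p) if (abs(p$a - p$b) <= 1) TRUE else FALSE

res_vec <- fuzzy_pairs(x, y, close_vec)
res_row <- fuzzy_pairs(x, y, close_one, rowwise = TRUE)
```

Both calls keep the same pairs; rowwise = TRUE just trades speed for not having to vectorize the predicate.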

We could implement a max_value and dist_col argument to be explicit about cases where we measure a custom distance and want to add a column.

The formula we have now becomes a shortcut to match_fuzzy with default arguments.

This family of match_* functions can be extended to support features from package fuzzyjoin that were offered as separate functions : difference, distance, genome, geo, regex, interval, stringdist. Our system is more general as we can mix all sorts of these special by arguments (not an obvious feature though, as we wouldn't want to make all cartesian products in one go). Since match_fuzzy has all the arguments, the others can conveniently all be wrappers around it.

Some features can be added, for example counterparts to distance matches functions (match_dist*) could be match_closest*. The counterpart to match_fuzzy could be match_smallest.

What stringr does : instead of a string users can use function fixed() or regex(), these give a class to their output and then the function interprets it.

So far the features we can have are :

match_equal : the only one from the list which is NOT a fuzzy join, supports matching on transformed values, and matching on unsupported types by using digest::digest on elements. (digest is imported by devtools, ggplot2, htmltools, roxygen2... so pretty much everyone has it)
match_all_equal : a fuzzy join + use of all_equal
match_fuzzy
match_smallest
match_dist / match_closest : include features of both fuzzyjoin::difference_join and fuzzyjoin::distance_join, which belong together : the first computes distances column by column, the second with all by variables at the same time, so to get the first feature we can just call it twice.
match_dist_gen / match_closest_gen : handles features of `fuzzyjoin::genome_join` : the doc says it's a twist on interval join, though it doesn't call it directly
match_dist_geo / match_closest_geo : handles features of `fuzzyjoin::geo_join`
match_regex : This one is a bit weird as it treats x and y differently, the columns in y should contain the regex. We can give it a regex_side = c("right","left") argument to be general and make it easier to understand. pattern will be passed to stringr::regex so we can have more flexibility : match_regex <- function(x, y, pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, regex_side = c("right","left"), ...){...}.
match_interval : start and end are given on each side and rows whose intervals intersect will be joined. This could be translated to ~ X("end") >= Y("start") & X("start") <= Y("end"), but fuzzyjoin uses package IRanges from Bioconductor, which avoids the cartesian product, so we should use it as well. This should work with computed values, including constant values (which would be recycled).
match_dist_str / match_closest_str : using stringdist features.
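As an illustration of the match_interval condition from the list above, here is a naive base R sketch that does go through the cartesian product (so it is exactly what IRanges would let us avoid). The column names x_start, x_end, y_start and y_end are made up for the example.

```r
x <- data.frame(x_start = c(1, 10), x_end = c(5, 12))
y <- data.frame(y_start = c(4, 6, 11), y_end = c(4.5, 9, 20))

pairs <- merge(x, y, by = NULL)   # cartesian product, 2 x 3 = 6 pairs

# two closed intervals intersect when each one starts before the other ends
overlaps <- pairs$x_end >= pairs$y_start & pairs$x_start <= pairs$y_end
res <- pairs[overlaps, , drop = FALSE]
```

Here [1, 5] overlaps [4, 4.5] and [10, 12] overlaps [11, 20], so two pairs are kept.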

Let's try to stick to consistent interfaces as much as possible, though it's not always possible. Some of the above must just translate their arguments to a formula, and fuzzy would indeed just be alias for ~ if we don't implement rowwise.

Fuzzy matches will have an argument cartesian = TRUE. When FALSE, we iterate over the side with the fewest by groups for fuzzy matching, which is slower but makes sure RAM usage doesn't get out of control. It could come with a progress bar, or a progress bar argument.
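A sketch of the cartesian = FALSE idea in base R, simplified to iterate over the rows of x rather than over by groups (that simplification is an assumption of this example, as is the function name fuzzy_pairs_chunked). Only one row's worth of pairs is held in memory at a time.

```r
# Hypothetical helper: same result as filtering the full cartesian product,
# but built one row of x at a time to bound memory usage.
fuzzy_pairs_chunked <- function(x, y, pred) {
  pieces <- lapply(seq_len(nrow(x)), function(i) {
    pairs <- merge(x[i, , drop = FALSE], y, by = NULL)  # 1 x nrow(y) pairs
    pairs[pred(pairs), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

x <- data.frame(a = c(1, 5))
y <- data.frame(b = c(2, 6, 10))
pred <- function(p) abs(p$a - p$b) <= 1

chunked <- fuzzy_pairs_chunked(x, y, pred)

# reference: the full cartesian product filtered in one go
full <- merge(x, y, by = NULL)
full <- full[pred(full), , drop = FALSE]
```

A progress bar would naturally hook into the lapply() loop.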

moodymudskipper commented 5 years ago

A cool use case :

https://stackoverflow.com/questions/55325542/how-to-merge-two-data-table-under-two-conditions/55325955#55325955

moodymudskipper commented 5 years ago

Having 2nd thoughts on names, maybe all of them should start with match so they are verbs and easy to list with autocomplete. Still hesitating if match_dist / match_closest need a _num suffix.

[ ] fuzzy match using ~ to work in combination with standard matches
[ ] match_fuzzy as an "almost alias" to "~" except for the rowwise and ignore_case params and that it supports !!
[ ] match_equal building temp columns removed after joins, with same ignore_case param, if computed columns are lists or data frames (and are of same type), use digest on them.
[ ] match_all_equal : A match with all_equal (not sensitive to column or row order) achieved with match_fuzzy(all_equal(a,b), rowwise = TRUE), so lists and data frames can be matched even if the order of columns or rows differs
[ ] match_regex as a wrapper around match_fuzzy
[ ] match_interval without IRanges first, have parameters x_min, x_max = x_min, y_min, y_max = y_min
[ ] match_dist as a wrapper around match_fuzzy with a max_value param
[ ] match_dist_str similar
[ ] match_dist_geo similar
[ ] match_closest like match_dist but keeping minimum dist only
[ ] match_closest_str similar
[ ] match_closest_geo similar
[ ] match_dist_gen similar
[ ] match_closest_gen similar

We know what those do but not what they return. It could be just a formula for all the fuzzy matches, optionally with rowwise as a lhs. It could also have a class and a printing method.

But what does match_equal return ? It could be a "regular" exact match of the type temp_x = "temp_y" with the definitions of temp_x and temp_y as attributes, or it could have a class of its own. The challenge is that if we want these functions to have no side effects, their output needs to be dealt with downstream. If we want to document them, it's better to be able to define them outside of the join function. Class names ? fuzzy_matching_expression and exact_matching_expression ?

moodymudskipper commented 5 years ago

match_all_equal using all_equal and a fuzzy join is not very efficient.

This feature should be supported by match_equal, just using additional parameters such as ignore_element_order or ignore_row_order (by analogy to ignore_case). Maybe ignore_class too, to be able to merge lists, data frames and tibbles together (would use unclass). Objects would be transformed by sorting or unclassing etc., then digest::digest would be used to join.
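A small base R sketch of the ignore_element_order idea: normalize each element by sorting, turn it into a key, then do an exact join on the key. To stay dependency free this swaps digest::digest for a plain pasted string, which only works for simple atomic elements; the helper name order_insensitive_key is made up.

```r
# Hypothetical helper: one character key per list element, insensitive to
# the order of values inside the element. A real version would hash the
# normalized object (e.g. with digest::digest) instead of pasting it.
order_insensitive_key <- function(lst) {
  vapply(lst, function(e) paste(sort(unlist(e)), collapse = "\037"), character(1))
}

x <- data.frame(id = 1:2)
x$items <- list(c("b", "a"), "c")
y <- data.frame(label = c("ab", "cd"))
y$items <- list(c("a", "b"), c("c", "d"))

x$key <- order_insensitive_key(x$items)
y$key <- order_insensitive_key(y$items)

# exact join on the key, leaving the list columns out of the merge
res <- merge(x[c("id", "key")], y[c("label", "key")], by = "key")
res$key <- NULL
```

c("b", "a") and c("a", "b") get the same key, so id 1 is matched with label "ab" without any fuzzy join.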

moodymudskipper commented 5 years ago

go back to this when it's done : https://github.com/moodymudskipper/safejoin/issues/31

moodymudskipper commented 5 years ago

To avoid being too magical, and have readable code, these functions should return objects of given classes. They're invisible to the user so it's ok if names are changed.

They all should return an object of class safe_join_match, this object will have an additional class depending on the kind of magic we use.

safe_join_match_equal for match_equal
safe_join_match_fuzzy for match_fuzzy and wrappers around it
safe_join_match_intervals for match_interval and wrappers around it (uses IRanges) etc...
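What this could look like with plain S3, as a sketch only: the constructor new_safe_join_match and the field names lhs, rhs and params are assumptions for this example, only the class names come from the list above.

```r
# Hypothetical low-level constructor shared by all match_* functions
new_safe_join_match <- function(lhs, rhs, subclass, ...) {
  structure(
    list(lhs = lhs, rhs = rhs, params = list(...)),
    class = c(subclass, "safe_join_match")
  )
}

# match_equal captures its arguments unevaluated, so it has no side effects;
# the join function is responsible for evaluating them later
match_equal <- function(x, y, ignore_case = FALSE) {
  new_safe_join_match(substitute(x), substitute(y),
                      "safe_join_match_equal", ignore_case = ignore_case)
}

# shared printing method for every kind of match object
print.safe_join_match <- function(x, ...) {
  cat("<", class(x)[1], "> ", deparse(x$lhs), " / ", deparse(x$rhs), "\n", sep = "")
  invisible(x)
}

m <- match_equal(tolower(col), join_col_2, ignore_case = TRUE)
```

The join function can then branch on inherits(m, "safe_join_match_equal") and friends instead of guessing from the shape of the by argument.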

moodymudskipper commented 5 years ago

I think we need to isolate the by_exact and by_fuzzy like in https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching/55300322#55300322

except here we start from the .by argument, and need to translate the quosures and match_exact objects to character, and handle the other types of matches.

In the end we need to remove the temporary columns created to handle match_exact, which means we also need to name those.

If there are fuzzy joins, we first need to do semi joins on the exact variables (here semi means the regular join but without eating any variable), then we have several options:

We can gather the names of all the variables necessary for the cartesian product, and then apply all conditions with &. It will however apply some join formulas on more combinations than necessary, and it excludes the non standard fuzzy joins (using IRanges or Rcpp for instance).
We can have different behaviors depending on the join type. For a left join we loop on the fuzzy joins one by one, and we'd add the variables from the rhs on the last join. An inner join would be the same. For a right join we start with a subset containing grouping columns, eaten vars, and join vars, then remove the join vars in the end. For a full join it's a bit more complex.
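The basic "exact first, fuzzy second" idea behind both options can be sketched in base R as below. This is an inner-join-only simplification, and the function name exact_then_fuzzy and the column names are made up for the example.

```r
# Hypothetical helper: join on the exact keys first, so the fuzzy condition
# is only evaluated on pairs that already agree on by_exact.
exact_then_fuzzy <- function(x, y, by_exact, pred) {
  pairs <- merge(x, y, by = by_exact)   # cartesian product within each exact group only
  pairs[pred(pairs), , drop = FALSE]
}

x <- data.frame(id = c(1, 1, 2), t_x = c(1, 5, 3))
y <- data.frame(id = c(1, 2, 2), t_y = c(4.5, 3.2, 9))

res <- exact_then_fuzzy(x, y, by_exact = "id",
                        pred = function(p) abs(p$t_x - p$t_y) <= 1)
```

Only 4 candidate pairs are tested here instead of the 9 a full cartesian product would give; several fuzzy conditions could be combined with & inside pred, which is the first option above.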

moodymudskipper commented 2 years ago

The fun now happens at https://github.com/moodymudskipper/powerjoin

moodymudskipper / safejoin

match_* functions to use in `by` arg for different fuzzy or exact matches #33