moodymudskipper / safejoin

Wrappers around dplyr functions to join safely using various checks
GNU General Public License v3.0
42 stars 7 forks source link

match_* functions to use in `by` arg for different fuzzy or exact matches #33

Closed moodymudskipper closed 2 years ago

moodymudskipper commented 5 years ago

Sometimes we need to do a transformation before a join and drop it afterward, ignoring case is a special occurence of the use case.

For instance:

df1 %>% 
  mutate(join_col_1 = fun(col)) %>% 
  left_join(df2, by = c(join_col_1 = "join_col_2") %>%
  select(-join_col_1)

It can technically be done by "abusing" the fuzzy join features :

df1 %>% 
  safe_left_join(df2, by = ~ fun(X("col")) == Y("join_col_2"))

but we're doing a very costly cartesian product here.

The following interface would be intuitive :

df1 %>% 
  safe_left_join(df2, by = match_equal(fun(col), join_col_2))

Temp columns would be created, be used for the match then dropped.

Instead of a simple ~ the fuzzy match could work with match_fuzzy so we have a consistent interface. This would allow additional parameters such as ignore_case = FALSE, rowwise = FALSE, cartesian = FALSE and match_vars = NULL.

We could implement a max_value and dist_col argument to be explicit about cases where we measure a custom distance and want to add a column.

The formula we have now becomes a shortcut to match_fuzzy with default arguments.

This family of match_* functions can be extrapolated to support features from package fuzzyjoin which were offered as separate functions : difference, distance, genome, geo, regex, interval, stringdist. Our system is more general as we can mix all sorts of these special by arguments (not an obvious feature though, as we wouldn't want to make all cartesian products in one go. The fact that we have all arguments in match_fuzzy makes it convenient so they're all wrappers around it.

Some features can be added, for example counterparts to distance matches functions (match_dist*) could be match_closest*. The counterpart to match_fuzzy could be match_smallest.

What stringr does : instead of a string users can use function fixed() or regex(), these give a class to their output and then the function interprets it.

So far the features we can have are :

Let's try to stick to consistent interfaces as much as possible, though it's not always possible. Some of the above must just translate their arguments to a formula, and fuzzy would indeed just be alias for ~ if we don't implement rowwise.

Fuzzy matches will have an argument cartesian = TRUE, when FALSE we iterate of the side with the least by groups for fuzzy matching, slower but makes sure RAM usage doesn't get out of control. Could come with a progress bar, or a progress bar argument.

moodymudskipper commented 5 years ago

A cool usecase :

https://stackoverflow.com/questions/55325542/how-to-merge-two-data-table-under-two-conditions/55325955#55325955

moodymudskipper commented 5 years ago

Having 2nd thoughts on names, maybe all of them should start with match so they are verbs and easy to list with autocomplete. Still hesitating if match_dist / match_closest need a _num suffix.

We know what those do but not what they return, it could be just a formula for all the fuzzy matches, with optionally rowwise as a lhs, it could also have a class and a printing method.

But what does match_equal return ? It could be a "regular" exact match of the type temp_x = "temp_y" with the definitions of temp_x and temp_y as attributes, it could have a class of its own. The challenge is that if we want these functions to have no side effect, their output needs to be dealt with downstream. If we want to document them it's better to be able to define them out of the function. classes names ? fuzzy_matching_expression and exact_matching_expression ?

moodymudskipper commented 5 years ago

match_all_equal using all_equal and a fuzzy join is not very efficient.

This feature should be supported by match_equal, just using additional parameters such as ignore_element_order or ignore_row_order (by analogy to ignore_case). Maybe ignore_class to be able to merge lists, data frames and tibbles together (would use unclass). Objects would transformed by sorting or unclassing etc then digest::digest would be used to join.

moodymudskipper commented 5 years ago

go back to this when it's done : https://github.com/moodymudskipper/safejoin/issues/31

moodymudskipper commented 5 years ago

To avoid being too magical, and have readable code, these functions should return objects of given classes. They're invisible to the user so it's ok if names are changed.

They all should return an object of class safe_join_match, this object will have an additional class depending on the kind of magic we use.

moodymudskipper commented 5 years ago

I think we need to isolate the by_exact and by_fuzzy like in https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching/55300322#55300322

except here we start from the .by argument, and need to translate the quosures and match_exact objects to character, and handle the other types of matches.

In the end we need to remove temporary columns created to handle match_exact, which means we also need to names those.

If there are fuzzy joins we need to do first semi joins on the exact variables (here semi means the regular join but without eating any variable), then we have several options:

moodymudskipper commented 2 years ago

The fun now happens at https://github.com/moodymudskipper/powerjoin