moodymudskipper / safejoin

Wrappers around dplyr functions to join safely using various checks
GNU General Public License v3.0

With fuzzy matching, column conflict on by columns is awkward #25

Closed moodymudskipper closed 2 years ago

moodymudskipper commented 5 years ago

safejoin doesn't like ".x" and ".y" suffixes, which is a good thing, but fuzzy joining very often duplicates the joining columns. With eat we can use the prefix argument, but it's not part of the safe_*_join functions, and it would be quite convenient to be able to rename on the fly in X and Y.

safe_inner_join(df1, df2, by= ~X(id_x = "id") > Y(id_y = "id"))
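For comparison, here is a minimal sketch of what has to be done by hand without that syntax. It uses only base R, with `merge(by = NULL)` plus a filter standing in for a fuzzy join; it is not safejoin's implementation, and `df1`, `df2` and their contents are made up for illustration.

```r
# Toy data (assumed for illustration): both tables share the join column "id".
df1 <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
df2 <- data.frame(id = c(0, 2, 5), b = c("u", "v", "w"))

# The problem: a cross join keeps both "id" columns, so they come back
# suffixed as "id.x" and "id.y".
clash <- merge(df1, df2, by = NULL)

# Manual workaround: rename the join columns before joining.
x <- setNames(df1, sub("^id$", "id_x", names(df1)))
y <- setNames(df2, sub("^id$", "id_y", names(df2)))

# Emulate the fuzzy condition id_x > id_y: cross join, then filter.
pairs <- merge(x, y, by = NULL)
res <- pairs[pairs$id_x > pairs$id_y, ]
```

The proposed `X(id_x = "id")` / `Y(id_y = "id")` syntax would fold the two `setNames()` steps into the `by` formula itself.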

That's not perfect though, as we have to decide what happens when X or Y are used several times but inconsistently:

safe_inner_join(df1, df2, by=~f(X(id_x = "id"), Y(id_y = "id")) == g(X(id_1 = "id"), Y("id")))

It's probably better to just trigger an error in case of inconsistency, but that means we need to be redundant. To avoid redundancy we could allow at most one renaming and apply it everywhere, but maybe that's confusing?
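A rough sketch of how the two rules could combine, assuming the arguments of every `X(...)` (or every `Y(...)`) call have already been collected into a list. `resolve_renames` is a hypothetical helper, not part of safejoin, and "one renaming maximum" is read here as one new name per column: unnamed uses inherit it, and two different new names for the same column raise an error.

```r
# Hypothetical helper (not in safejoin). Each element of `calls` is what one
# X(...) call received: c(id_x = "id") for a rename, "id" for no rename.
resolve_renames <- function(calls) {
  old <- unlist(lapply(calls, unname))
  new <- unlist(lapply(calls, function(cl) {
    if (is.null(names(cl))) rep("", length(cl)) else names(cl)
  }))
  vapply(unique(old), function(col) {
    # All distinct new names requested for this column.
    targets <- unique(new[old == col & new != ""])
    if (length(targets) > 1) {
      stop("inconsistent renaming of '", col, "': ",
           paste(targets, collapse = ", "))
    }
    # One rename applies everywhere; otherwise keep the original name.
    if (length(targets) == 1) targets else col
  }, character(1))
}

# X(id_x = "id") used once and X("id") used once: the rename applies to both.
ok <- resolve_renames(list(c(id_x = "id"), "id"))

# X(id_x = "id") and X(id_1 = "id") conflict, as in the call above.
bad <- try(resolve_renames(list(c(id_x = "id"), c(id_1 = "id"))), silent = TRUE)
```

Under this reading the example above would fail on X (`id_x` vs `id_1`) but pass on Y, where the bare `Y("id")` simply picks up `id_y`.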

moodymudskipper commented 5 years ago

This has to work hand in hand with the fuzzy_keep argument.

moodymudskipper commented 2 years ago

The fun now happens at https://github.com/moodymudskipper/powerjoin