Open Vincent-Maladiere opened 1 month ago
I'm not sure that exact matching is what you are looking for.
Indeed, exact-matching constraints can lead to crashes on new objects, which is a behavior I would really strive to avoid by default.
However, I can see that you want to prioritize the corresponding column in the fuzzy match in a multi-column setting. We need to find an API for this that people can easily understand (as always, I fight against adding features that are going to be used by 0.01 of users).
> I'm not sure that exact matching is what you are looking for.
I guess it is when IDs must match exactly before performing the fuzzy join, right? In a scenario where joining on an ID that is close but different would be a mistake.
Could there also be situations where this helps narrow down the nearest-neighbor search and thus reduces computation and memory? In the example you give above, we would only compute pairwise distances between loans of a given user, not of all users.
Good point!
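To put a rough number on the computation argument, here is a minimal sketch that counts candidate pairs with and without restricting the search to exactly matched groups. It is plain Python, not skrub code; the helper names and the toy `user_id` lists are made up for illustration.

```python
from collections import Counter


def n_pairs_global(left_keys, right_keys):
    """Pairs compared when every left row is compared to every right row."""
    return len(left_keys) * len(right_keys)


def n_pairs_blocked(left_keys, right_keys):
    """Pairs compared when rows are only compared within the same key."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # A Counter returns 0 for keys it has not seen, so keys present
    # on one side only contribute no pairs.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Toy user_id columns of the two loan tables (hypothetical data).
left_users = ["u1", "u1", "u2", "u3"]
right_users = ["u1", "u2", "u2", "u3"]

print(n_pairs_global(left_users, right_users))   # 4 * 4 = 16
print(n_pairs_blocked(left_users, right_users))  # 2*1 + 1*2 + 1*1 = 5
```

The gap grows with the number of users, since the global count scales with the product of the table sizes while the blocked count scales with the group sizes.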
Problem Description
Some applications call for a partially fuzzy join, meaning fuzzy joining within groups of exactly matched entities.
For instance, matching loans from two tables of users having multiple loans, when there is no `loan_id`. In this scenario, constraining the fuzzy join on loans belonging to the same users (having a `user_id`) would make sense. Within these groups, we would next perform fuzzy joining on loan prices and loan creation dates, for example.
Feature Description
We could have multiple strategies to use constraints and units that have a business meaning:
Alternative Solutions
No response
Additional Context
Fuzzy joining different columns in the same L2 space currently limits the application of `Joiner` and `fuzzy_join` to tangible use cases.
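To make the request concrete, here is a minimal sketch of the partially fuzzy join from the problem description: an exact match on `user_id`, then a nearest-neighbor match on price and creation date within each group. It is plain Python and does not use or reflect the `Joiner` or `fuzzy_join` API. The function names, the `loan` label, the toy rows, and the per-column scales are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical per-column scales with a business meaning (assumptions):
# a price gap of 100 counts as much as a creation-date gap of 7 days.
PRICE_SCALE = 100.0
DAYS_SCALE = 7.0


def distance(a, b):
    """Scaled Euclidean distance between two loans on price and date."""
    d_price = (a["price"] - b["price"]) / PRICE_SCALE
    d_days = (a["created"] - b["created"]).days / DAYS_SCALE
    return (d_price**2 + d_days**2) ** 0.5


def partially_fuzzy_join(left, right, exact_key="user_id"):
    """Pair each left row with its nearest right row sharing exact_key."""
    # Group the right table by the exactly matched key.
    blocks = {}
    for row in right:
        blocks.setdefault(row[exact_key], []).append(row)

    matches = []
    for row in left:
        candidates = blocks.get(row[exact_key], [])
        # An unseen key yields no match (None) instead of raising.
        best = min(candidates, key=lambda c: distance(row, c), default=None)
        matches.append((row, best))
    return matches


left = [
    {"user_id": 1, "price": 1000.0, "created": date(2023, 1, 1)},
    {"user_id": 1, "price": 5000.0, "created": date(2023, 6, 1)},
    {"user_id": 3, "price": 1000.0, "created": date(2023, 1, 1)},
]
right = [
    {"user_id": 1, "price": 4990.0, "created": date(2023, 6, 2), "loan": "A"},
    {"user_id": 1, "price": 1010.0, "created": date(2023, 1, 3), "loan": "B"},
    {"user_id": 2, "price": 1000.0, "created": date(2023, 1, 1), "loan": "C"},
]

matches = partially_fuzzy_join(left, right)
matched_loans = [r["loan"] if r is not None else None for _, r in matches]
print(matched_loans)  # ['B', 'A', None]
```

The third left row shows why the exact constraint matters: loan C has an identical price and date but belongs to another user, so it is never considered, and the row is left unmatched rather than causing an error.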