skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License
1.22k stars 97 forks source link

[ENH] Add exact match columns constraint on Joiner #1113

Open Vincent-Maladiere opened 1 month ago

Vincent-Maladiere commented 1 month ago

Problem Description

Some applications call for a partially fuzzy join, meaning fuzzy joining within groups of exactly matched entities.

For instance, matching loans from two tables of users having multiple loans, when there is no loan_id. In this scenario, constraining the fuzzy join on loans belonging to the same users (having a user_id) would make sense. Within these groups, we would next perform fuzzy joining on loan prices and loan creation dates, for example.

Feature Description

We could have multiple strategies to use constraints and units that have a business meaning:

  1. The user could pass a custom distance function to define the weights between loan price distance and creation date distance.
  2. The user could indicate a single column to minimize while constraining other columns to some threshold distance. For instance, minimizing the price distance while keeping the date distances within a range of one day.

Alternative Solutions

No response

Additional Context

Fuzzy joining different columns on the same l2 space currently limits the application of Joiner and fuzzy_join to tangible use cases.

GaelVaroquaux commented 1 month ago

I'm not sure that exact matching is what you are looking for.

Indeed, constraints of exact matching can lead to crashing on new objects, which is a behavior I would really strive to avoid by default.

However, I can see that you want to prioritize the corresponding column in the fuzzy match when it's in a multi-column setting. We need to find an API to do this that is easily understandable by people (as always, I fight for not adding feature that are going to be used by 0.01 of users)

Vincent-Maladiere commented 1 month ago

I'm not sure that exact matching is what you are looking for.

I guess it is when IDs must match exactly before performing fuzzy join, right? In a scenario where joining on an ID that is close but different would be a mistake.

jeromedockes commented 1 month ago

could there also be situations where this helps narrow down the nearest neighbor search and thus reduce computation & memory? in the example you give above we would only compute pairwise distances between loans of a given user, not of all users

GaelVaroquaux commented 1 month ago

Good point!