moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] A percentage threshold for array intersection #2382

Open mihir-packmoose opened 1 week ago

mihir-packmoose commented 1 week ago

Is your proposal related to a problem?

We've been working with splink to do a bit of address matching, and have identified cases where:

  1. We can't use Levenshtein distance, because it penalises reordering: when most words match but appear in a different order, the edit distance is high, whereas we want a metric that tolerates reordered words.
  2. We can't use splink.internals.comparison_level_library.ArrayIntersectLevel, because address arrays vary in length, so there's no single min_intersection value that works well across records.

Describe the solution you'd like

We've written a patch into our code that's based on splink.internals.comparison_level_library.ArrayIntersectLevel and defines a new class called ArrayIntersectPercentage that:

  1. Takes a percentage_threshold
  2. Performs the following pseudocode operation in the sql_base_dialect_sql statement:

$$\frac{size(intersection(left\_column, right\_column))}{greatest(size(left\_column), size(right\_column))} \geq percentage\_threshold$$
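As a concrete reference for the proposed metric, here is a minimal pure-Python sketch. The helper names below are illustrative only (they are not splink API); the class name `ArrayIntersectPercentage` and the `percentage_threshold` parameter come from the proposal above.

```python
def intersection_percentage(left, right):
    """Size of the intersection divided by the size of the larger array.

    Mirrors the proposed SQL:
        size(intersection(l, r)) / greatest(size(l), size(r))
    Distinct values are used on each side, matching set-style array intersection.
    """
    left_set, right_set = set(left), set(right)
    if not left_set or not right_set:
        return 0.0
    return len(left_set & right_set) / max(len(left_set), len(right_set))


def is_match(left, right, percentage_threshold):
    """The proposed comparison level: match when the ratio meets the threshold."""
    return intersection_percentage(left, right) >= percentage_threshold
```

For example, `intersection_percentage(["10", "HIGH", "ST"], ["10", "HIGH", "STREET", "LONDON"])` divides an intersection of size 2 by the larger array's size, 4, giving 0.5.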

We would like to make a pull request to this effect, adding this new class to the comparison level library that splink officially provides to users.

Describe alternatives you've considered

As mentioned above, we considered and attempted both Levenshtein distance and splink.internals.comparison_level_library.ArrayIntersectLevel for our address matching problem, but the patch we wrote is what actually solved it for us, and we think it would be a useful addition to the entity resolution toolkit.

Additional context

We're relatively new to open source, and any help in getting this feature on par with the rest of the codebase in terms of standards would be much appreciated!

RobinL commented 1 week ago

Hello!

Thanks - this is interesting. We have thought about this a bit in the past. I agree that there should be some measure that is able to account for the size of the arrays.

I'm certainly willing to consider adding a percentage intersection according to your definition, but I wanted to raise some points for discussion first:

I think the biggest challenge with using percentage intersection to categorise similarity is the fact that the arrays may be of different sizes.

For example:

[A,B] vs [A,B,C,D]

and

[A,B, E, F] vs [A,B,C,D]

both have 50% intersection, but in many cases, I think we might say that the first example is more similar.
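A throwaway check (plain Python, not splink code) confirms the two pairs score identically under the proposed metric:

```python
def pct(left, right):
    """Proposed metric: |intersection| / size of the larger array."""
    l, r = set(left), set(right)
    return len(l & r) / max(len(l), len(r))

# Both pairs score 50%, yet the first has no explicitly disagreeing values,
# while the second has two values on each side that actively differ.
print(pct(["A", "B"], ["A", "B", "C", "D"]))            # 0.5
print(pct(["A", "B", "E", "F"], ["A", "B", "C", "D"]))  # 0.5
```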

This is somewhat analogous to the fact that, in probabilistic models, NULLs are generally treated as carrying no information, whereas explicit disagreement usually gets negative weight. The percentage intersection level treats the two as equivalent.

A second issue is as follows:

[A] vs [A,B]
[A,B,C,D,E,F] vs [A,B,C]

Arguably the second example is a much 'bigger' match (e.g. if the array values were postcodes, a three-element intersection would be much less likely to occur by chance).
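Numerically (again just an illustrative check), the percentage metric actually scores these two pairs identically, even though the raw intersection sizes, and hence the weight of evidence, differ:

```python
pairs = [
    (["A"], ["A", "B"]),
    (["A", "B", "C", "D", "E", "F"], ["A", "B", "C"]),
]
for left, right in pairs:
    l, r = set(left), set(right)
    size = len(l & r)
    pct = size / max(len(l), len(r))
    # 1 0.5, then 3 0.5: same percentage, very different amounts of evidence
    print(size, pct)
```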

Another aspect is that as the arrays grow, the percentage intersection may also grow purely by chance. This is especially the case if the field is either low cardinality (a small number of discrete values) or highly skewed (e.g. name, where some names like John and David appear a lot, so once you have several values in the array it becomes very likely that a few match, birthday-paradox style).
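The birthday-paradox effect is easy to quantify in the simplest setting. For two uniformly random size-n subsets of a k-value universe, the probability they intersect at all is 1 - C(k-n, n)/C(k, n), and it climbs quickly with n; skewed value frequencies only make this worse, so uniform is the conservative case. A quick sketch (illustrative, with a deliberately birthday-sized universe of 365 values):

```python
from math import comb

def p_any_overlap(k, n):
    """P(two uniform random size-n subsets of a size-k universe intersect).

    P(no overlap) = C(k - n, n) / C(k, n), so overlap is 1 minus that.
    """
    return 1 - comb(k - n, n) / comb(k, n)

# Chance-driven overlap grows rapidly with array size:
for n in (1, 5, 10, 20):
    print(n, round(p_any_overlap(365, n), 3))
```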

None of this is to say we shouldn't include a percentage threshold; I just thought I'd highlight these points to see whether you had further ideas. Perhaps there is a better or more nuanced way of defining it that gets at what you're trying to achieve with higher accuracy (i.e. captures similarity between arrays with greater nuance).

I'm not sure I have a better solution, but consider something like:
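The concrete suggestion appears to have been cut off here. Judging from the follow-up comments, which mention a 'subset idea' and an 'intersection-with-difference idea', one hypothetical reading is sketched below. This is purely a reconstruction for illustration, not splink API and not necessarily the original suggestion:

```python
def is_subset_match(left, right):
    """'Subset' idea: all of one array's distinct values appear in the other."""
    l, r = set(left), set(right)
    return l <= r or r <= l

def intersection_minus_difference(left, right):
    """'Intersection with difference' idea: reward shared values and
    penalise values that explicitly disagree (present on one side only)."""
    l, r = set(left), set(right)
    return len(l & r) - len(l ^ r)
```

Note that a score like this would distinguish the two 50%-intersection examples above: `["A","B"]` vs `["A","B","C","D"]` scores 0, while `["A","B","E","F"]` vs `["A","B","C","D"]` scores -2.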

Any thoughts?

mihir-packmoose commented 1 week ago

Hey, Robin, thanks for dropping in! We had an internal discussion about this in our team meeting. We believe that the subset idea will probably help us solve our current problem, and the intersection-with-difference idea also shows good potential as a fallback. We threw together some code and are now testing it to see how it fares.

Thanks again for your insight in solving this problem! I'm setting my PR back into a draft stage while we work this out, and we'll keep you posted about it.

mihir-packmoose commented 1 week ago

Hey, Robin, we just ended up testing these locally and found that exact subsets work really nicely for our use case. Could you take a look at the changes I've just committed and let me know about next steps?