moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/

[FEAT] Interaction term between two correlated comparisons #2413

Open V-Lamp opened 1 month ago

V-Lamp commented 1 month ago

Is your proposal related to a problem?

Fields such as postcode and city can be highly correlated, so treating them as two independent comparisons will overestimate the match score when both agree and underestimate it when only one does.
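To make the overestimation concrete, here is a toy numeric sketch (all probabilities are assumed for illustration, not taken from any real model):

```python
# Assumed m/u probabilities, for illustration only
m_pc, u_pc = 0.90, 0.01      # P(postcode agrees | match), P(postcode agrees | non-match)
m_city, u_city = 0.95, 0.10  # same for city

# Treating the comparisons as independent multiplies their Bayes factors:
bf_independent = (m_pc / u_pc) * (m_city / u_city)  # 90 * 9.5 = 855

# But if a postcode match almost always implies a city match, the city
# comparison adds little extra evidence, so the joint Bayes factor is
# closer to the postcode one alone:
bf_dependent = m_pc / u_pc  # ~90

print(bf_independent, bf_dependent)  # 855.0 vs 90.0 -> roughly a 9.5x overestimate
```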

Describe the solution you'd like

Some mechanism to score the interaction between two comparisons (usually as a negative adjustment, analogous to term frequency adjustments).

Describe alternatives you've considered

So far I have put city as a lower comparison level beneath postcode, but I expect this problem of correlated comparisons to be more general. The ordering of levels is also very sensitive to the precision of postcodes (e.g. a UK postcode vs a 5-digit US zip code). An interaction term would make the model less "hand-tuned", since it would remove the need for manual ordering of levels.

Additional context

Creating an interaction term is a common way of dealing with correlation in machine learning: if x1 and x2 are correlated, you can add a term x1*x2 to your model, e.g. `y = a*x1 + b*x2 + c*x1*x2 + d`

RobinL commented 1 month ago

I think you can do this already using this kind of syntax:

```python
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000

df["postcode"] = df["email"].str.slice(0, 3)

# Define custom comparison for postcode and city
postcode_city_comparison = cl.CustomComparison(
    output_column_name="postcode_city",
    comparison_levels=[
        cll.And(cll.NullLevel("postcode"), cll.NullLevel("postcode")),
        {
            "sql_condition": "postcode_l = postcode_r AND city_l = city_r",
            "label_for_charts": "Exact match on both postcode and city",
        },
        cll.ExactMatchLevel("postcode").configure(label_for_charts="Different city, exact match on postcode"),
        cll.ExactMatchLevel("city").configure(label_for_charts="Different postcode, exact match on city"),
        cll.ElseLevel(),
    ],
)

# Define settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        postcode_city_comparison,
    ],
    max_iterations=5,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

linker.visualisations.match_weights_chart()
```

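As a possible follow-up (not part of the snippet above), the trained model can then be used to score candidate pairs:

```python
# Score pairs above a chosen match probability threshold and inspect a sample
df_predict = linker.inference.predict(threshold_match_probability=0.9)
print(df_predict.as_pandas_dataframe(limit=5))
```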
zmbc commented 1 month ago

@RobinL's solution is equivalent to method 1 in S4 of the appendix of the fastLink paper, and works great for some use cases, but it doesn't allow e.g. specifying the interactions c1×c2 and c2×c3 without also including c1×c2×c3. For that, the second method in that appendix, based on log-linear models, is the solution; there is a discussion of adding it to Splink here: https://github.com/moj-analytical-services/splink/discussions/1310
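To illustrate the distinction, here is a sketch of the log-linear form (the notation is mine, not the fastLink paper's exact symbols): pairwise interaction terms for (1,2) and (2,3) can be included while the three-way term is omitted:

```latex
% Log-linear model over agreement indicators gamma_1, gamma_2, gamma_3:
% the pairwise interactions beta_12 and beta_23 are present,
% but there is no three-way term beta_123.
\log \Pr(\gamma_1, \gamma_2, \gamma_3)
  = \alpha
  + \beta_1 \gamma_1 + \beta_2 \gamma_2 + \beta_3 \gamma_3
  + \beta_{12}\,\gamma_1\gamma_2
  + \beta_{23}\,\gamma_2\gamma_3
```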

V-Lamp commented 3 weeks ago

Thank you for the response about using AND; I think I can incorporate it. However, one complexity that usually arises in practice is that comparisons typically have more than one comparison level.

For example:

```
postcode_comparison = [postcode_exact_match, postcode_area_match, postcode_sector_match]
address_comparison = [address_exact_match, street_name_match, address_fuzzy_match]
location_comparison = ???
```

I would need to make 9 levels (3 × 3) with AND, plus the other 3 + 3 single-field levels, to define a location_comparison (so 3 × 3 + 3 + 3 = 15 levels in total). The challenge then is finding the right order for these 15 comparison levels, since ordering has a very significant effect. In my case I actually have more than 3 levels per comparison, more like 6-8.
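For what it's worth, the 3 × 3 AND levels could at least be generated programmatically rather than written by hand. A hypothetical sketch (the sub-level labels and SQL conditions below are illustrative, not real Splink API beyond the level-dict format RobinL used above):

```python
import itertools

# Hypothetical (label, SQL condition) pairs for each field's sub-levels
postcode_levels = [
    ("exact postcode", "postcode_l = postcode_r"),
    ("postcode sector", "left(postcode_l, 5) = left(postcode_r, 5)"),
    ("postcode area", "left(postcode_l, 2) = left(postcode_r, 2)"),
]
address_levels = [
    ("exact address", "address_l = address_r"),
    ("street name", "street_l = street_r"),
    ("fuzzy address", "jaro_winkler_similarity(address_l, address_r) > 0.9"),
]

# Cross-product of the two sets of levels, in the dict format Splink accepts
combined_levels = [
    {
        "sql_condition": f"({p_sql}) AND ({a_sql})",
        "label_for_charts": f"{p_label} AND {a_label}",
    }
    for (p_label, p_sql), (a_label, a_sql) in itertools.product(
        postcode_levels, address_levels
    )
]
# This removes the boilerplate, but the ordering of these 9 levels (plus the
# 3 + 3 single-field fallbacks) still has to be chosen by hand.
```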

Have you found yourself in this combinatorial explosion and then further ordering problem?

RobinL commented 3 weeks ago

Yeah - you're right to highlight these challenges. It's typically best to order levels 'better matches higher': start with the most precise matches and work your way down. Although I appreciate it's not always obvious in practice; you may need some trial and error.

I agree that the combinatorial explosion problem is real, but on a large dataset having (say) 9 comparison levels is totally fine. Ultimately, each level corresponds to two parameters to estimate (an m and a u probability), so 18 parameters is not very many at all (compared to, say, other ML approaches, which can have thousands).

In our production models we tend to have between about 2 and 10 comparison levels per comparison.