moj-analytical-services / splink_demos

Interactive notebooks containing demonstration code of the splink library
38 stars 27 forks source link

Using splink with multiple email addresses being matched to one #6

Closed samtazzyman closed 2 years ago

samtazzyman commented 4 years ago

Theo suggested that I write this in here.

I've got a list of email addresses (from our Auth0 logs) associated with users. This list has up to 5 email addresses per user, with a varying number per user.

I've also got a list of email addresses associated with SOP. This list has one email address per user.

I would like to link up the two. Ideally I'd like to be able to score the match probability based on the best possible match (I think, I'm open to alternative suggestions).

What I have in fact done is replicate the SOP email address 5 times, so the two tables look like:

Auth0 Each row has 5 fields for email_person (the bit of the email address before the '@'), entitled email_person_0, email_person_1, email_person_2, email_person_3, and email_person_4. Some of these fields are NULL. However, the NULLS work in such a way that if email_person_i is NULL, then so are all email_person_j values for j > i. Which makes things easier

SOP Each row has 5 fields for email_person (the bit of the email address before the '@'), entitled email_person_0, email_person_1, email_person_2, email_person_3, and email_person_4. They are all identical, and none of them are NULL.

I then join the two using splink.

I've done this by defining a few functions:

def one_min_cos_dist_over_thresh(col1, col2, thresh):
    return(f"""((1-cosine_distance(QgramTokeniser({col1}), QgramTokeniser({col2}))) > {thresh})""")

def is_null(col1, col2):
    return(f"""({col1} is null or {col2} is null)""")

def email_person_case_5(t1, t2):
    first_email_l = "email_person_0_l"
    first_email_r = "email_person_0_r"
    second_email_l = "email_person_1_l"
    second_email_r = "email_person_1_r"
    third_email_l = "email_person_2_l"
    third_email_r = "email_person_2_r"
    fourth_email_l = "email_person_3_l"
    fourth_email_r = "email_person_3_r"
    fifth_email_l = "email_person_4_l"
    fifth_email_r = "email_person_4_r"

    return(f"""
CASE WHEN {is_null(first_email_l, first_email_r)} THEN -1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t2)} THEN 2
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {is_null(second_email_l, second_email_r)} THEN 1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t2)} THEN 2
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {is_null(third_email_l, third_email_r)} THEN 1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {is_null(fourth_email_l, fourth_email_r)} THEN 1 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {is_null(fifth_email_l, fifth_email_r)} THEN 1 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {one_min_cos_dist_over_thresh(fifth_email_l, fifth_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {one_min_cos_dist_over_thresh(fifth_email_l, fifth_email_r, t1)} THEN 1 
     ELSE 0 END  AS gamma_email_person
""")

I don't pretend these to be the best possibilities - I am intending this as a proof of concept and haven't thought about how to tidy it up and make it more efficient and cleaner.

Then

settings = {
    "link_type": "link_only", 
    "max_iterations": 15,
    "blocking_rules": [
    ],
    "comparison_columns": [
        {
            "custom_name": "email_person",
            "custom_columns_used": ["email_person_0", "email_person_1", "email_person_2", "email_person_3", "email_person_4"],
            "num_levels": 3,
            "case_expression": email_person_case_5(0.74, 0.8)
        }
    ],
    "proportion_of_matches": prop_matches
}

and

from splink import Splink

linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()

And Bob's your uncle.

samtazzyman commented 4 years ago

Update: I have actually had more joy (I think) when doing 5 distinct comparisons:

def cos_qgram(col):
    return(f"""
case when {col}_l is null or {col}_r is null then -1 
     when (1- cosine_distance(QgramTokeniser({col}_l), QgramTokeniser({col}_r))) > 0.84 then 2 
     when (1-cosine_distance(QgramTokeniser({col}_l), QgramTokeniser({col}_r))) > 0.70 then 1 
     else 0 end  as gamma_{col}          

""")

settings = {
    "link_type": "link_only", 
    "max_iterations": 30,
    "comparison_columns": [
        {
            "col_name": "email_person_0",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_0")
        },
        {
            "col_name": "email_person_1",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_1")
        },
        {
            "col_name": "email_person_2",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_2")
        },
        {
            "col_name": "email_person_3",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_3")
        },
        {
            "col_name": "email_person_4",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_4")
        }
    ],
    "proportion_of_matches": prop_matches,
    "retain_intermediate_calculation_columns": False
}

I think this makes sense because doing them all in one go is sort of throwing information away.