Using splink with multiple email addresses being matched to one

Theo suggested that I write this in here.

I've got a list of email addresses (from our Auth0 logs) associated with users. This list has up to 5 email addresses per user, with a varying number per user.

I've also got a list of email addresses associated with SOP. This list has one email address per user.

I would like to link up the two. Ideally I'd like to be able to score the match probability based on the best possible match (I think, I'm open to alternative suggestions).

What I have in fact done is replicate the SOP email address 5 times, so the two tables look like:

Auth0 Each row has 5 fields for email_person (the bit of the email address before the '@'), entitled email_person_0, email_person_1, email_person_2, email_person_3, and email_person_4. Some of these fields are NULL. However, the NULLS work in such a way that if email_person_i is NULL, then so are all email_person_j values for j > i. Which makes things easier

SOP Each row has 5 fields for email_person (the bit of the email address before the '@'), entitled email_person_0, email_person_1, email_person_2, email_person_3, and email_person_4. They are all identical, and none of them are NULL.

I then join the two using splink.

I've done this by defining a few functions:

def one_min_cos_dist_over_thresh(col1, col2, thresh):
    return(f"""((1-cosine_distance(QgramTokeniser({col1}), QgramTokeniser({col2}))) > {thresh})""")

def is_null(col1, col2):
    return(f"""({col1} is null or {col2} is null)""")

def email_person_case_5(t1, t2):
    first_email_l = "email_person_0_l"
    first_email_r = "email_person_0_r"
    second_email_l = "email_person_1_l"
    second_email_r = "email_person_1_r"
    third_email_l = "email_person_2_l"
    third_email_r = "email_person_2_r"
    fourth_email_l = "email_person_3_l"
    fourth_email_r = "email_person_3_r"
    fifth_email_l = "email_person_4_l"
    fifth_email_r = "email_person_4_r"

    return(f"""
CASE WHEN {is_null(first_email_l, first_email_r)} THEN -1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t2)} THEN 2
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {is_null(second_email_l, second_email_r)} THEN 1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t2)} THEN 2
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {is_null(third_email_l, third_email_r)} THEN 1
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {is_null(fourth_email_l, fourth_email_r)} THEN 1 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {is_null(fifth_email_l, fifth_email_r)} THEN 1 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {one_min_cos_dist_over_thresh(fifth_email_l, fifth_email_r, t2)} THEN 2 
     WHEN {one_min_cos_dist_over_thresh(first_email_l, first_email_r, t1)} AND {one_min_cos_dist_over_thresh(second_email_l, second_email_r, t1)} AND {one_min_cos_dist_over_thresh(third_email_l, third_email_r, t1)} AND {one_min_cos_dist_over_thresh(fourth_email_l, fourth_email_r, t1)} AND {one_min_cos_dist_over_thresh(fifth_email_l, fifth_email_r, t1)} THEN 1 
     ELSE 0 END  AS gamma_email_person
""")

I don't pretend these to be the best possibilities - I am intending this as a proof of concept and haven't thought about how to tidy it up and make it more efficient and cleaner.

Then

settings = {
    "link_type": "link_only", 
    "max_iterations": 15,
    "blocking_rules": [
    ],
    "comparison_columns": [
        {
            "custom_name": "email_person",
            "custom_columns_used": ["email_person_0", "email_person_1", "email_person_2", "email_person_3", "email_person_4"],
            "num_levels": 3,
            "case_expression": email_person_case_5(0.74, 0.8)
        }
    ],
    "proportion_of_matches": prop_matches
}

and

from splink import Splink

linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()

And Bob's your uncle.

moj-analytical-services / splink_demos

Using splink with multiple email addresses being matched to one #6