Closed samtazzyman closed 2 years ago
Update: I have actually had more joy (I think) when doing 5 distinct comparisons:
def cos_qgram(col):
    return(f"""
    case when {col}_l is null or {col}_r is null then -1
    when (1 - cosine_distance(QgramTokeniser({col}_l), QgramTokeniser({col}_r))) > 0.84 then 2
    when (1 - cosine_distance(QgramTokeniser({col}_l), QgramTokeniser({col}_r))) > 0.70 then 1
    else 0 end as gamma_{col}
    """)
settings = {
    "link_type": "link_only",
    "max_iterations": 30,
    "comparison_columns": [
        {
            "col_name": "email_person_0",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_0")
        },
        {
            "col_name": "email_person_1",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_1")
        },
        {
            "col_name": "email_person_2",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_2")
        },
        {
            "col_name": "email_person_3",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_3")
        },
        {
            "col_name": "email_person_4",
            "num_levels": 3,
            "case_expression": cos_qgram("email_person_4")
        }
    ],
    "proportion_of_matches": prop_matches,
    "retain_intermediate_calculation_columns": False
}
I think this makes sense because doing them all in one go is sort of throwing information away.
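For anyone wanting to sanity-check the 0.84 and 0.70 thresholds outside Spark, here is a rough pure-Python sketch of a q-gram cosine similarity and the three-level banding used in the case expression. It is only an approximation: the q-gram size of 2 and the plain character-count vectors are assumptions on my part, and the function names are made up for illustration, so it may not give exactly the same numbers as the `QgramTokeniser` and `cosine_distance` UDFs.

```python
import math
from collections import Counter


def qgrams(text, q=2):
    """Return the overlapping character q-grams of text (q=2 is an assumption)."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def cosine_similarity(a, b, q=2):
    """Cosine similarity between the q-gram count vectors of two strings."""
    va, vb = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # Strings shorter than q have no q-grams, so treat them as dissimilar.
    return dot / norm if norm else 0.0


def gamma_level(left, right):
    """Mirror the banding in the case expression: -1 for nulls, else 0, 1 or 2."""
    if left is None or right is None:
        return -1
    sim = cosine_similarity(left, right)
    if sim > 0.84:
        return 2
    if sim > 0.70:
        return 1
    return 0
```

Calling `gamma_level` on a few known pairs from your own data is a quick way to see which band typical typos and abbreviations fall into before committing to the thresholds.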
Theo suggested that I write this in here.
I've got a list of email addresses (from our Auth0 logs) associated with users. This list has up to 5 email addresses per user, with a varying number per user.
I've also got a list of email addresses associated with SOP. This list has one email address per user.
I would like to link up the two. Ideally I'd like to be able to score the match probability based on the best possible match (I think, I'm open to alternative suggestions).
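To make "best possible match" concrete, here is a small sketch of the idea outside Splink: compare the single SOP address against each non-null Auth0 address and keep the highest score. The similarity function is a stand-in from Python's `difflib`, not anything Splink uses, and `best_match_score` is a hypothetical helper name. Note that this gives a raw similarity, not a match probability.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not the measure Splink uses."""
    return SequenceMatcher(None, a, b).ratio()


def best_match_score(candidates, target):
    """Highest similarity between target and any non-null candidate.

    Returns None when every candidate is null.
    """
    scores = [similarity(c, target) for c in candidates if c is not None]
    return max(scores) if scores else None
```

For example, `best_match_score(["jsmith", "john.smith", None, None, None], "john.smith")` picks up the exact match in the second slot regardless of where it sits in the list.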
What I have in fact done is replicate the SOP email address 5 times, so the two tables look like:
- Auth0: Each row has 5 fields for `email_person` (the bit of the email address before the '@'), entitled `email_person_0`, `email_person_1`, `email_person_2`, `email_person_3`, and `email_person_4`. Some of these fields are NULL. However, the NULLs work in such a way that if `email_person_i` is NULL, then so are all `email_person_j` values for `j > i`. Which makes things easier.
- SOP: Each row has 5 fields for `email_person` (the bit of the email address before the '@'), entitled `email_person_0`, `email_person_1`, `email_person_2`, `email_person_3`, and `email_person_4`. They are all identical, and none of them are NULL.

I then join the two using splink.
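As a rough illustration of the two row layouts described above, here is a plain-Python sketch (dicts rather than Spark DataFrames). These are not the functions from this post; `person_part`, `widen_auth0` and `widen_sop` are hypothetical names, and the padding is done at the end so that a NULL in slot `i` implies NULLs in every later slot.

```python
def person_part(email):
    """Return the part of an email address before the '@'."""
    return email.split("@", 1)[0]


def widen_auth0(emails, width=5):
    """Spread up to `width` addresses across email_person_0..N, padding with None."""
    people = [person_part(e) for e in emails[:width]]
    people += [None] * (width - len(people))
    return {f"email_person_{i}": p for i, p in enumerate(people)}


def widen_sop(email, width=5):
    """Repeat the single SOP address across every email_person_i column."""
    person = person_part(email)
    return {f"email_person_{i}": person for i in range(width)}
```

Each returned dict corresponds to one row; in practice the same reshaping would be done with DataFrame operations before handing the tables to splink.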
I've done this by defining a few functions:
I don't pretend these are the best options - this is intended as a proof of concept, and I haven't thought about how to tidy it up or make it more efficient.
Then
and
And Bob's your uncle.