RobinL opened this issue 3 years ago
The key assumption we're testing for is stated here:
> Second, the conditional independence among linkage variables is assumed given the match status.
How can we test for this?
If we have match status (labelled data), we can run OLS regressions like this:
```
gamma_col_1 = f(intercept, gamma_col_2, gamma_col_3, ...)
gamma_col_2 = f(intercept, gamma_col_1, gamma_col_3, ...)
...
```
We have to run these twice: once on the matching records and once on the non-matching records, since the independence assumption is conditional on match status.
And we expect these regressions to have r^2 of 0 if the assumption holds.
This is the concept of the Variance Inflation Factor, which is measured as `vif = 1.0 / (1.0 - r_sq)`, and which we expect to be 1.00 if the assumption holds.
Of course, we cannot do this because we do not have labelled data.
Can we instead run the VIF on ALL data (matches AND non-matches) and expect it to be a useful proxy measure of _how badly_ the assumption of conditional independence among linkage variables given the match status is broken?
First, note that we should not expect it to equal one if we run it on all data. Intuitively, we fully expect some correlation: for example, we expect more matches on surname when there is a match on first name.
But what if we see a very high value? Is this indicative of a problem?
One degenerate form of the model is where we include two variables which measure exactly the same thing, e.g. two columns containing the same measure of height, one in metres and one in centimetres. In this case, the r_sq of the regression would be 1.0 (whether we run the regressions on ALL data, or on matches and non-matches separately).
Another case in which we expect to see very high values is where we have a very high proportion of non-matches (e.g. a case where no blocking is used). If the vast majority of records have 0s in the gamma columns, then any one gamma column can be predicted from the others with high accuracy.
Further thoughts:
This helps clarify in my head what sort of correlations are bad; I could never articulate this before. Previously I always thought 'well, it feels obvious that in our data, if you see a match on first name, you're more likely to see a match on surname'. So does this break the independence assumption? (If this were not true, the model wouldn't work at all!)
But we're not worried about this. We're worried about cases where the two variables are measuring the 'same' agreement or disagreement.
Suppose it's common for names to be correlated, e.g. that people called 'mark' are more likely to have the middle name 'anthony'. If we looked at non-matches, this would mean a match on first name made a match on middle name more likely.
Example of high VIF using data that meets the conditional independence assumption:
import pandas as pd
import numpy as np
gamma_settings_4 = {
"link_type": "dedupe_only",
"proportion_of_matches":0.9,
"comparison_columns": [
{
"col_name": "col_2_levels",
"num_levels": 2
},
{
"col_name": "col_5_levels",
"num_levels": 2
}
],
"blocking_rules": []
}
col_names = [c["col_name"] for c in gamma_settings_4["comparison_columns"]]
## Create df gammas for non-matches
probs = [
0.01,
0.01,
] # Amongst non-matches, gamma_0 agrees 5% of the time, gamma_1 agrees 20% of the time etc
iprobs = [1 / p for p in probs]
df_nm = None
for index, num_options in enumerate(iprobs):
col_name = col_names[index]
n = int(num_options)
df_nm_new = pd.DataFrame(
{f"gamma_{col_name}": [0] * (n - 1) + [1], "join_col": [1] * n}
) # Creates n rec
if df_nm is not None:
df_nm = df_nm.merge(df_nm_new, left_on="join_col", right_on="join_col")
else:
df_nm = df_nm_new
df_nm = df_nm.drop("join_col", axis=1)
df_nm["true_match"] = 0
## Create df gammas for non-matches
probs = [
0.01,
0.01
] # Amongst matches, gamma_0 DISAGREES 5% of the time, gamma_1 DISAGREES 10% of the time etc
iprobs = [1 / p for p in probs]
df_m = None
for index, num_options in enumerate(iprobs):
col_name = col_names[index]
n = int(num_options)
df_m_new = pd.DataFrame(
{f"gamma_{col_name}": [1] * (n - 1) + [0], "join_col": [1] * n}
)
if df_m is not None:
df_m = df_m.merge(df_m_new, left_on="join_col", right_on="join_col")
else:
df_m = df_m_new
df_m = df_m.drop("join_col", axis=1)
df_m["true_match"] = 1
df_all = pd.concat([df_nm, df_m])
df_all = df_all.reset_index()
import statsmodels.api as sm
# f1 = df_all['true_match'] == 1
# df_all_m = df_all[f1]
df_all_m = df_all
y = df_all_m['gamma_col_2_levels']
X = df_all_m[['gamma_col_5_levels']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
results.params
r_sq = results.rsquared
vif = 1.0 / (1.0 - r_sq)
vif
Is it possible that correlation amongst non-matching records is more of a problem than correlation amongst matching?
Generally positive evidence in favour of a match is driven by the u values, and what we are most concerned about is double counting.
If we have high multicollinearity, the Fellegi Sunter model will estimate probabilities which are 'too certain' (too close to 0 or 1) because it assumes independence.
Could we use the variance inflation factor as a measure of multicollinearity?