moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Need a diagnostic that helps users understand correlation in input columns #131

Open RobinL opened 3 years ago

RobinL commented 3 years ago

If we have high multicollinearity, the Fellegi Sunter model will estimate probabilities which are 'too certain' (too close to 0 or 1) because it assumes independence.
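
As a rough sketch of that effect (toy numbers, not splink's actual calculation): under the independence assumption the Bayes factors for each comparison column are multiplied together, so if two columns really measure the same thing, the same piece of evidence is counted twice and the match probability is pushed much closer to 1 than it should be.

```python
# Toy illustration of double counting under the independence assumption
prior_odds = 0.1 / 0.9  # e.g. proportion_of_matches = 0.1
bayes_factor = 20       # m/u for agreement on a single column


def posterior_probability(odds):
    return odds / (1 + odds)


print(posterior_probability(prior_odds * bayes_factor))       # evidence counted once: ~0.69
print(posterior_probability(prior_odds * bayes_factor ** 2))  # same evidence counted twice: ~0.98
```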

Could we use the variance inflation factor as a measure of multicollinearity?

RobinL commented 3 years ago

The key assumption we're testing for is stated here:

> Second, the conditional independence among linkage variables is assumed given the match status.

How can we test for this?

If we have match status (labelled data), we can run OLS regressions like this:

gamma_col_1 = f(intercept, gamma_col_2, gamma_col_3,...)
gamma_col_2 = f(intercept, gamma_col_1, gamma_col_3,...)

We have to run these twice: once amongst the matches and once amongst the non-matches, because the assumption is conditional on match status.

We expect these regressions to have an r^2 of 0 if the assumption holds.

This is the concept of the Variance Inflation Factor, which is measured as `vif = 1.0 / (1.0 - r_sq)` and which we expect to be 1.00 if the assumption holds.
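
If we did have labels, the calculation might look something like this (a minimal sketch; the dataframe and column names are illustrative, not splink outputs):

```python
import pandas as pd
import statsmodels.api as sm

# Toy labelled comparison data: gamma columns plus a true_match label
df_gammas = pd.DataFrame(
    {
        "gamma_col_1": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
        "gamma_col_2": [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
        "true_match": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    }
)


def vif(df, target, predictors):
    """Regress one gamma column on the others and return 1 / (1 - r_sq)."""
    X = sm.add_constant(df[predictors])
    r_sq = sm.OLS(df[target], X).fit().rsquared
    return 1.0 / (1.0 - r_sq)


# Run the regressions separately amongst the matches and the non-matches
for match_status in (0, 1):
    subset = df_gammas[df_gammas["true_match"] == match_status]
    print(match_status, vif(subset, "gamma_col_1", ["gamma_col_2"]))
    print(match_status, vif(subset, "gamma_col_2", ["gamma_col_1"]))
```

(statsmodels also has a `variance_inflation_factor` helper in `statsmodels.stats.outliers_influence` that computes the same quantity.)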

Of course, we cannot do this because we do not have labelled data.

Can we instead run the VIF on ALL data (matches AND non-matches) and expect it to be a useful proxy measure of _how badly_ the assumption of conditional independence among linkage variables given the match status is broken?

First, note that we should not expect it to be equal to one if we run it on all data. Intuitively, we fully expect some correlation. For example, we expect more matches on surname if there's a match on first name.

But what if we see a very high value? Is this indicative of a problem?

One degenerate form of the model is where we put in two variables which measure exactly the same thing, e.g. two columns containing the same measure of height, one in meters and one in cm. In this case, the r_sq of the regression would be 1.0 (in BOTH the case of running the regression on ALL data, and the case of running it on matches and non-matches separately).
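
A minimal sketch of that degenerate case (illustrative column names): the two gamma columns are identical, so the regression fits perfectly.

```python
import pandas as pd
import statsmodels.api as sm

# Two comparison columns derived from the same underlying measurement
# (height in meters and in cm) produce identical gamma values
df = pd.DataFrame(
    {
        "gamma_height_m": [0, 1, 0, 1, 1, 0],
        "gamma_height_cm": [0, 1, 0, 1, 1, 0],
    }
)

X = sm.add_constant(df[["gamma_height_cm"]])
r_sq = sm.OLS(df["gamma_height_m"], X).fit().rsquared
print(r_sq)  # effectively 1.0, so vif = 1 / (1 - r_sq) blows up
```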

Another case in which we expect to see very high values is where we have a very high proportion of non-matches (e.g. a case where no blocking is used). If the vast majority of records have 0s in the gamma columns, then you can predict any one gamma column from the others with high accuracy.

RobinL commented 3 years ago

Further thoughts:

This helps clarify in my head what sort of correlations are bad; I could never articulate this before. Previously I always thought 'well, it feels obvious that in our data, if you see a match on first name, you're more likely to see a match on surname'. So does that break the independence assumption? (If this were not true, then the model wouldn't work at all!)

But we're not worried about this. We're worried about cases where the two variables are measuring the 'same' agreement or disagreement.

Suppose that it's common for names to be correlated, e.g. that people who are called 'Mark' are more likely to have the middle name 'Anthony'. If we looked at non-matches, this would mean a match on first name was more likely to imply a match on middle name.
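
A quick synthetic sketch of that situation (illustrative probabilities and column names): because agreement on first name and middle name is correlated within the non-matches themselves, the regression picks up a non-trivial r_sq even though there are no matches in the data at all.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000

# Amongst non-matches, a chance agreement on first name makes a chance agreement
# on middle name much more likely (lots of 'Mark Anthony's in the data)
gamma_first = rng.binomial(1, 0.05, n)
gamma_middle = rng.binomial(1, np.where(gamma_first == 1, 0.5, 0.02))

df_nm = pd.DataFrame(
    {"gamma_first_name": gamma_first, "gamma_middle_name": gamma_middle}
)

X = sm.add_constant(df_nm[["gamma_middle_name"]])
r_sq = sm.OLS(df_nm["gamma_first_name"], X).fit().rsquared
print(r_sq, 1.0 / (1.0 - r_sq))  # r_sq clearly above 0, so VIF above 1
```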

RobinL commented 3 years ago

Example of high VIF using data that meets the conditional independence assumption:

import pandas as pd 
import numpy as np

gamma_settings_4 = {
    "link_type": "dedupe_only",
    "proportion_of_matches":0.9,
    "comparison_columns": [
        {
            "col_name": "col_2_levels",
            "num_levels": 2
        },
        {
            "col_name": "col_5_levels",
            "num_levels": 2
        }
    ],
    "blocking_rules": []
}

col_names = [c["col_name"] for c in gamma_settings_4["comparison_columns"]]
## Create df gammas for non-matches
probs = [
    0.01,
    0.01,
]  # Amongst non-matches, each gamma column agrees 1% of the time
iprobs = [1 / p for p in probs]

df_nm = None

for index, num_options in enumerate(iprobs):
    col_name = col_names[index]
    n = int(num_options)
    df_nm_new = pd.DataFrame(
        {f"gamma_{col_name}": [0] * (n - 1) + [1], "join_col": [1] * n}
    )  # Creates n rows: n - 1 with gamma = 0 and one with gamma = 1, plus a constant join_col for the cross join
    if df_nm is not None:
        df_nm = df_nm.merge(df_nm_new, left_on="join_col", right_on="join_col")

    else:
        df_nm = df_nm_new
df_nm = df_nm.drop("join_col", axis=1)
df_nm["true_match"] = 0

## Create df gammas for matches
probs = [
    0.01,
    0.01
]  # Amongst matches, each gamma column DISAGREES 1% of the time
iprobs = [1 / p for p in probs]

df_m = None
for index, num_options in enumerate(iprobs):
    col_name = col_names[index]
    n = int(num_options)
    df_m_new = pd.DataFrame(
        {f"gamma_{col_name}": [1] * (n - 1) + [0], "join_col": [1] * n}
    )
    if df_m is not None:
        df_m = df_m.merge(df_m_new, left_on="join_col", right_on="join_col")

    else:
        df_m = df_m_new
df_m = df_m.drop("join_col", axis=1)
df_m["true_match"] = 1

df_all = pd.concat([df_nm, df_m])
df_all = df_all.reset_index()

import statsmodels.api as sm

# Uncomment the two lines below to run the regression amongst matches only,
# rather than on all the data
# f1 = df_all['true_match'] == 1
# df_all_m = df_all[f1]
df_all_m = df_all

y = df_all_m['gamma_col_2_levels']

X = df_all_m[['gamma_col_5_levels']]
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

print(results.params)

r_sq = results.rsquared
vif = 1.0 / (1.0 - r_sq)
print(vif)
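
To spell out what this shows: run on all 20,000 rows, the regression picks up a large r_sq (and hence a VIF well above 1) purely because both gamma columns track match status, even though within the matches and within the non-matches they were generated completely independently. Uncommenting the true_match filter above, so the regression runs within a single match status, should bring the r_sq back to roughly 0 and the VIF back to roughly 1.
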
RobinL commented 2 years ago

Is it possible that correlation amongst non-matching records is more of a problem than correlation amongst matching records?

Generally, positive evidence in favour of a match is driven by the u values, and what we are most concerned about is double counting.