EM silently ignores salting

moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

https://moj-analytical-services.github.io/splink/

MIT License

1.37k stars 150 forks source link

EM silently ignores salting #1817

Closed zmbc closed 10 months ago

zmbc commented 11 months ago

What happens?

The estimate_parameters_using_expectation_maximisation method takes a blocking rule, but if you pass in a blocking rule that includes "salting_partitions" it is silently ignored.

This is because the SQL provided by the user is extracted here, not the SQL that would be generated:

https://github.com/moj-analytical-services/splink/blob/ac1cf23d41215492b68d56841c46de3bd93c7e99/splink/linker.py#L1655

I am not sure whether this should be an error, or whether EM should actually use salting. It seems like the challenge of unbalanced partitions would come up in the EM case; unsure whether salting is an appropriate solution.

To Reproduce

# Begin by reading in the tutorial data again
from splink.duckdb.linker import DuckDBLinker
from splink.datasets import splink_datasets
import altair as alt
df = splink_datasets.fake_1000

import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

linker.estimate_u_using_random_sampling(max_pairs=1e6)

training_blocking_rule = {"blocking_rule": "l.first_name = r.first_name and l.surname = r.surname", "salting_partitions": 5}
training_session_fname_sname = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

print(training_session_fname_sname._blocking_rule_for_training.salting_partitions)

prints 1.

OS:

Linux

Splink version:

3.9.9

Have you tried this on the latest `master` branch?

[X] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

[X] I agree

RobinL commented 10 months ago

Thanks for the report. You're right, salting should be respected during EM training but currently it is not. We should fix this.

The reason we haven't respected this option in the past is that salting was seldom used at all, and further, it's less important for EM training than for predict(), since often EM training runs create fewer pairs.

But you're totally right, there are some instances in which it may be of value. Especially since we've recently found that salting can very substantially speed up duckdb workflows.

zmbc commented 10 months ago

since often EM training runs create fewer pairs.

Interesting. I had not been using a very restrictive blocking rule for training.

Is there documentation somewhere of why splink recommends this approach? I haven't come across any methods papers that describe it, and I'd be concerned in particular with how it could introduce additional reliance on conditional independence, and how "backing out" the blocking rule works when the relevant parameters for the blocking columns haven't yet been estimated (seems like a bit of a circular problem).

RobinL commented 10 months ago

Hiya. The best docs are here, but they're not great: https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html https://moj-analytical-services.github.io/splink/topic_guides/blocking/predictions.html

You're completely right, though - the approach we use to EM training where we block on a subset of variables, and then train the parameters of the remaining variables, is susceptible to bias when the conditional indepdence assumption is violated (i.e. in most real-world examples!).

One way to mitigate this to a certain extent is to train the m params multiple times and look for stability.

So for example, to train a first_name variable, you could do an EM training run blocking on:

surname and dob
postcode and middle name

and check that the trained m parameters are not that different using this: https://moj-analytical-services.github.io/splink/charts/parameter_estimate_comparisons_chart.html?h=parameter+estimate

One final thing to say is that we've found in practice that accurately is not that sensitive to m values. So long as they are ballpark correct you usually get pretty good prediction results.

zmbc commented 10 months ago

is susceptible to bias when the conditional indepdence assumption is violated

But it's more than that, right? If I create a new model and run a training session blocking on surname and dob, the m parameters for surname and dob themselves haven't been estimated yet. So I assume the "backing out" is just done with the default initial values, which could be quite inaccurate.

My understanding is that Splink more or less intends to be "fastLink at scale" -- so I'm curious why Splink doesn't implement EM based on a random sample of pairs, instead of an exhaustive enumeration of pairs meeting some criteria. In the fastLink paper they recommend running EM on only 5% of the dataset instead of the whole thing, and report "nearly indistinguishable" parameter estimates.

RobinL commented 10 months ago

I'll start by saying you might be right - a simple sampling approach may be best - we haven't tried it and we should. At a minimum, it should be an option.

Let me try and explain why we didn't go down that route:

When data becomes large, you need a very large sample. For example, if you have 1 million input records, each of which has one duplicate, and you sample a billion pairs, then only 1,000 will be a match. For a detailed settings object with e.g. 2 levels of jaro winkler fuzzy matching, there may be no examples of (say) the second fuzzy level within those 1,000. More broadly, you might not have enough examples to get a very precise estimate.
The strategy is probably infeasible for very large datasets (e.g. 10 million records plus) due to the above issue. You'd need a massive sample even to achieve a small sample of matches.
As a result, we recommend the blocking/round robin approach (because as soon as you block, you get far more matches in your sample)

Finally, I'll try to clarify the point re 'susceptible to bias '. I'm not sure I agree with your point, but bear with me as it's a little tricky to explain.

Consider a model with first_name, surname, and dob

Let's suppose we want to estimate the m (and optionally u) probabilities for surname and dob (but not first_name). Suppose we have infinite data and processing power. Under (and only under) the conditional independence assumption, the following two estimation procedures will give the same results:

Using EM with no blocking
Using EM blocking on first_name

Using EM with sampled blocking will also give essentially the same result (less precision, but no bias)

Note that the starting parameters of first_name are irrelevant because EM converges to the correct (population) parameters for the surname and dob m and u estimates.

This is what leads to our recommended 'round robin' approach, where you e.g. block on first_name, to get the suranme and dob estimates, and then block on e.g. surname to get the dob and first_name estimates. You can get multiple estimate (e.g. here you get two estimates for dob). These are averaged

Finally, on 'backing out' (probably this term is ambiguous, so I might be misinterpreting what you mean by it)

The only parameter estimate we used to back out was lambda (probability_two_random_records_match). However, in practice we found this didn't work well, so we switched to using a direct estimation procedue

See: https://github.com/moj-analytical-services/splink/issues/462 https://github.com/moj-analytical-services/splink/pull/734

p.s. I get the sense you know at least as much as we do about all of this! Would be happy to have a chat (and would be interesting to hear your use case). Feel free to reach out at robinlinacre@hotmail.com if you're interested.

zmbc commented 10 months ago

When data becomes large, you need a very large sample.

Yes, and the fastLink paper does not address this. This was my guess of why you headed down this road, and it seems reasonable. But also under-theorized, and not adequately (to my mind) highlighted as such.

I am aware of one other approach to dealing with this problem, which I read in this paper: https://doi.org/10.2478/jos-2020-0039. Essentially, it uses a weighted form of random sampling, and also proposes using the most and least discordant pairs (which heavily relies on conditional independence). I think you could do either or both of these things. There are probably other approaches too.

Re: your explanation of the round-robin approach, I think the part I might be missing is this:

Note that the starting parameters of first_name are irrelevant

I did not think they would be irrelevant. My understanding is that by default, the probability_two_random_records_match is fixed. Then EM uses a mixture model, but it needs a proportion of pairs that are in the "match" class, and that proportion isn't probability_two_random_records_match if blocking rules are used for training. What I thought "backing out" meant was that you would calculate the mixture of matches and non-matches in a training session by starting with probability_two_random_records_match and applying the Bayes factor associated with the blocking rule used for training. If that isn't what you are doing, it seems like it would be more correct, no? But it introduces the cyclic problem I mentioned before (you need m probabilities to get m probabilities).

p.s. I'd love to discuss this more and share some of the work we've been doing with Splink! I work with Abraham Flaxman at the Institute for Health Metrics and Evaluation -- I think you've already been in contact. I'll send you an email 😃

RobinL commented 10 months ago

Thanks for the link - looks interesting.

Ah yes - there's some additional context that I didn't mention relating to your (correct) point:

and applying the Bayes factor associated with the blocking rule used for training

To explain: Consider again a model with first_name, surname, and dob.

Again, we are going to block on first_name to estimate the m and u parameters for surname and dob.

The process is as follows:

First, we adjust probability_two_random_records_match upwards to account for the fact we're blocking on first_name, using the existing m and u values for first_name
We then allow EM to estimate the m values. The probability_two_random_records_match is re-computed each iteration and used in the following iteration.
When the algorithm converges with throw away the value of probability_two_random_records_match from the training session.

Note the cyclic problem is avoided for two reasons:

The upward adjustment to probability_two_random_records_match will be primarily driven by the u probability, which we can estimate directly using linker.estimate_u_using_random_sampling
In any case, we allow probability_two_random_records_match to vary during the EM session, so it's starting value doesn't matter very much

You can see this at work using this script:

code to produce logs

```python import logging from splink.datasets import splink_datasets from splink.duckdb.blocking_rule_library import block_on from splink.duckdb.comparison_library import ( exact_match, levenshtein_at_thresholds, ) from splink.duckdb.linker import DuckDBLinker df = splink_datasets.fake_1000 df.head() settings = { "link_type": "dedupe_only", "blocking_rules_to_generate_predictions": [ block_on(["surname"]), block_on(["dob"]), block_on(["first_name"]), ], "comparisons": [ levenshtein_at_thresholds("first_name", 2), exact_match("surname"), exact_match("dob"), ], "max_iterations": 2, } linker = DuckDBLinker(df, settings) deterministic_rules = [block_on(["first_name", "surname"])] linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7) linker.estimate_u_using_random_sampling(max_pairs=1e6) logging.getLogger("splink").setLevel(7) linker.estimate_parameters_using_expectation_maximisation(block_on("first_name")) ```

That outputs:

----- Starting EM training session -----

Original prob two random records match: 0.001
Increasing prob two random records match using first_name - Exact match using bayes factor 457.019

Prob two random records match adjusted for blocking on l."first_name" = r."first_name": 0.273
Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name

Note also that you can capture the history of the parameters in charts like so:

session = linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))
session.probability_two_random_records_match_iteration_chart()

session.match_weights_interactive_history_chart()

In earlier version of Splink, we used to then 'back out' the 'unadjusted' probability_two_random_records_match as the new estimate, but it turns out that, due to violations of conditional independence, it often gives an answer that's wildly inaccurate.

Finally in terms of how we tested this approach, I have a synthetic dataset with known m and u values and I used this to validate that the correct m and u parameter estimates are made using our process

zmbc commented 10 months ago

Thank you so much for taking the time to explain this!

The key part I did not expect was:

When the algorithm converges we throw away the value of probability_two_random_records_match from the training session.

I had been using fix_probability_two_random_records_match=True, which I thought was necessary because I didn't want to use EM for estimating lambda. If I'm understanding correctly, running the method this way does exhibit the cyclical problem I mentioned (even if that problem isn't huge in most applications -- though I'm not sure I follow why the adjustment is "primarily driven by the u probability" since the adjustment is based on a ratio).

It looks like all I need to do to achieve what I thought I was doing is leave populate_probability_two_random_records_match_from_trained_values=False (which is the default), correct? I do not think the documentation is very explicit about this point -- would you be open to a PR for that?

More generally, my hunch is that Splink's approach works well for the vast majority of applications, but there are probably some pathological cases where it breaks down -- these will be cases where the initial m and u probabilities are not very close to correct, and there are actually more than 2 clusters in the dataset (which is pretty common -- for example, a cluster of pairs of people who do not live together, a cluster of pairs of people in the same household, and a cluster of pairs of records about the same person). Then EM might head down the path of splitting into the wrong dichotomous clusters, if not given better guidance on the lambda parameter. But this is all speculative; I'll see if I can construct a MCVE, now that I have a better grasp on what Splink is doing.

RobinL commented 10 months ago

It's a good point, and yes I'd be very happy to accept a PR to improve!

I guess part of the reason the current docs are bad is that historically, in the original design, we were using the probability_two_random_records_match from the training session. So the API was designed with this in mind. It was only later that we started to throw away the estimate, and found that using estimate_probability_two_random_records_match gave a better result

so:

fix_probability_two_random_records_match prevents it from varying during EM training. I would generally recommend against setting it to True. But it may help convergence in some cases, and may be justified if you're confident that conditional independence holds.
then populate_probability_two_random_records_match_from_trained_values will prevent it being thrown away. Specifically, the population value will be 'backed out', which you can see in the following log:

Original prob two random records match: 0.001
Increasing prob two random records match using first_name - Exact match using bayes factor 163.975

Prob two random records match adjusted for blocking on l."first_name" = r."first_name": 0.119
Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name

Probability two random records match from trained model blocking on l."first_name" = r."first_name": 0.237
Reversing comparison level first_name using bayes factor 163.975
This estimate of probability two random records match now:  0.002 with reciprocal 528.369

---------

Median of prop of matches estimates: 0.002 reciprocal 528.369

script to reproduce

```python import logging from splink.datasets import splink_datasets from splink.duckdb.blocking_rule_library import block_on from splink.duckdb.comparison_library import ( exact_match, levenshtein_at_thresholds, ) from splink.duckdb.linker import DuckDBLinker df = splink_datasets.fake_1000 df.head() settings = { "link_type": "dedupe_only", "blocking_rules_to_generate_predictions": [ block_on(["surname"]), block_on(["dob"]), block_on(["first_name"]), ], "comparisons": [ levenshtein_at_thresholds("first_name", 2), exact_match("surname"), exact_match("dob"), ], "max_iterations": 2, } linker = DuckDBLinker(df, settings) deterministic_rules = [block_on(["first_name", "surname"])] linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7) linker.estimate_u_using_random_sampling(max_pairs=1e6) logging.getLogger("splink").setLevel(5) linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"), populate_probability_two_random_records_match_from_trained_values=True) ```

The reason the adjustment it's driven primarily by the u value is that we're talking about the m and u values for an exact match, so whilst it is a ratio m/u, for plausible values of m, the u value matters much more.

Why?

m for an exact match tends to be 0.9 < m <1.0. Amongst matches, most variables match >90% of the time. There may be some unusual variables where this is not the case, but it would be extremely unusual to observe (say) m < 0.25 on an exact match. As a result, the largest effect m can have is to change the size of m/u by a 4x multiplier (m=0.25 vs m=1.0)
u on the other hand will be anything between around 0.7 (for e.g. gender with an imbalance where most are male), and very small values like 0.00003 for date of birth. That means the effect it has can be anything from small to massive (a multipler of ~1.4x - 33,000x`

In Fellegi Sunter models more generally, the case where m values can have a large effect is for a non match, specifically cases where it's exceedingly rare to observe a non-match amongst matching records. But this is fairly unusual, so most of the time, it's the u value driving the ratio/match weight.

RobinL commented 10 months ago

and there are actually more than 2 clusters in the dataset (which is pretty common -- for example, a cluster of pairs of people who do not live together, a cluster of pairs of people in the same household, and a cluster of pairs of records about the same person). Then EM might head down the path of splitting into the wrong dichotomous clusters, if not given better guidance on the lambda parameter

That's definitely correct and very astute. We've certainly seen cases of problematic EM convergence (multiple local minima) where the entity type is ambiguous. In fact - the example was exactly the one you suggested (household vs person). I guess that's a good example where fix_probability_two_random_records_match = True could be a good idea

Overall the story of the development of Splink has been partly about slowly locking down EM to improve convergence:

originally we used EM to estimate all parameters, but had convergence problems fairly regularly
then we realised that in most cases u is better estimated directly. and realised that fixing u also helps convergence because it prevents u converging to obviously wrong values
then we realised that in most cases probability_two_random_records_match is better estimated directly, and also further helps convergence

with all that said, i'm still somewhat sceptical about how accurately EM estimates the m values. But at the same time, in practice, getting them a bit wrong doesn't matter that much (because it's typically u driving the match weight)

zmbc commented 10 months ago

Thanks for that explanation! It all makes sense to me.

The story of the development of Splink is really interesting. From my perspective, while Splink started as more or less a re-implementation of fastLink, it has since become a package with substantial new methods innovations. I appreciate your focus on how things work in practice, but I also think it would be great to describe a little more comprehensively what those innovations mean on a theoretical level and what assumptions they rely on.

RobinL commented 10 months ago

Closed by #1832

moj-analytical-services / splink

EM silently ignores salting #1817

What happens?

To Reproduce

OS:

Splink version:

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Have you tried this on the latest `master` branch?