Thanks for the report. You're right, salting should be respected during EM training but currently it is not. We should fix this.
The reason we haven't respected this option in the past is that salting was seldom used at all; furthermore, it's less important for EM training than for predict(), since EM training runs often create fewer pairs.
But you're totally right, there are some instances in which it may be of value, especially since we've recently found that salting can very substantially speed up duckdb workflows.
Interesting. I had not been using a very restrictive blocking rule for training.
Is there documentation somewhere of why splink recommends this approach? I haven't come across any methods papers that describe it, and I'd be concerned in particular with how it could introduce additional reliance on conditional independence, and how "backing out" the blocking rule works when the relevant parameters for the blocking columns haven't yet been estimated (seems like a bit of a circular problem).
Hiya. The best docs are here, but they're not great: https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html https://moj-analytical-services.github.io/splink/topic_guides/blocking/predictions.html
You're completely right, though - the approach we use for EM training, where we block on a subset of variables and then train the parameters of the remaining variables, is susceptible to bias when the conditional independence assumption is violated (i.e. in most real-world examples!).
One way to mitigate this to a certain extent is to train the m params multiple times and look for stability.
So for example, to train a first_name variable, you could do one EM training run blocking on surname and another blocking on dob, and check that the trained m parameters are not that different using this chart: https://moj-analytical-services.github.io/splink/charts/parameter_estimate_comparisons_chart.html?h=parameter+estimate
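For example (a sketch, assuming the Splink 3.x DuckDB API; the fake_1000 dataset and exact-match comparisons are just illustrative stand-ins):

```python
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
        cl.exact_match("dob"),
    ],
}

linker = DuckDBLinker(splink_datasets.fake_1000, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Two EM sessions, each blocking on a different column; each yields an
# independent estimate of first_name's m parameters
linker.estimate_parameters_using_expectation_maximisation(block_on("surname"))
linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))

# Overlay the estimates from the two sessions to check they broadly agree
linker.parameter_estimate_comparisons_chart()
```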
One final thing to say is that we've found in practice that accuracy is not that sensitive to the m values. So long as they are ballpark correct, you usually get pretty good prediction results.
is susceptible to bias when the conditional independence assumption is violated
But it's more than that, right? If I create a new model and run a training session blocking on surname and dob, the m parameters for surname and dob themselves haven't been estimated yet. So I assume the "backing out" is just done with the default initial values, which could be quite inaccurate.
My understanding is that Splink more or less intends to be "fastLink at scale" -- so I'm curious why Splink doesn't implement EM based on a random sample of pairs, instead of an exhaustive enumeration of pairs meeting some criteria. In the fastLink paper they recommend running EM on only 5% of the dataset instead of the whole thing, and report "nearly indistinguishable" parameter estimates.
I'll start by saying you might be right - a simple sampling approach may be best - we haven't tried it and we should. At a minimum, it should be an option.
Let me try and explain why we didn't go down that route. The main issue is that true matches are extremely rare amongst random pairs, so when data becomes large, you need a very large sample to capture enough matches to estimate the m parameters well.
Finally, I'll try to clarify the point re 'susceptible to bias'. I'm not sure I agree with your point, but bear with me as it's a little tricky to explain.
Consider a model with first_name, surname, and dob. Let's suppose we want to estimate the m (and optionally u) probabilities for surname and dob (but not first_name). Suppose we have infinite data and processing power. Under (and only under) the conditional independence assumption, the following two estimation procedures will give the same results:

1. Running EM on all pairwise record comparisons (i.e. no blocking).
2. Running EM only on the pairwise comparisons where first_name agrees (i.e. blocking on first_name).

Using EM with sampled blocking will also give essentially the same result (less precision, but no bias).
Note that the starting parameters of first_name are irrelevant, because EM converges to the correct (population) parameters for the surname and dob m and u estimates.
This is what leads to our recommended 'round robin' approach, where you e.g. block on first_name to get the surname and dob estimates, and then block on e.g. surname to get the dob and first_name estimates. You can get multiple estimates this way (e.g. here you get two estimates for dob). These are averaged.
Finally, on 'backing out' (probably this term is ambiguous, so I might be misinterpreting what you mean by it): in earlier versions of Splink, we used the EM training session to estimate the overall probability_two_random_records_match (by 'backing out' the effect of the blocking rule). However, in practice we found this didn't work well, so we switched to using a direct estimation procedure. See: https://github.com/moj-analytical-services/splink/issues/462 https://github.com/moj-analytical-services/splink/pull/734
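For reference, the direct procedure looks roughly like this (a sketch assuming the Splink 3.x API; the deterministic rules and the recall value are made up for illustration, and linker is a configured DuckDBLinker as in the sketch above):

```python
# Direct estimation of probability_two_random_records_match from
# deterministic rules plus an assumed recall, rather than from EM
deterministic_rules = [
    "l.first_name = r.first_name and l.surname = r.surname",
    "l.first_name = r.first_name and l.dob = r.dob",
]
linker.estimate_probability_two_random_records_match(
    deterministic_rules,
    recall=0.7,  # assumed: the share of true matches the rules capture
)
```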
p.s. I get the sense you know at least as much as we do about all of this! Would be happy to have a chat (and would be interesting to hear your use case). Feel free to reach out at robinlinacre@hotmail.com if you're interested.
When data becomes large, you need a very large sample.
Yes, and the fastLink paper does not address this. This was my guess of why you headed down this road, and it seems reasonable. But also under-theorized, and not adequately (to my mind) highlighted as such.
I am aware of one other approach to dealing with this problem, which I read in this paper: https://doi.org/10.2478/jos-2020-0039. Essentially, it uses a weighted form of random sampling, and also proposes using the most and least discordant pairs (which heavily relies on conditional independence). I think you could do either or both of these things. There are probably other approaches too.
Re: your explanation of the round-robin approach, I think the part I might be missing is this:
Note that the starting parameters of first_name are irrelevant
I did not think they would be irrelevant. My understanding is that by default, the probability_two_random_records_match is fixed. Then EM uses a mixture model, but it needs a proportion of pairs that are in the "match" class, and that proportion isn't probability_two_random_records_match if blocking rules are used for training. What I thought "backing out" meant was that you would calculate the mixture of matches and non-matches in a training session by starting with probability_two_random_records_match and applying the Bayes factor associated with the blocking rule used for training. If that isn't what you are doing, it seems like it would be more correct, no? But it introduces the cyclic problem I mentioned before (you need m probabilities to get m probabilities).
p.s. I'd love to discuss this more and share some of the work we've been doing with Splink! I work with Abraham Flaxman at the Institute for Health Metrics and Evaluation -- I think you've already been in contact. I'll send you an email 😃
Thanks for the link - looks interesting.
Ah yes - there's some additional context that I didn't mention relating to your (correct) point:
and applying the Bayes factor associated with the blocking rule used for training
To explain: consider again a model with first_name, surname, and dob. Again, we are going to block on first_name to estimate the m and u parameters for surname and dob.
The process is as follows:
1. Adjust probability_two_random_records_match upwards to account for the fact we're blocking on first_name, using the existing m and u values for first_name.
2. Run EM to estimate the m values. The probability_two_random_records_match is re-computed each iteration and used in the following iteration.
3. When the algorithm converges, we throw away the value of probability_two_random_records_match from the training session.

Note the cyclic problem is avoided for two reasons:
1. The adjusted probability_two_random_records_match will be primarily driven by the u probability, which we can estimate directly using linker.estimate_u_using_random_sampling.
2. We allow probability_two_random_records_match to vary during the EM session, so its starting value doesn't matter very much.

You can see this at work using a script like the one below:
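(A sketch, assuming the Splink 3.x DuckDB API - the dataset and comparison choices are illustrative, and the exact numbers in the log depend on the data and starting parameters.)

```python
import logging

from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets

# Surface Splink's training log messages, including the
# 'prob two random records match' adjustments shown below
logging.basicConfig()
logging.getLogger("splink").setLevel(logging.INFO)

settings = {
    "link_type": "dedupe_only",
    "probability_two_random_records_match": 0.001,
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
        cl.exact_match("dob"),
    ],
}

linker = DuckDBLinker(splink_datasets.fake_1000, settings)

# u values are estimated directly, before EM
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# EM session blocking on first_name, training m for surname and dob;
# lambda is adjusted upwards for the blocking rule, varies during EM,
# and is discarded when the session converges
linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))
```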
That outputs:
```
----- Starting EM training session -----

Original prob two random records match: 0.001
Increasing prob two random records match using first_name - Exact match using bayes factor 457.019
Prob two random records match adjusted for blocking on l."first_name" = r."first_name": 0.273

Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
    - first_name
```
Note also that you can capture the history of the parameters in charts like so:
```python
session = linker.estimate_parameters_using_expectation_maximisation(block_on("first_name"))
session.probability_two_random_records_match_iteration_chart()
session.match_weights_interactive_history_chart()
```
In earlier versions of Splink, we used to then 'back out' the 'unadjusted' probability_two_random_records_match as the new estimate, but it turns out that, due to violations of conditional independence, this often gives an answer that's wildly inaccurate.
Finally, in terms of how we tested this approach: I have a synthetic dataset with known m and u values, and I used this to validate that the correct m and u parameter estimates are made using our process.
Thank you so much for taking the time to explain this!
The key part I did not expect was:
When the algorithm converges we throw away the value of probability_two_random_records_match from the training session.
I had been using fix_probability_two_random_records_match=True, which I thought was necessary because I didn't want to use EM for estimating lambda. If I'm understanding correctly, running the method this way does exhibit the cyclical problem I mentioned (even if that problem isn't huge in most applications -- though I'm not sure I follow why the adjustment is "primarily driven by the u probability", since the adjustment is based on a ratio).
It looks like all I need to do to achieve what I thought I was doing is leave populate_probability_two_random_records_match_from_trained_values=False (which is the default), correct? I do not think the documentation is very explicit about this point -- would you be open to a PR for that?
More generally, my hunch is that Splink's approach works well for the vast majority of applications, but there are probably some pathological cases where it breaks down -- these will be cases where the initial m and u probabilities are not very close to correct, and there are actually more than 2 clusters in the dataset (which is pretty common -- for example, a cluster of pairs of people who do not live together, a cluster of pairs of people in the same household, and a cluster of pairs of records about the same person). Then EM might head down the path of splitting into the wrong dichotomous clusters, if not given better guidance on the lambda parameter. But this is all speculative; I'll see if I can construct a MCVE, now that I have a better grasp on what Splink is doing.
It's a good point, and yes I'd be very happy to accept a PR to improve!
I guess part of the reason the current docs are bad is that historically, in the original design, we were using the probability_two_random_records_match from the training session, so the API was designed with this in mind. It was only later that we started to throw away the estimate, having found that estimate_probability_two_random_records_match gave a better result.
so:

- fix_probability_two_random_records_match prevents it from varying during EM training. I would generally recommend against setting it to True, but it may help convergence in some cases, and may be justified if you're confident that conditional independence holds.
- populate_probability_two_random_records_match_from_trained_values prevents it from being thrown away. Specifically, the population value will be 'backed out', which you can see in the following log:

```
Original prob two random records match: 0.001
Increasing prob two random records match using first_name - Exact match using bayes factor 163.975
Prob two random records match adjusted for blocking on l."first_name" = r."first_name": 0.119

Estimating the m probabilities of the model by blocking on:
l."first_name" = r."first_name"

Parameter estimates will be made for the following comparison(s):
    - surname
    - dob

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
    - first_name

Probability two random records match from trained model blocking on l."first_name" = r."first_name": 0.237
Reversing comparison level first_name using bayes factor 163.975
This estimate of probability two random records match now: 0.002 with reciprocal 528.369
---------
Median of prop of matches estimates: 0.002 reciprocal 528.369
```
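To make that concrete, a sketch (again assuming the Splink 3.x API; the odds arithmetic is my reading of the 'Reversing' step, consistent with the numbers in the log rather than a quote of Splink's internals, and the logged values are rounded so it only reproduces them approximately):

```python
from splink.duckdb.blocking_rule_library import block_on

# Both options are keyword arguments of the EM training method
# (linker: a configured DuckDBLinker, as in the earlier sketches)
linker.estimate_parameters_using_expectation_maximisation(
    block_on("first_name"),
    fix_probability_two_random_records_match=False,  # default: lambda varies during EM
    populate_probability_two_random_records_match_from_trained_values=True,  # keep the 'backed out' value
)

# The 'Reversing' step read as a simple odds update: convert the trained
# probability to odds, divide by the blocking rule's Bayes factor, convert back
trained_p, bayes_factor = 0.237, 163.975
odds = trained_p / (1 - trained_p) / bayes_factor
backed_out_p = odds / (1 + odds)
print(round(backed_out_p, 3), round(1 / backed_out_p, 1))  # ~0.002 and ~529
```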
The reason the adjustment is driven primarily by the u value is that we're talking about the m and u values for an exact match, so whilst it is a ratio m/u, for plausible values of m, the u value matters much more.
Why?

- m for an exact match tends to be 0.9 < m < 1.0. Amongst matches, most variables match >90% of the time. There may be some unusual variables where this is not the case, but it would be extremely unusual to observe (say) m < 0.25 on an exact match. As a result, the largest effect m can have is to change the size of m/u by about a 4x multiplier (m=0.25 vs m=1.0).
- u, on the other hand, will be anything between around 0.7 (for e.g. gender with an imbalance where most are male) and very small values like 0.00003 for date of birth. That means the effect it has can be anything from small to massive (a multiplier of ~1.4x to ~33,000x).

In Fellegi-Sunter models more generally, the case where m values can have a large effect is for a non-match, specifically cases where it's exceedingly rare to observe a non-match amongst matching records. But this is fairly unusual, so most of the time, it's the u value driving the ratio/match weight.
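A quick illustration of that asymmetry (plain arithmetic; the m and u values are just the illustrative ones from above):

```python
# The Bayes factor for an exact match is m/u. Plausible variation in m
# moves it by at most ~4x; plausible variation in u moves it by orders
# of magnitude.
for m in (0.25, 0.95, 1.0):          # extreme-to-typical m values
    for u in (0.7, 0.01, 0.00003):   # gender-like, name-like, dob-like u values
        print(f"m={m}, u={u}: m/u = {m / u:,.2f}")
```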
and there are actually more than 2 clusters in the dataset (which is pretty common -- for example, a cluster of pairs of people who do not live together, a cluster of pairs of people in the same household, and a cluster of pairs of records about the same person). Then EM might head down the path of splitting into the wrong dichotomous clusters, if not given better guidance on the lambda parameter
That's definitely correct and very astute. We've certainly seen cases of problematic EM convergence (multiple local minima) where the entity type is ambiguous. In fact - the example was exactly the one you suggested (household vs person). I guess that's a good example where fix_probability_two_random_records_match = True could be a good idea.
Overall the story of the development of Splink has been partly about slowly locking down EM to improve convergence:
- We realised u is better estimated directly, and that fixing u also helps convergence, because it prevents u converging to obviously wrong values.
- We realised probability_two_random_records_match is better estimated directly, which also further helps convergence.

With all that said, I'm still somewhat sceptical about how accurately EM estimates the m values. But at the same time, in practice, getting them a bit wrong doesn't matter that much (because it's typically u driving the match weight).
Thanks for that explanation! It all makes sense to me.
The story of the development of Splink is really interesting. From my perspective, while Splink started as more or less a re-implementation of fastLink, it has since become a package with substantial new methods innovations. I appreciate your focus on how things work in practice, but I also think it would be great to describe a little more comprehensively what those innovations mean on a theoretical level and what assumptions they rely on.
Closed by #1832
What happens?
The estimate_parameters_using_expectation_maximisation method takes a blocking rule, but if you pass in a blocking rule that includes "salting_partitions", it is silently ignored. This is because the SQL provided by the user is extracted here, not the SQL that would be generated:
https://github.com/moj-analytical-services/splink/blob/ac1cf23d41215492b68d56841c46de3bd93c7e99/splink/linker.py#L1655
I am not sure whether this should be an error, or whether EM should actually use salting. It seems like the challenge of unbalanced partitions would come up in the EM case; unsure whether salting is an appropriate solution.
To Reproduce
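A sketch of the shape of the reproduction (assuming the Splink 3.x DuckDB API and the dict form of a salted blocking rule; the dataset and comparisons are illustrative, and the exact check of the partition count is not reproduced here):

```python
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
from splink.datasets import splink_datasets

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("surname"),
        cl.exact_match("dob"),
    ],
}
linker = DuckDBLinker(splink_datasets.fake_1000, settings)

# A blocking rule requesting salting; EM training silently drops the
# salting_partitions setting, so a single partition is used
salted_rule = {
    "blocking_rule": 'l."first_name" = r."first_name"',
    "salting_partitions": 10,
}
linker.estimate_parameters_using_expectation_maximisation(salted_rule)
```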
prints 1.

OS: Linux
Splink version: 3.9.9
Have you tried this on the latest master branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?