moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] EM algorithm performance improvement #1363

Closed · aymonwuolanne closed this 1 year ago

aymonwuolanne commented 1 year ago

Hi all, I've been trying out splink for some larger scale probabilistic linkages at the Australian Bureau of Statistics (along with @nerskin). I thought you'd be interested to hear about a potential efficiency improvement.

Is your proposal related to a problem?

The iterations of the EM algorithm are slower than they need to be (with some caveats).

Describe the solution you'd like

At the moment, the E step of each iteration runs predict on the entire blocked dataframe with the updated m and u parameters. If you ignore term frequency adjustments for the EM estimation, you can do all of this work on an aggregated dataset that counts the number of unique agreement patterns (in SQL it's just a simple GROUP BY over all the gamma columns). The counting of the agreement patterns only needs to happen once, at the beginning of the EM estimation, rather than at each iteration. This is significantly more efficient: the runtime depends only on the number of unique agreement patterns rather than the number of records in the dataset.
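To illustrate the aggregation step, here is a minimal sketch using sqlite3. The table and column names are illustrative assumptions, not Splink's actual internal schema; the point is just that one GROUP BY over the gamma columns collapses the blocked comparisons down to distinct agreement patterns with counts.

```python
import sqlite3

# Hypothetical blocked-comparisons table with two gamma columns
# (names are assumptions for illustration, not Splink's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocked (gamma_name INTEGER, gamma_dob INTEGER)")
rows = [(2, 1), (2, 1), (0, 0), (2, 0), (0, 0), (2, 1)]
conn.executemany("INSERT INTO blocked VALUES (?, ?)", rows)

# Count each distinct agreement pattern once, before the EM loop starts.
# Every subsequent EM iteration works on these few rows, not the full table.
patterns = conn.execute(
    "SELECT gamma_name, gamma_dob, COUNT(*) AS n "
    "FROM blocked GROUP BY gamma_name, gamma_dob"
).fetchall()
```

Here six record pairs collapse to three (pattern, count) rows, and in realistic linkages the reduction is far larger.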

I've tested an implementation of this idea and it achieves the speedup I was expecting; you can check it out here: https://github.com/aymonwuolanne/splink/tree/EM-speedup.

Describe alternatives you've considered

Additional context

This idea comes from the Fastlink paper (Enamorado, Fifield, and Imai 2019):

First, for implementing the E step, notice that the match probability given in equation (5) takes the same values for two pairs if their agreement patterns are identical... Thus, the E step can be implemented by computing the match probability for each of the realized agreement patterns. Often, the total number of realized agreement patterns is much smaller than the number of all possible agreement patterns.

Second, the M step defined in equations (S1) and (S2) requires the summation of match probabilities across all pairs of their subset. Because this probability is identical within each agreement pattern, all we have to do is to count the total number of pairs that have each agreement pattern.
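The two quoted observations can be sketched as a toy EM loop over aggregated counts. This is a simplified illustration under assumed notation (lam for the match proportion, m and u as lists of agreement probabilities per comparison column), not Splink's or fastLink's actual implementation; it assumes binary gamma values for brevity.

```python
from collections import Counter

# Agreement patterns with counts, computed once up front (toy data).
pattern_counts = Counter({(1, 1): 50, (1, 0): 200, (0, 1): 150, (0, 0): 9600})

# Hypothetical starting parameters (assumptions, not Splink defaults).
lam = 0.1        # P(match)
m = [0.9, 0.8]   # P(gamma_k = 1 | match)
u = [0.1, 0.2]   # P(gamma_k = 1 | non-match)

for _ in range(20):
    # E step: the match probability is identical for all pairs sharing an
    # agreement pattern, so compute it once per distinct pattern.
    resp = {}
    for pattern in pattern_counts:
        pm, pu = lam, 1 - lam
        for k, g in enumerate(pattern):
            pm *= m[k] if g else 1 - m[k]
            pu *= u[k] if g else 1 - u[k]
        resp[pattern] = pm / (pm + pu)

    # M step: sums over all pairs become sums over patterns,
    # weighted by the pattern counts.
    total = sum(pattern_counts.values())
    matches = sum(resp[p] * n for p, n in pattern_counts.items())
    lam = matches / total
    for k in range(len(m)):
        m[k] = sum(resp[p] * n for p, n in pattern_counts.items() if p[k]) / matches
        u[k] = sum((1 - resp[p]) * n
                   for p, n in pattern_counts.items() if p[k]) / (total - matches)
```

Each iteration touches only the four distinct patterns rather than the 10,000 underlying pairs, which is the entire source of the speedup.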

Note: this approach slightly changes the estimation of the m and u parameters when term frequency adjustments are used. Since the term frequency adjustment used in calculating the match probability depends on more than just the gamma columns, this trick does not apply in that case. However, the Fastlink paper describes term frequency adjustments as an ex-post adjustment: the parameter estimation occurs without the adjustments applied, and the term frequency adjustments are applied only at the final prediction step.

From what I've tested, the estimation of the parameters seems very close to the original method anyway.

RobinL commented 1 year ago

Thanks so much for this, it looks great! Really interesting idea to use this for efficiency at the EM step.

I'd certainly be interested in including this in Splink - would you consider working with us to do a PR? In particular, if you can PR in a working version, I'm happy to add in tests and think about whether we need to make any adjustments to the spark caching/execution/materialisation pipeline. (Of course, you're welcome to do this if you'd like!!)

Let me try and summarise some thoughts on the (fairly long!) history of this, and how we could include it in Splink

History

Thoughts on including into Splink

I would therefore suggest that this efficiency improvement is included as an option that is off by default - i.e. an argument to the estimate_parameters_using_expectation_maximisation function, which the user can set to true if they're experiencing slow runtimes.

We could also write logic to enable it by default when the user is running a model that has no tf adjustments.

Does that make sense to you?

Finally, note that this should be quite easy to write tests for.

aymonwuolanne commented 1 year ago

Thanks for the detailed response Robin! The history behind this is interesting. I agree that having it as optional behaviour is a good idea - the estimation is fast enough in smaller cases.

I'll put a pull request in later today.

RobinL commented 1 year ago

Closed by #1369