Closed aymonwuolanne closed 1 year ago
Thanks so much for this, it looks great! Really interesting idea to use this for efficiency at the EM step.
I'd certainly be interested in including this in Splink - would you consider working with us to do a PR? In particular, if you can PR in a working version, I'm happy to add in tests and think about whether we need to make any adjustments to the spark caching/execution/materialisation pipeline. (Of course, you're welcome to do this if you'd like!!)
Let me try and summarise some thoughts on the (fairly long!) history of this, and how we could include it in Splink
I would suggest therefore that this efficiency improvement is included as an option that is off by default - i.e. an argument to the estimate_parameters_using_expectation_maximisation function, which the user can set to True if they're experiencing slow runtimes.
We could also write logic to enable it by default in the case that the user is running a model which has no tf adjustments
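A minimal sketch of how that default could work - the helper name and the dict representation of comparisons are assumptions for illustration, not Splink's actual internals:

```python
def should_use_aggregated_em(estimate_without_term_frequencies, comparisons):
    """Hypothetical helper: decide whether to run the E step on aggregated
    agreement-pattern counts. Honour an explicit opt-in from the user;
    otherwise enable the aggregated E step automatically when no comparison
    uses a term frequency adjustment (the case where the trick is exact)."""
    if estimate_without_term_frequencies:
        return True
    return not any(c.get("tf_adjustment", False) for c in comparisons)
```

Here `estimate_without_term_frequencies` stands in for the proposed boolean argument, and each comparison is a plain dict with a `tf_adjustment` flag purely for the sketch.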
Does that make sense to you?
Finally, to note: this should be quite easy to write tests for, because we can:
Thanks for the detailed response Robin! The history behind this is interesting. I agree that having it as optional behaviour is a good idea - the estimation is fast enough in smaller cases.
I'll put a pull request in later today.
Closed by #1369
Hi all, I've been trying out splink for some larger scale probabilistic linkages at the Australian Bureau of Statistics (along with @nerskin). I thought you'd be interested to hear about a potential efficiency improvement.
Is your proposal related to a problem?
The iterations of the EM algorithm are slower than they need to be (with some caveats).
Describe the solution you'd like
At the moment, the E step of each iteration runs predict on the entire blocked dataframe with the updated m and u parameters. If you ignore term frequency adjustments for the EM estimation, you can do all this work on an aggregated dataset which counts the number of unique agreement patterns (in SQL it's just a simple GROUP BY over all the gamma columns). The counting of the agreement patterns only needs to happen at the beginning of the EM estimation, rather than at each iteration. This is significantly more efficient, and the runtime depends only on the number of unique agreement patterns rather than the number of records in the dataset.

I've tested an implementation of this idea and it achieves the speedup I was expecting; you can check it out here: https://github.com/aymonwuolanne/splink/tree/EM-speedup.
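A toy sketch of the trick - the gamma vectors, m/u values, and lambda below are illustrative assumptions, not Splink's actual pipeline:

```python
from collections import Counter

# Hypothetical comparison vectors (gamma values) for blocked record pairs.
# In practice these would come from the blocked comparisons table.
gammas = [
    (1, 1, 0),
    (1, 1, 0),
    (0, 0, 0),
    (1, 0, 1),
    (0, 0, 0),
    (0, 0, 0),
]

# Aggregate once, before the EM loop: count each unique agreement pattern.
# SQL equivalent: SELECT gamma_1, gamma_2, gamma_3, COUNT(*) AS n
#                 FROM blocked_pairs GROUP BY gamma_1, gamma_2, gamma_3
pattern_counts = Counter(gammas)

# Illustrative m/u probabilities (one pair per comparison) and lambda.
m = [0.9, 0.8, 0.7]
u = [0.1, 0.2, 0.3]
lam = 0.5

def match_prob(pattern):
    """E-step posterior match probability for one agreement pattern."""
    pm, pu = lam, 1 - lam
    for gamma, m_i, u_i in zip(pattern, m, u):
        pm *= m_i if gamma else 1 - m_i
        pu *= u_i if gamma else 1 - u_i
    return pm / (pm + pu)

# Expected number of matches: one probability evaluation per *pattern*,
# weighted by its count, instead of one per record pair.
expected_matches = sum(n * match_prob(p) for p, n in pattern_counts.items())

# The per-record computation gives exactly the same total.
assert abs(expected_matches - sum(match_prob(g) for g in gammas)) < 1e-12
```

The M step works the same way: every sum over record pairs becomes a sum over patterns weighted by their counts, so each EM iteration touches only the (typically small) table of unique patterns.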
Describe alternatives you've considered
Additional context
This idea comes from the Fastlink paper (Enamorado, Fifield, and Imai 2019):
Note: this approach slightly changes the estimation of the m and u parameters when term frequency adjustments are used. Since the term frequency adjustment used in calculating the match probability depends on more than just the gamma columns, this trick does not apply exactly. However, the Fastlink paper notes that term frequency adjustments are intended to be an ex-post adjustment: the parameter estimation occurs without the adjustments applied, and the term frequency adjustments apply only to the final prediction step.
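A small illustration of why the trick breaks down with term frequency adjustments - the m/u values, term frequencies, and the simplified adjustment formula below are assumptions for the sketch, not Splink's actual implementation. Two pairs can share the same agreement pattern (gamma = 1 on surname) yet carry different evidence, so they can no longer be collapsed into one GROUP BY row:

```python
import math

# Assumed m/u probabilities for a single "surname" comparison.
m, u = 0.9, 0.05
# Assumed relative term frequencies of the agreeing surname tokens.
tf = {"smith": 0.02, "zeitouni": 0.0001}
AVG_TF = 0.01  # assumed average term frequency

def match_weight(token):
    """Simplified tf-adjusted log2 Bayes factor: scale u by how common
    the agreeing token is relative to the average token."""
    u_adjusted = u * tf[token] / AVG_TF
    return math.log2(m / u_adjusted)

# Same gamma pattern, different evidence once tf is taken into account:
w_common = match_weight("smith")     # frequent surname -> weaker evidence
w_rare = match_weight("zeitouni")    # rare surname -> stronger evidence
assert w_rare > w_common
```

Under the ex-post view this doesn't matter for estimation: m, u, and lambda are fit on the gamma patterns alone, and adjustments like the one above are applied only when scoring the final predictions.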
From what I've tested, the estimation of the parameters seems very close to the original method anyway.