moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] EM algorithm performance improvement #1363

Closed · aymonwuolanne closed this 1 year ago

aymonwuolanne commented 1 year ago

Hi all, I've been trying out splink for some larger scale probabilistic linkages at the Australian Bureau of Statistics (along with @nerskin). I thought you'd be interested to hear about a potential efficiency improvement.

Is your proposal related to a problem?

The iterations of the EM algorithm are slower than they need to be (with some caveats).

Describe the solution you'd like

At the moment, the E step of each iteration runs predict on the entire blocked dataframe with the updated m and u parameters. If you ignore term frequency adjustments for the EM estimation, you can do all of this work on an aggregated dataset that counts the number of unique agreement patterns (in SQL it's just a simple GROUP BY over all the gamma columns). The counting of the agreement patterns only needs to happen once, at the beginning of the EM estimation, rather than at each iteration. This is significantly more efficient: the runtime depends only on the number of unique agreement patterns rather than the number of records in the dataset.
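To illustrate the aggregation step, here is a minimal sketch using sqlite3. The table and column names are illustrative assumptions, not Splink's actual internal schema; the point is just that one GROUP BY over the gamma columns collapses the blocked comparisons down to distinct agreement patterns with counts.

```python
import sqlite3

# Hypothetical blocked-comparisons table with two gamma columns
# (names are assumptions for illustration, not Splink's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocked (gamma_name INTEGER, gamma_dob INTEGER)")
rows = [(2, 1), (2, 1), (0, 0), (2, 0), (0, 0), (2, 1)]
conn.executemany("INSERT INTO blocked VALUES (?, ?)", rows)

# Count each distinct agreement pattern once, before the EM loop starts.
# Every subsequent EM iteration works on these few rows, not the full table.
patterns = conn.execute(
    "SELECT gamma_name, gamma_dob, COUNT(*) AS n "
    "FROM blocked GROUP BY gamma_name, gamma_dob"
).fetchall()
```

Here six record pairs collapse to three (pattern, count) rows, and in realistic linkages the reduction is far larger.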

I've tested an implementation of this idea and it achieves the speedup I was expecting; you can check it out here: https://github.com/aymonwuolanne/splink/tree/EM-speedup.

Describe alternatives you've considered

Additional context

This idea comes from the Fastlink paper (Enamorado, Fifield, and Imai 2019):

First, for implementing the E step, notice that the match probability given in equation (5) takes the same values for two pairs if their agreement patterns are identical... Thus, the E step can be implemented by computing the match probability for each of the realized agreement patterns. Often, the total number of realized agreement patterns is much smaller than the number of all possible agreement patterns.

Second, the M step defined in equations (S1) and (S2) requires the summation of match probabilities across all pairs of their subset. Because this probability is identical within each agreement pattern, all we have to do is to count the total number of pairs that have each agreement pattern.
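The two quoted observations can be sketched as a toy EM loop over aggregated counts. This is a simplified illustration under assumed notation (lam for the match proportion, m and u as lists of agreement probabilities per comparison column), not Splink's or fastLink's actual implementation; it assumes binary gamma values for brevity.

```python
from collections import Counter

# Agreement patterns with counts, computed once up front (toy data).
pattern_counts = Counter({(1, 1): 50, (1, 0): 200, (0, 1): 150, (0, 0): 9600})

# Hypothetical starting parameters (assumptions, not Splink defaults).
lam = 0.1        # P(match)
m = [0.9, 0.8]   # P(gamma_k = 1 | match)
u = [0.1, 0.2]   # P(gamma_k = 1 | non-match)

for _ in range(20):
    # E step: the match probability is identical for all pairs sharing an
    # agreement pattern, so compute it once per distinct pattern.
    resp = {}
    for pattern in pattern_counts:
        pm, pu = lam, 1 - lam
        for k, g in enumerate(pattern):
            pm *= m[k] if g else 1 - m[k]
            pu *= u[k] if g else 1 - u[k]
        resp[pattern] = pm / (pm + pu)

    # M step: sums over all pairs become sums over patterns,
    # weighted by the pattern counts.
    total = sum(pattern_counts.values())
    matches = sum(resp[p] * n for p, n in pattern_counts.items())
    lam = matches / total
    for k in range(len(m)):
        m[k] = sum(resp[p] * n for p, n in pattern_counts.items() if p[k]) / matches
        u[k] = sum((1 - resp[p]) * n
                   for p, n in pattern_counts.items() if p[k]) / (total - matches)
```

Each iteration touches only the four distinct patterns rather than the 10,000 underlying pairs, which is the entire source of the speedup.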

Note: this approach slightly changes the estimation of the m and u parameters when term frequency adjustments are used. Since the term frequency adjustment used in calculating the match probability depends on more than just the gamma columns, this trick does not apply in that case. However, the Fastlink paper describes term frequency adjustments as an ex-post adjustment: the parameter estimation occurs without the adjustments applied, and the term frequency adjustments are applied only at the final prediction step.

From what I've tested, the estimation of the parameters seems very close to the original method anyway.

RobinL commented 1 year ago

Thanks so much for this, it looks great! Really interesting idea to use this for efficiency at the EM step.

I'd certainly be interested in including this in Splink - would you consider working with us to do a PR? In particular, if you can PR in a working version, I'm happy to add in tests and think about whether we need to make any adjustments to the spark caching/execution/materialisation pipeline. (Of course, you're welcome to do this if you'd like!!)

Let me try and summarise some thoughts on the (fairly long!) history of this, and how we could include it in Splink

History

Thoughts on including into Splink

I would therefore suggest that this efficiency improvement is included as an option that is off by default - i.e. an argument to the estimate_parameters_using_expectation_maximisation function, which the user can set to true if they're experiencing slow runtimes.

We could also write logic to enable it by default when the user is running a model that has no tf adjustments.

Does that make sense to you?

Finally, note that this should be quite easy to write tests for.

aymonwuolanne commented 1 year ago

Thanks for the detailed response Robin! The history behind this is interesting. I agree that having it as optional behaviour is a good idea - the estimation is fast enough in smaller cases.

I'll put a pull request in later today.

RobinL commented 1 year ago

Closed by #1369