moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] Allow exact or Bayesian pre-specification of m-probabilities for selected comparisons #2068

Closed: samkodes closed this issue 6 days ago

samkodes commented 5 months ago

Is your proposal related to a problem?

We may sometimes have prior knowledge about the m-probabilities for one or more comparisons in a model, but there is currently no way to specify this knowledge. We can set initial values for m-probabilities for EM, but have no control over where the EM will take them.

We may be absolutely certain about these m-probabilities, or have a fuzzier sense of their approximate value; in the latter case a Bayesian approach may be helpful.

This scenario is different from a fully-specified model in which we have exact knowledge of all m-probabilities. Here, I am assuming we only have partial knowledge, or knowledge only about some of them.

Describe the solution you'd like

For each comparison in a settings object, independently, allow the user to specify either: (1) exact m-probabilities for each comparison level, which will be held fixed during EM; or (2) a Dirichlet prior for the comparison (i.e. a distribution over the m-probabilities across all its levels), which will be updated during EM via the conjugate posterior updating rule, with the posterior mean used for predictions at each round. (The posterior mean, viewed as a categorical distribution, is exactly the posterior predictive distribution for a single observation of the Dirichlet-multinomial, so using it is not a simplification.) See https://github.com/moj-analytical-services/splink/discussions/2023#discussioncomment-8807967 for a sketch.
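
To make the conjugate update concrete, here is a minimal numeric sketch (plain Python, not splink's API; the numbers are made up): the Dirichlet prior acts as pseudo-counts that are added to the expected per-level match counts from an EM round, and the posterior mean gives the m-probabilities used for prediction.

# Sketch only: Dirichlet conjugate update for one comparison's m-probabilities.
# alpha: prior pseudo-counts, one per comparison level (hypothetical values)
# expected_counts: expected number of matching pairs in each level from the E-step
alpha = [8.0, 1.0, 1.0]
expected_counts = [40.0, 5.0, 5.0]

posterior = [a + n for a, n in zip(alpha, expected_counts)]
total = sum(posterior)
m_probabilities = [p / total for p in posterior]  # posterior mean, sums to 1
print(m_probabilities)  # [0.8, 0.1, 0.1]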

The Dirichlet prior approach is slightly "squishier" and allows prior knowledge to be overwhelmed / adjusted by observed data obtained through the EM process.

Describe alternatives you've considered

We can currently set initial values for m-probabilities for EM, but have no control over where the EM will take them.

Manually updating m-probabilities after model fitting for prediction is an alternative, but means that known values are not used in EM to inform the m-probabilities for other comparisons.

(It might be nice to specify m-probabilities directly for only a subset of levels. This is trivial for manual updating for the sake of prediction, because there is no enforcement of the rule that m-probabilities have to sum to 1. But I'm not sure how this could/should be done consistently in the EM because of the sum-to-1 constraint.)

Additional context

This may be helpful if multi-class models are adopted, to help distinguish classes. See https://github.com/moj-analytical-services/splink/discussions/2023#discussioncomment-8807967 for some details.

RobinL commented 5 months ago

Thanks - lots of interesting ideas here.

I'll comment here on the simpler part, because it should be relatively straightforward to implement: allowing the user to fix m or u probabilities on any ComparisonLevel so that they do not vary during EM training, giving some control over 'guiding' the EM.

It also feels like something that should be allowed - we've just never got around to implementing it.

In Splink 4, we have a new and more general syntax for configuring each ComparisonLevel, so it'd look something like this:

cll.ExactMatchLevel("hello").configure(
    m_probability=0.9, fix_m_during_training=True
)
{'sql_condition': '"hello_l" = "hello_r"',
 'label_for_charts': 'Exact match on hello',
 'm_probability': 0.9,
 'fix_m_during_training': True}
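
For context, a hypothetical sketch (plain Python, not splink code) of what the M-step could do with such a flag: levels whose m-probability is fixed keep their value, and the remaining probability mass is shared among the free levels in proportion to their expected counts, so the sum-to-1 constraint is preserved.

# Hypothetical sketch: M-step for one comparison where some levels'
# m-probabilities are held fixed during training.
def m_step_with_fixed_levels(expected_counts, fixed_m):
    """expected_counts: expected match counts per level from the E-step.
    fixed_m: dict mapping level index -> m-probability held constant."""
    fixed_mass = sum(fixed_m.values())
    free_idx = [i for i in range(len(expected_counts)) if i not in fixed_m]
    free_total = sum(expected_counts[i] for i in free_idx)
    new_m = [0.0] * len(expected_counts)
    for i, m in fixed_m.items():
        new_m[i] = m
    for i in free_idx:
        # free levels share the remaining mass in proportion to their counts
        new_m[i] = (1.0 - fixed_mass) * expected_counts[i] / free_total
    return new_m

print(m_step_with_fixed_levels([40.0, 5.0, 5.0], {0: 0.9}))  # [0.9, 0.05, 0.05]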

In terms of where to look in the codebase, you might have worked this out already, but relevant parts may be: m_probability and _populate_m_u_from_trained_values.

I agree the 'sum to 1' constraint is a potentially fiddly aspect to this!

samkodes commented 5 months ago

Thanks - this looks a lot quicker than the old way and makes it easy to extend with new options. It's probably OK to set m-probabilities individually and trust the user to make them sum to 1 - I'm not sure whether violating that would have any effect on EM.

How about a similar ability to set generic properties for the Comparison configurator? For example, a Dirichlet prior could be set this way... then all it would take for the Bayesian approach would be a few changes to compute_new_parameters_sql to implement the conjugate posterior updating rule (just add some extra "virtual observation" counts - i.e. the corresponding prior parameter - to each comparison level with a CASE statement).
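
A hypothetical Python analogue of that change (splink does this aggregation in SQL via compute_new_parameters_sql; this is only a sketch of the idea): start each level's expected match count from its prior pseudo-count rather than zero, then normalise as usual.

# Sketch only: fold Dirichlet "virtual observations" into the per-level
# aggregation that the M-step performs.
def m_step_with_dirichlet_prior(pairs, alpha):
    """pairs: (level_index, prob_is_match) for each record pair, from the E-step.
    alpha: Dirichlet prior pseudo-counts, one per comparison level."""
    counts = list(alpha)  # start from the virtual observations instead of zero
    for level, p_match in pairs:
        counts[level] += p_match  # expected match count contributed by this pair
    total = sum(counts)
    return [c / total for c in counts]  # posterior-mean m-probabilities

pairs = [(0, 0.99), (0, 0.95), (1, 0.60), (2, 0.10)]
print(m_step_with_dirichlet_prior(pairs, [2.0, 1.0, 1.0]))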

samkodes commented 5 months ago

I came across an alternative that may also be useful and should be relatively easy to implement. The approach is in an old paper by Winkler, Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage (1993).

The basic idea is to allow the user to specify a set of convex constraints on the parameters to be estimated by EM. While the idea was driven by experiments with non-independent (e.g. log-linear) models, such constraints could also be applied to simpler independent models. An example of a convex constraint that could easily be specified, and which would be useful, is a set of linear inequalities on the m-probabilities; for example, requiring the "all others" m-probabilities for some comparisons to be < 20%.

The constraint is enforced during each round of EM by checking that the new estimate for each parameter lies in the allowed region. If not, we choose a point on the line segment connecting it to the previous estimate; for the likelihoods under consideration, a theorem guarantees that the likelihood at any point on that segment is greater than at the previous estimate. For simplicity, we can choose the point where the segment crosses the boundary of the allowed region. Since the likelihood still increases, EM should still converge under this process.

The constraint checking and line search for the boundary of the region could be done pretty easily in Python in the EM loop.
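
As an illustration of that check-and-project step (plain Python, not splink code, with a made-up constraint), assuming the previous estimate is feasible: if the new EM estimate leaves the allowed region, bisect along the segment between the two estimates to find (approximately) the point on the boundary.

# Sketch only: project a new EM estimate back along the segment towards the
# previous estimate when it leaves a convex allowed region (Winkler 1993 idea).
def project_to_region(prev, new, in_region, n_steps=30):
    """prev is assumed feasible; bisect on t in prev + t * (new - prev)."""
    if in_region(new):
        return new
    lo, hi = 0.0, 1.0  # t = 0 is prev (feasible), t = 1 is new (infeasible)
    for _ in range(n_steps):
        t = (lo + hi) / 2.0
        candidate = [p + t * (n - p) for p, n in zip(prev, new)]
        if in_region(candidate):
            lo = t
        else:
            hi = t
    return [p + lo * (n - p) for p, n in zip(prev, new)]

# Example constraint: the "all others" level's m-probability must be <= 0.2
in_region = lambda m: m[2] <= 0.2
print(project_to_region([0.7, 0.2, 0.1], [0.5, 0.2, 0.3], in_region))
# approximately [0.6, 0.2, 0.2]; both endpoints sum to 1, so the result does too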