moj-analytical-services / splink_demos

Interactive notebooks containing demonstration code of the splink library

+ example linking febrl4 (a and b) datasets #73

Closed ADBond closed 2 years ago

ADBond commented 2 years ago

An example notebook linking the febrl4 dataset tables, including a sketch of a typical workflow, using several splink features, a slight comparison of different models, and a reasonable amount of commentary.

ADBond commented 2 years ago

Please do let me know if I've made any errors, misunderstood anything, or said anything misleading. I tried to avoid making it excessively detailed or large, but I realise there is a fair amount in this, so I'm happy to trim it down if it's too dense. I just thought it might be useful to show a few different features and concepts in ways that complement how they are presented elsewhere.

ADBond commented 2 years ago

(also I realise this is currently lacking output, which I believe I should be able to remedy later)

RobinL commented 2 years ago

This is brilliant work.

I noticed a very small typo, but other than that, good to merge! The typo is in: "the aim here is that we generate very true false positives"

ThomasHepworth commented 2 years ago

Can we add a new section to 00_Tutorial_Introduction to cover all of these new demos?

End-to-end Demos ☝️ or something along those lines.

ThomasHepworth commented 2 years ago

I can't actually add comments directly on the .ipynb file as it's too large... so hopefully these random comments make sense.

Screenshot 2022-10-07 at 10 53 09 ☝️

For variables that aren't used in the m-training blocking rules, we have two estimates, one from each of the training sessions (see `street_number`, for example).
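
To make the point about two estimates concrete, here is a minimal sketch in plain Python (not splink code). It assumes two training sessions, each blocking on a single column; the session names, the helper `count_m_estimates`, and every column other than `street_number` are made up for illustration and may not match the notebook's actual blocking rules.

```python
# Hypothetical training sessions: each maps to the columns used in its
# blocking rule. Only street_number comes from the discussion above;
# the other column choices are illustrative assumptions.
training_sessions = {
    "session_1": {"date_of_birth"},
    "session_2": {"surname"},
}
comparison_columns = ["surname", "date_of_birth", "street_number"]


def count_m_estimates(columns, sessions):
    """Count, per column, the sessions that can produce an m estimate.

    A session cannot estimate m for a column it blocks on, because every
    compared pair already agrees on that column.
    """
    return {
        col: sum(col not in blocked for blocked in sessions.values())
        for col in columns
    }


estimate_counts = count_m_estimates(comparison_columns, training_sessions)
print(estimate_counts)
```

Under these assumptions, a column that appears in neither blocking rule, such as `street_number`, is the only one counted twice; each blocked column is estimated only in the session that does not block on it.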

ThomasHepworth commented 2 years ago

Screenshot 2022-10-07 at 10 55 06

Chart is broken.

I've noticed something similar happening with another graph and I'm trying to diagnose the issue... not entirely sure what's wrong with it though :(

ThomasHepworth commented 2 years ago

Screenshot 2022-10-07 at 10 58 07

Maybe link through to Robin's articles in case someone is interested in what you mean by ... prevalence of coincidences and mistakes. Articles

ThomasHepworth commented 2 years ago

Spelling error here: Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with falst positives.

ADBond commented 2 years ago

> Chart is broken.
>
> I've noticed something similar happening with another graph and I'm trying to diagnose the issue... not entirely sure what's wrong with it though :(

@ThomasHepworth Ah yep, I noticed that too, but it slipped my mind. I think the issue (in this case) is that when the proportion of unlinkable records is effectively zero, `unlinkables_data` returns an empty list, so there is no data to chart.

A rough workaround is something like:

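def new_ul_chart(linker, x_col="match_weight", source_dataset=None):
    # Assumes unlinkables_data and unlinkables_chart are already imported from splink
    records = unlinkables_data(linker, x_col)
    if not records:
        # Nothing is unlinkable: plot an all-zero placeholder series instead
        records = [
            {"match_weight": mw, "match_probability": 1 / (1 + 2 ** (-mw)), "prop": 0, "cum_prop": 0}
            for mw in range(-9, 9)
        ]
    return unlinkables_chart(records, x_col, source_dataset)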

but it is a bit of a hack
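
The placeholder records in the workaround can be built and checked without splink. The sketch below isolates that step: it converts each match weight (a log2 odds value) to a match probability with the same expression used above, and fills `prop` and `cum_prop` with zeros. The helper names `match_weight_to_probability` and `placeholder_unlinkables_records` are hypothetical and not part of splink; the record keys are taken from the workaround.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2 match weight into a match probability."""
    return 1 / (1 + 2 ** (-match_weight))


def placeholder_unlinkables_records(min_weight=-9, max_weight=9):
    """Build all-zero records so an empty result still has something to plot.

    Weights run from min_weight up to, but not including, max_weight,
    mirroring range(-9, 9) in the workaround.
    """
    return [
        {
            "match_weight": mw,
            "match_probability": match_weight_to_probability(mw),
            "prop": 0,
            "cum_prop": 0,
        }
        for mw in range(min_weight, max_weight)
    ]


records = placeholder_unlinkables_records()
print(len(records), records[0], records[-1])
```

A match weight of 0 corresponds to even odds, so it maps to a probability of 0.5, which is a quick sanity check on the conversion.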