Closed ADBond closed 2 years ago
Please do let me know if I've made any errors, or have misunderstood anything/said anything misleading. I tried to avoid making it excessively detailed/large, but I realise there is a reasonable amount in this, so happy to trim down if it's too dense - just thought it might be useful to display a few different features/concepts in additional ways to how they are presented elsewhere.
(also I realise this is currently lacking output, which I believe I should be able to remedy later)
This is brilliant work.
I noticed a very small typo but other than that good to merge!
the aim here is that we generate very true false positives
Can we add a new section to 00_Tutorial_Introduction to cover all of these new demos?
End-to-end Demos ☝️or something along those lines.
I can't actually add comments directly in the `.ipynb` file as it's too large... so hopefully these random comments make sense.
☝️
For variables that aren't used in the m-training blocking rules, we have two estimates --- one from each of the training sessions (see `street_number`, for example)
Maybe link through to Robin's articles in case someone is interested in what you mean by "... prevalence of coincidences and mistakes". Articles
Spelling error here:
Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with falst positives.
Chart is broken.
I've noticed something similar happening with another graph and I'm trying to diagnose the issue... not entirely sure what's wrong with it though :(
@ThomasHepworth Ah yep, I noticed that also but it slipped my mind. I think the issue (in this case) is that when the % of unlinkables is functionally zero, `unlinkables_data` returns an empty list, so there is no data to graph.
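A rough workaround is something like:

```python
# Assumes unlinkables_data and unlinkables_chart (splink internals) are already imported
def new_ul_chart(linker, x_col="match_weight", source_dataset=None):
    records = unlinkables_data(linker, x_col)
    if not records:
        # Nothing unlinkable: substitute all-zero rows so the chart still has data to draw
        records = [{"match_weight": mw, "match_probability": 1 / (1 + 2 ** (-mw)), "prop": 0, "cum_prop": 0} for mw in range(-9, 9)]
    return unlinkables_chart(records, x_col, source_dataset)
```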
but it is a bit of a hack
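For anyone reading along, the placeholder rows in that workaround rely on the usual conversion from a (log2) match weight to a match probability, `1 / (1 + 2 ** (-mw))`. Below is a minimal standalone sketch of just that part, with no splink dependency. The helper names `match_weight_to_probability` and `placeholder_unlinkables_records` are made up for illustration and are not part of splink.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2 match weight into a match probability."""
    return 1 / (1 + 2 ** (-match_weight))


def placeholder_unlinkables_records(low=-9, high=9):
    """Build all-zero rows (hypothetical helper) to chart when no data is returned.

    The keys mirror the ones used in the workaround above; whether a chart
    needs exactly these keys is an assumption taken from that snippet.
    """
    return [
        {
            "match_weight": mw,
            "match_probability": match_weight_to_probability(mw),
            "prop": 0,
            "cum_prop": 0,
        }
        for mw in range(low, high)
    ]


records = placeholder_unlinkables_records()
# A match weight of 0 corresponds to even odds, i.e. a probability of 0.5
print(len(records), records[9])
```

The range and the zeroed `prop`/`cum_prop` values are arbitrary filler, so this only keeps the chart from rendering empty rather than conveying any real information.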
An example notebook linking the febrl4 dataset tables, including a sketch of a typical workflow, using several splink features, a slight comparison of different models, and a reasonable amount of commentary.