moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

threshold_match_weight in linker.predict() not checked when equal to 0 #1716

Closed. medwar99 closed this issue 5 days ago

medwar99 commented 10 months ago

What happens?

A weird edge case: when calling linker.predict with threshold_match_weight=0 to remove negatively weighted records, the threshold isn't applied and all records are returned. There is a truthiness check on threshold_match_weight, but it should instead check that threshold_match_weight is not None, to handle the case where it is set to zero (which is falsy in Python).

https://github.com/moj-analytical-services/splink/blob/868fefba022483f4cb0d05822aa55d62ee9b92fa/splink/predict.py#L58
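The pattern can be illustrated with a small standalone sketch (hypothetical helper names, not Splink's actual code): a bare truthiness check silently skips the filter when the threshold is 0, whereas an explicit None check applies it.

```python
def apply_threshold_buggy(weights, threshold_match_weight=None):
    # 0 is falsy, so the filter is skipped when the threshold is 0
    if threshold_match_weight:
        return [w for w in weights if w >= threshold_match_weight]
    return weights

def apply_threshold_fixed(weights, threshold_match_weight=None):
    # Explicit None check: a threshold of 0 is still applied
    if threshold_match_weight is not None:
        return [w for w in weights if w >= threshold_match_weight]
    return weights

weights = [-2.5, -0.1, 0.0, 1.3]
apply_threshold_buggy(weights, 0)  # returns all four weights
apply_threshold_fixed(weights, 0)  # returns [0.0, 1.3]
```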

To Reproduce

A rough-and-ready example; the last two lines are the ones of interest.

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl

df = splink_datasets.historical_50k

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("full_name"),
        block_on(["substr(full_name,1,6)", "dob", "birth_place"]),
        block_on(["dob", "birth_place"]),
        block_on("postcode_fake"),
    ],
    "comparisons": [
        cl.jaro_at_thresholds("full_name", [0.9, 0.7], term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("dob", [1, 2]),
        cl.levenshtein_at_thresholds("postcode_fake", 2),
        cl.jaro_winkler_at_thresholds("birth_place", 0.9, term_frequency_adjustments=True),
        cl.exact_match("occupation", term_frequency_adjustments=True),
    ],
}

linker = DuckDBLinker(df, settings)
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
linker.estimate_u_using_random_sampling(max_pairs=2e7)

results = linker.predict(threshold_match_weight=0) # set threshold weight to 0

assert results.as_pandas_dataframe()['match_weight'].min() >= 0 # Fails if there are still negative weights
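Until the check is fixed, a workaround is to filter the predictions dataframe manually after calling predict. A minimal sketch with synthetic data (real weights would come from results.as_pandas_dataframe()):

```python
import pandas as pd

# Simulated predict() output; in practice this would be
# results.as_pandas_dataframe()
df_pred = pd.DataFrame({"match_weight": [-2.5, -0.1, 0.0, 1.3]})

# Apply the zero threshold by hand
df_pred = df_pred[df_pred["match_weight"] >= 0]
```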

OS:

Windows 10

Splink version:

3.9.8

Have you tried this on the latest master branch?


ADBond commented 10 months ago

Thanks for the report! If you're happy to, feel free to put in a PR with the fix for this. Otherwise someone from the team will look into it when there's some space to do so.

ADBond commented 5 days ago

Closed by #2152