moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.27k stars 145 forks

`NaN` trained values can break `predict()` #2334

Open ADBond opened 4 weeks ago

ADBond commented 4 weeks ago

Training can leave an m-value set to NaN. This then breaks the SQL that is generated in `.predict()`.

Here is an inelegant reprex, very similar to the one in #2333 (although note that we add slightly fewer values to the new column than we do there):

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

settings = SettingsCreator(
    "dedupe_only",
    comparisons=[
        cl.LevenshteinAtThresholds("first_name"),
        cl.LevenshteinAtThresholds("surname"),
        cl.ExactMatch("city"),
        cl.LevenshteinAtThresholds("dob"),
        cl.LevenshteinAtThresholds("email"),
        cl.ExactMatch("cluster"),
        cl.ExactMatch("cluster_1"),
        cl.ExactMatch("cluster_2"),
        cl.ExactMatch("cluster_3"),
        cl.ExactMatch("non_match_cat"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("city"),
    ]
)

df = splink_datasets.fake_1000
df["cluster_1"] = df["cluster"]
df["cluster_2"] = df["cluster"]
df["cluster_3"] = df["cluster"]

# specially chosen non-matchy things
df["non_match_cat"] = None
# we add fewer values than in issue #2333
cats = {
    263: 6,
    273: 6,
    500: 7,
    729: 7,
}
for id_n, cat in cats.items():
    # single .loc assignment instead of chained indexing, which pandas may apply to a copy
    df.loc[df["unique_id"] == id_n, "non_match_cat"] = cat

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    block_on("first_name", "surname", "dob"), recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e8)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("cluster"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("city"))

linker.misc.save_model_to_json("nan-model.json", overwrite=True)

linker.inference.predict()

This fails with an error such as `Referenced column "nan" not found in FROM clause!`, much as in #852.
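One plausible reading of that error message, sketched below, is that the trained value gets interpolated into the SQL text as a Python float. This is an assumption on my part rather than something checked against the Splink source, and `render_level_sql` is a hypothetical helper, not Splink's actual SQL generation. The point is only that a NaN float formats as the bare token `nan`, which a SQL engine would then try to resolve as a column name.

```python
def render_level_sql(m_probability: float, u_probability: float) -> str:
    """Hypothetical helper: build a Bayes-factor expression by string interpolation.

    This is NOT Splink's real SQL, just an illustration of the failure mode.
    """
    return (
        f"SELECT CAST({m_probability} AS DOUBLE) / "
        f"CAST({u_probability} AS DOUBLE) AS bayes_factor"
    )


# A finite m-value renders as a numeric literal.
ok_sql = render_level_sql(0.9, 0.1)

# A NaN m-value renders as the bare identifier `nan`, which is not a
# numeric literal in SQL and so gets looked up as a column reference.
nan_sql = render_level_sql(float("nan"), 0.1)

print(ok_sql)
print(nan_sql)
```

If this is what is happening, the fix would belong either where the m-values are trained (never store NaN) or where they are rendered into SQL (reject or special-case non-finite values).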

ADBond commented 4 weeks ago

In fact, the same error can be triggered during training by replacing `linker.inference.predict()` in the example above with `linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))`.

@lamaeldo - possibly related to the issue in your comment on the above-mentioned issue
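For anyone wanting to check whether a saved model is affected, here is a minimal sketch that walks a loaded JSON document and reports where NaN values sit. It assumes the saved file contains the bare `NaN` token, which is what Python's `json` module writes by default. The `model_text` below is a hand-written stand-in, not the real contents of `nan-model.json`, and the key names in it are assumed for illustration.

```python
import json
import math


def find_nan_paths(node, path="$"):
    """Return the JSON-style path of every float NaN in a nested dict/list."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_nan_paths(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_nan_paths(value, f"{path}[{index}]"))
    elif isinstance(node, float) and math.isnan(node):
        hits.append(path)
    return hits


# Illustrative stand-in for a saved model; key names are assumed.
model_text = """
{"comparisons": [{"comparison_levels": [
    {"m_probability": NaN, "u_probability": 0.01},
    {"m_probability": 0.2, "u_probability": 0.99}
]}]}
"""

# json.loads accepts the non-standard NaN token by default.
model = json.loads(model_text)
nan_paths = find_nan_paths(model)
print(nan_paths)
```

To run it against a real file, swap the `json.loads(model_text)` line for `json.load` on the opened model file.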