moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

`debug_mode` breaks training workflow #2428

Open ADBond opened 3 weeks ago

ADBond commented 3 weeks ago

If we turn on `debug_mode`, u-training breaks, but only if we have already run `estimate_probability_two_random_records_match`. We get `Binder Error: Table "l" does not have a column named "__splink_salt"`:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
db_api.debug_mode = True
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
```

This is to do with us materialising a table when salting wasn't required, but then trying to access it when salting is required.
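To make that concrete, here is a minimal, purely hypothetical sketch of the pattern (the `cache` dict and `materialise` helper are illustrative stand-ins, not splink's actual internals): a table is materialised and cached at a point where no salt is needed, and a later step that does need salting gets the cached, unsalted table handed back.

```python
# Hypothetical sketch of the materialise-then-reuse problem; none of these
# names are splink's real internals.
import duckdb

con = duckdb.connect()
cache: dict[str, str] = {}  # templated name -> physical table name


def materialise(templated_name: str, sql: str) -> str:
    """Create the physical table on first request; afterwards return the cached one."""
    if templated_name in cache:
        # Cache hit: reuse whatever was materialised earlier, regardless of
        # whether it has the columns the current step needs.
        return cache[templated_name]
    physical_name = f"{templated_name}_materialised"
    con.execute(f"CREATE TABLE {physical_name} AS {sql}")
    cache[templated_name] = physical_name
    return physical_name


con.execute("CREATE TABLE input_data AS SELECT 'john' AS first_name")

# Step 1 (think estimate_probability_two_random_records_match): salting is not
# needed, so the table is built *without* a __splink_salt column.
tbl = materialise("df_concat", "SELECT first_name FROM input_data")

# Step 2 (think estimate_u_using_random_sampling): salting *is* needed, but the
# cache hands back the unsalted table from step 1.
tbl = materialise(
    "df_concat", "SELECT first_name, random() AS __splink_salt FROM input_data"
)

try:
    con.execute(f"SELECT l.__splink_salt FROM {tbl} AS l").fetchall()
except duckdb.Error as e:
    print(e)  # Binder Error: the cached table has no __splink_salt column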

I have a weird sense of déjà vu with this - maybe I came across it another time in the last few weeks? But I couldn't find any other issue, so I'm opening one here.

RobinL commented 2 weeks ago

Took a quick look and haven't got anywhere yet, but this is probably useful when we get more time to have a proper look:

Script to help understand debug mode to pinpoint issues:

```python
import logging

import duckdb

from splink import DuckDBAPI
from splink.internals.duckdb.dataframe import DuckDBDataFrame
from splink.internals.pipeline import CTEPipeline

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
db_api.debug_mode = True

# logging.basicConfig(format="%(message)s")
# logging.getLogger("splink").setLevel(1)

sql = """
CREATE TABLE input_data_abc AS
SELECT 'john' AS first_name, 'doe' AS surname
UNION ALL
SELECT 'jane', 'smith'
UNION ALL
SELECT 'jim', 'johnson'
"""
con.execute(sql)

input_sdf_1 = DuckDBDataFrame(
    templated_name="input_data", physical_name="input_data_abc", db_api=db_api
)

pipeline = CTEPipeline([input_sdf_1])

sql = f"""
select upper(first_name) as first_name
from {input_sdf_1.templated_name}
"""
pipeline.enqueue_sql(sql, "input_upper")

# In debug mode, this is run and an output table called 'input_upper'
# is created.
sql = """
select substr(first_name, 1, 2) as first_name_substr
from input_upper
"""
pipeline.enqueue_sql(sql, "__splink__output_table")

output_sdf = db_api.sql_pipeline_to_splink_dataframe(pipeline)
output_sdf.as_duckdbpyrelation()
```
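As the inline comments note, with `db_api.debug_mode = True` each enqueued SQL step should be run and materialised as its own physical table (e.g. `input_upper`) rather than the pipeline executing as a single query, which is the behaviour that interacts with the caching described above.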
ADBond commented 2 weeks ago

IIRC the issue is basically this:

I think the actual issue should be relatively easy to sort out, but I wonder if a wider change might be warranted. Maybe we should change how (or if) caching works when we're in debug mode.
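One possible shape for that wider change, sketched very loosely (the function and argument names here are made up for illustration and are not splink's API), would be to skip cache lookups whenever debug mode is on:

```python
# Illustrative only: a cache lookup that is bypassed in debug mode, so every
# step is rebuilt with exactly the columns the current pipeline asked for.
def lookup_cached_table(cache: dict, key: str, debug_mode: bool):
    """Return a previously materialised table name, unless we're debugging."""
    if debug_mode:
        # Trade speed for predictability: never reuse a table that may have
        # been materialised under different requirements (e.g. without a
        # __splink_salt column).
        return None
    return cache.get(key)


cache = {"df_concat": "df_concat_materialised"}
print(lookup_cached_table(cache, "df_concat", debug_mode=False))  # df_concat_materialised
print(lookup_cached_table(cache, "df_concat", debug_mode=True))   # None -> rebuild
```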

RobinL commented 2 weeks ago

One thing (probably separate from that specific issue) is that I think we should be outputting the intermediate tables with a unique name (e.g. with an ascii uid suffix), and then passing them into the next step. It doesn't feel very robust to physically instantiate tables with very generic names in the db.
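A rough sketch of that idea, using plain `uuid` rather than any splink helper (the `unique_physical_name` function is hypothetical): give each intermediate debug table a unique physical name and pass that name explicitly into the next step's SQL.

```python
import uuid

import duckdb


def unique_physical_name(templated_name: str) -> str:
    """Append a short random suffix so intermediate tables never collide with
    (or silently shadow) tables left over from earlier runs."""
    return f"{templated_name}_{uuid.uuid4().hex[:8]}"


con = duckdb.connect()
con.execute("CREATE TABLE input_data AS SELECT 'john' AS first_name")

# Step 1: materialise under a unique physical name...
step1 = unique_physical_name("input_upper")
con.execute(
    f"CREATE TABLE {step1} AS SELECT upper(first_name) AS first_name FROM input_data"
)

# ...and pass that name explicitly into the next step's SQL, rather than
# relying on a generically named 'input_upper' table existing in the db.
step2 = unique_physical_name("first_name_substr")
con.execute(
    f"CREATE TABLE {step2} AS SELECT substr(first_name, 1, 2) AS first_name_substr FROM {step1}"
)

print(con.execute(f"SELECT * FROM {step2}").fetchall())  # [('JO',)]
```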