moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Clustering gives the wrong answer if debug_mode is on #2480

Open ADBond opened 1 week ago

ADBond commented 1 week ago

If we have debug_mode on when we cluster, we get incorrect results. Most nodes are missing, and the nodes that we do have each appear twice.

See:

import duckdb
import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("first_name", "surname"),
    ],
    retain_intermediate_calculation_columns=True,
)

# Run the full pipeline twice: once with debug_mode off, once with it on
for debug_mode in (False, True):
    con = duckdb.connect()
    db_api = DuckDBAPI(con)

    linker = Linker(df, settings, db_api)
    db_api.debug_mode = debug_mode

    df_e = linker.inference.predict(0.8)
    df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(df_e, 0.9)

    con.sql(
        f"SELECT count(*) cnt, cluster_id FROM {df_c.physical_name} GROUP BY cluster_id ORDER BY cnt DESC"
    ).show()

    df_c_ordered = db_api.sql_to_splink_dataframe_checking_cache(
        f"SELECT * FROM {df_c.physical_name} ORDER BY cluster_id, unique_id",
        "clusters_ordered",
        use_cache=False,
    )

    df_c_ordered.to_csv(f"tmp_clusters_{debug_mode}.csv", overwrite=True)

Splink 4.0.4

I haven't yet looked into what is causing this.

ADBond commented 1 week ago

It seems to have something to do with that dataset, as things appear fine if we use splink_datasets.fake_1000. Or quite possibly I am doing something stupid.

ADBond commented 3 days ago

This was introduced since 4.0.2.

ADBond commented 3 days ago

Alright, so I think the issue here is the reuse of templated names versus physical names. It only affects clustering runs with multiple iterations. So instead of SQL like:

select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_16f74d750
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_f92f88f19
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_67585d97a
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable_8c1eef1c6
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__df_representatives_4_6f14fb719

in debug mode we instead get

select node_id as node_id, representative as cluster_id
from __splink__representatives_stable
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__representatives_stable
UNION ALL
select node_id as node_id, representative as cluster_id
from __splink__df_representatives_4

So instead of collecting the stable nodes from every iteration, each branch of the UNION ALL resolves to the same final table, and we end up with (number of iterations) copies of the final set of stable nodes.
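The mechanism above can be sketched with a minimal, self-contained example (using the stdlib's sqlite3 rather than Splink's actual pipeline, and made-up table names standing in for the hashed physical names):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Simulate the stable-node output of two clustering iterations,
# each materialised under its own physical table name.
con.execute("CREATE TABLE representatives_stable_iter1 (node_id INTEGER)")
con.execute("CREATE TABLE representatives_stable_iter2 (node_id INTEGER)")
con.execute("INSERT INTO representatives_stable_iter1 VALUES (1), (2)")
con.execute("INSERT INTO representatives_stable_iter2 VALUES (3)")

# Correct behaviour: the UNION ALL references each iteration's
# distinct physical table once, so all stable nodes are collected.
correct = con.execute(
    "SELECT node_id FROM representatives_stable_iter1 "
    "UNION ALL "
    "SELECT node_id FROM representatives_stable_iter2 "
    "ORDER BY node_id"
).fetchall()

# Buggy behaviour: every branch of the UNION ALL resolves the
# templated name to the same (last-written) table, so earlier
# iterations' nodes are lost and the final set appears N times.
buggy = con.execute(
    "SELECT node_id FROM representatives_stable_iter2 "
    "UNION ALL "
    "SELECT node_id FROM representatives_stable_iter2 "
    "ORDER BY node_id"
).fetchall()

print(correct)  # [(1,), (2,), (3,)]
print(buggy)    # [(3,), (3,)]
```

This matches the observed symptom: most nodes missing, and the surviving nodes duplicated once per extra iteration.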