moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

With Splink 4.0.5, cluster_pairwise_predictions_at_threshold(...) throws an error if df_predict is empty #2506

Closed thibault1024 closed 2 weeks ago

thibault1024 commented 3 weeks ago

What happens?

First of all, thanks a lot for your work on Splink!

I am a Splink aficionado and I am currently updating my Splink-based application from 3.9.14 to 4.0.5.

The update went smoothly, except for one difference in behavior I noticed in the cluster_pairwise_predictions_at_threshold(...) method.

In Splink 3, when this method is called with an empty df_predict dataframe, each input record is returned as its own one-item cluster.

In Splink 4, when this method is called with an empty df_predict dataframe, the following error occurs:

Traceback (most recent call last):
  File "/test.py", line 32, in test
    df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
  File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/linker_components/clustering.py", line 92, in cluster_pairwise_predictions_at_threshold
    c.unquote().name for c in df_predict.columns
  File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/duckdb/dataframe.py", line 23, in columns
    d = self.as_record_dict(1)[0]
IndexError: list index out of range

Let me know if more information is needed.
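
Judging only from the traceback, the column list seems to be read from the first returned record, which cannot work when the result has no rows. Here is a minimal sketch of that failure mode in plain Python, independent of Splink. The function names are hypothetical and only illustrate the pattern; this is not Splink's actual code.

```python
def columns_from_first_record(records):
    # Same pattern as in the traceback: index the first record, then read its keys.
    # Raises IndexError when `records` is empty.
    return list(records[0].keys())


def columns_or_empty(records):
    # Hypothetical guarded variant: return no columns when there are no rows.
    return list(records[0].keys()) if records else []


try:
    columns_from_first_record([])
except IndexError as exc:
    error_message = str(exc)

print(error_message)
print(columns_or_empty([]))
```

With an empty list, the first function fails with the same IndexError as in the traceback, while the guarded variant returns an empty list.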

To Reproduce

Using Splink 3.9.14

import pandas as pd
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_template_library as ctl

data = [
    {"unique_id": 1, "name": "Aaaaaaaaa"},
    {"unique_id": 2, "name": "Bbbbbbbbb"},
    {"unique_id": 3, "name": "Ccccccccc"},
]
df = pd.DataFrame.from_dict(data)

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [block_on(["name"])],
    "comparisons": [
        ctl.name_comparison("name", term_frequency_adjustments=True),
    ],
}
linker = DuckDBLinker(df, settings)

df_predict = linker.predict(threshold_match_probability=0.1)
print(df_predict.as_record_dict())
# Prints "[]"

df_cluster = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.1
)
print(df_cluster.as_record_dict())
# Prints "[{'cluster_id': 1, 'unique_id': 1, 'name': 'Aaaaaaaaa', '__splink_salt': 0.11938597843982279, 'tf_name': 0.3333333333333333}, {'cluster_id': 2, 'unique_id': 2, 'name': 'Bbbbbbbbb', '__splink_salt': 0.8574480421375483, 'tf_name': 0.3333333333333333}, {'cluster_id': 3, 'unique_id': 3, 'name': 'Ccccccccc', '__splink_salt': 0.9847083757631481, 'tf_name': 0.3333333333333333}]"

Using Splink 4.0.5

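import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

data = [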
    {"unique_id": 1, "name": "Aaaaaaaaa"},
    {"unique_id": 2, "name": "Bbbbbbbbb"},
    {"unique_id": 3, "name": "Ccccccccc"},
]
df = pd.DataFrame.from_dict(data)

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("name")],
    comparisons=[cl.NameComparison("name")]
)

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)

df_predict = linker.inference.predict(threshold_match_probability=0.1)
print(df_predict.as_record_dict())
# Prints "[]"

df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.1
)
# Throws the following error:
# Traceback (most recent call last):
#   File "/test.py", line 32, in test
#     df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
#   File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/linker_components/clustering.py", line 92, in cluster_pairwise_predictions_at_threshold
#     c.unquote().name for c in df_predict.columns
#   File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/duckdb/dataframe.py", line 23, in columns
#     d = self.as_record_dict(1)[0]
# IndexError: list index out of range
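
As a possible workaround on my side, I could skip the clustering call when the predictions are empty and give every input record its own cluster, which matches the Splink 3 output shown earlier (cluster_id equal to unique_id). This is only a sketch in plain Python: singleton_clusters, cluster_with_fallback and cluster_fn are hypothetical names, and cluster_fn stands in for whatever wraps the real clustering call.

```python
def singleton_clusters(records, id_column="unique_id"):
    # Give every input record its own cluster, using its id as the cluster id.
    return [{"cluster_id": record[id_column], **record} for record in records]


def cluster_with_fallback(predictions, records, cluster_fn):
    # No pairwise predictions: nothing can be linked, so fall back to singletons.
    if not predictions:
        return singleton_clusters(records)
    # Otherwise defer to the real clustering step (hypothetical callable).
    return cluster_fn(predictions)


data = [
    {"unique_id": 1, "name": "Aaaaaaaaa"},
    {"unique_id": 2, "name": "Bbbbbbbbb"},
    {"unique_id": 3, "name": "Ccccccccc"},
]

# An empty list stands in for df_predict.as_record_dict() returning [].
clusters = cluster_with_fallback([], data, cluster_fn=lambda preds: preds)
print(clusters)
```

The fallback output omits the internal columns (__splink_salt, tf_name) that Splink 3 included, since those are computed by the linker.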

OS:

macOS

Splink version:

4.0.5


thibault1024 commented 2 weeks ago

Thanks a lot, @ADBond!