First of all, thanks a lot for your work on Splink!
I am a Splink aficionado and I am currently updating my Splink-based application from 3.9.14 to 4.0.5.
The update went smoothly, except for one difference in behavior I noticed in the cluster_pairwise_predictions_at_threshold(...) method.
In Splink 3, when this method is called with an empty df_predict dataframe, clusters with one item are returned.
In Splink 4, when this method is called with an empty df_predict dataframe, the following error occurs:
Traceback (most recent call last):
  File "/test.py", line 32, in test
    df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
  File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/linker_components/clustering.py", line 92, in cluster_pairwise_predictions_at_threshold
    c.unquote().name for c in df_predict.columns
  File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/duckdb/dataframe.py", line 23, in columns
    d = self.as_record_dict(1)[0]
IndexError: list index out of range
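The last frame of the traceback suggests that the column list is derived from the first record of the result, which cannot work when there are no records. The following is a minimal sketch of that failure mode in plain Python, not Splink's actual code: both function names and the `fallback` parameter are hypothetical and exist only for illustration.

```python
def columns_from_first_record(records):
    # Same pattern as the failing line in the traceback: index the first record.
    # Raises IndexError when `records` is empty.
    return list(records[0].keys())


def columns_from_first_record_safe(records, fallback=()):
    # Hypothetical guard: return a fallback column list when there are no records.
    if not records:
        return list(fallback)
    return list(records[0].keys())


try:
    columns_from_first_record([])
except IndexError as exc:
    print(f"IndexError: {exc}")

print(columns_from_first_record_safe([]))  # prints []
```

A fix on the library side would presumably need to read the column names from the table schema rather than from the data, but that is an assumption about the internals.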
import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Three records with distinct names, so blocking on "name" yields no candidate pairs
data = [
    {"unique_id": 1, "name": "Aaaaaaaaa"},
    {"unique_id": 2, "name": "Bbbbbbbbb"},
    {"unique_id": 3, "name": "Ccccccccc"},
]
df = pd.DataFrame.from_dict(data)
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("name")],
    comparisons=[cl.NameComparison("name")]
)
db_api = DuckDBAPI()
linker = Linker(df, settings, db_api=db_api)
df_predict = linker.inference.predict(threshold_match_probability=0.1)
print(df_predict.as_record_dict())
# Prints "[]"
df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.1
)
# Throws the following error:
# Traceback (most recent call last):
# File "/test.py", line 32, in test
# df_cluster = linker.clustering.cluster_pairwise_predictions_at_threshold(
# File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/linker_components/clustering.py", line 92, in cluster_pairwise_predictions_at_threshold
# c.unquote().name for c in df_predict.columns
# File "/opt/miniconda3/envs/linkage-clustering-linkage-splink2/lib/python3.9/site-packages/splink/internals/duckdb/dataframe.py", line 23, in columns
# d = self.as_record_dict(1)[0]
# IndexError: list index out of range
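Until the behavior is aligned, a guard on the caller's side can restore the Splink 3 result. The sketch below is a workaround under stated assumptions, not an official recommendation: `singleton_clusters` and `cluster_or_singletons` are hypothetical helpers, and the `cluster_id` column name is an assumption about the desired output shape. It works on plain lists of dicts so that it runs without Splink installed.

```python
def singleton_clusters(records, id_col="unique_id"):
    # One cluster per input record, using the record's own id as the cluster id
    # (assumed output shape: a "cluster_id" key followed by the input columns).
    return [{"cluster_id": r[id_col], **r} for r in records]


def cluster_or_singletons(records, pairwise_predictions, cluster_fn, id_col="unique_id"):
    # With no pairwise predictions there are no edges, so every record is its own cluster.
    if not pairwise_predictions:
        return singleton_clusters(records, id_col)
    # Otherwise defer to the real clustering call supplied by the caller.
    return cluster_fn()


records = [
    {"unique_id": 1, "name": "Aaaaaaaaa"},
    {"unique_id": 2, "name": "Bbbbbbbbb"},
    {"unique_id": 3, "name": "Ccccccccc"},
]
clusters = cluster_or_singletons(records, [], cluster_fn=lambda: [])
print(clusters)
```

In the reproduction above, `pairwise_predictions` would be `df_predict.as_record_dict()` and `cluster_fn` a lambda wrapping the `cluster_pairwise_predictions_at_threshold(...)` call, so that the call is only made when there is at least one prediction.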
OS:
MacOS
Splink version:
4.0.5
Have you tried this on the latest master branch?
[x] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
Let me know if more information is needed.