moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.34k stars 147 forks source link

bug: "can't take logarithm of zero" #1889

Open NickCrews opened 9 months ago

NickCrews commented 9 months ago

What happens?

I think we need to add in some safeguards when calculating log2(*bayes_factors) AS match_weight.

I didn't include a reproducible example, but I think you can see how this would come about:

  1. if any of the BFs are 0
  2. or, if they are so small that with floating point errors the multiplication of them is 0

then the arg passed to log2 will be zero.

Two options?

  1. add in a tiny delta like LOG2(.00000000000000000000000001 + bf1*bf2*bf3...)
  2. use CASE WHEN <args> > 0 THEN LOG2(<args>> ELSE "-inf" END or similar
```python-traceback OutOfRangeException: Out of Range Error: cannot take logarithm of zero The above exception was the direct cause of the following exception: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ | /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:1798 in predict │ │ │ │ 1795 │ │ for sql in sqls: │ │ 1796 │ │ │ self._enqueue_sql(sql["sql"], sql["output_table_name"]) │ │ 1797 │ │ │ │ ❱ 1798 │ │ predictions = self._execute_sql_pipeline(input_dataframes) │ │ 1799 │ │ self._predict_warning() │ │ 1800 │ │ │ │ 1801 │ │ [b.drop_materialised_id_pairs_dataframe() for b in exploding_br_with_id_tables] │ │ │ │ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:650 in │ │ _execute_sql_pipeline │ │ │ │ 647 │ │ │ │ │ use_cache, │ │ 648 │ │ │ │ ) │ │ 649 │ │ │ except Exception as e: │ │ ❱ 650 │ │ │ │ raise e │ │ 651 │ │ │ finally: │ │ 652 │ │ │ │ self._pipeline.reset() │ │ 653 │ │ │ │ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:644 in │ │ _execute_sql_pipeline │ │ │ │ 641 │ │ │ output_tablename_templated = self._pipeline.queue[-1].output_table_name │ │ 642 │ │ │ │ │ 643 │ │ │ try: │ │ ❱ 644 │ │ │ │ dataframe = self._sql_to_splink_dataframe_checking_cache( │ │ 645 │ │ │ │ │ sql_gen, │ │ 646 │ │ │ │ │ output_tablename_templated, │ │ 647 │ │ │ │ │ use_cache, │ │ │ │ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:910 in │ │ _sql_to_splink_dataframe_checking_cache │ │ │ │ 907 │ │ │ │ print(df_pd) # noqa: T201 │ │ 908 │ │ │ │ 909 │ │ else: │ │ ❱ 910 │ │ │ splink_dataframe = self._execute_sql_against_backend( │ │ 911 │ │ │ │ sql, output_tablename_templated, table_name_hash │ │ 912 │ │ │ ) │ │ 913 │ │ │ self._intermediate_table_cache.executed_queries.append(splink_dataframe) │ │ │ │ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/duckdb/linker.py:220 in │ │ _execute_sql_against_backend │ │ │ │ 217 │ │ AS │ │ 218 │ │ ({sql}) │ │ 219 │ │ """ │ │ ❱ 220 │ │ self._log_and_run_sql_execution(sql, templated_name, physical_name) │ │ 221 │ │ │ │ 222 │ │ return DuckDBDataFrame(templated_name, physical_name, self) │ │ 223 │ │ │ │ /Users/nc/code/scg/atlas/.venv/lib/python3.11/site-packages/splink/linker.py:724 in │ │ _log_and_run_sql_execution │ │ │ │ 721 │ │ │ except Exception: │ │ 722 │ │ │ │ pass │ │ 723 │ │ │ │ │ ❱ 724 │ │ │ raise SplinkException( │ │ 725 │ │ │ │ f"Error executing the following sql for table " │ │ 726 │ │ │ │ f"`{templated_name}`({physical_name}):\n{final_sql}" │ │ 727 │ │ │ │ f"\n\nError was: {e}" │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ SplinkException: Error executing the following sql for table `__splink__df_predict`(__splink__df_predict_d7948ec58): CREATE TABLE __splink__df_predict_d7948ec58 AS ( WITH __splink__df_concat_with_tf AS ( SELECT * FROM __splink__df_concat_with_tf_7a64735cd ), __splink__df_blocked AS ( SELECT "l"."raw__politician__id" AS "raw__politician__id_l", "r"."raw__politician__id" AS "raw__politician__id_r", "l"."politician__first_name" AS "politician__first_name_l", "r"."politician__first_name" AS "politician__first_name_r", "l"."tf_politician__first_name" AS "tf_politician__first_name_l", "r"."tf_politician__first_name" AS "tf_politician__first_name_r", "l"."politician__first_name_sorted" AS "politician__first_name_sorted_l", "r"."politician__first_name_sorted" AS "politician__first_name_sorted_r", "l"."politician__first_name_gender" AS "politician__first_name_gender_l", "r"."politician__first_name_gender" AS "politician__first_name_gender_r", "l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l", "r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r", "l"."politician__last_name" AS "politician__last_name_l", "r"."politician__last_name" AS "politician__last_name_r", "l"."tf_politician__last_name" AS "tf_politician__last_name_l", "r"."tf_politician__last_name" AS "tf_politician__last_name_r", "l"."politician__last_name_sorted" AS "politician__last_name_sorted_l", "r"."politician__last_name_sorted" AS "politician__last_name_sorted_r", "l"."politician__party" AS "politician__party_l", "r"."politician__party" AS "politician__party_r", "l"."politician__state" AS "politician__state_l", "r"."politician__state" AS "politician__state_r", "l"."politician__city" AS "politician__city_l", "r"."politician__city" AS "politician__city_r", "l"."src" AS "src_l", "r"."src" AS "src_r", "l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", '0' AS match_key FROM __splink__df_concat_with_tf AS l INNER JOIN __splink__df_concat_with_tf AS r ON ( ( ( l.src <> r.src ) AND ( l.politician__first_name_sorted = r.politician__first_name_sorted ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ) WHERE l."raw__politician__id" < r."raw__politician__id" UNION ALL SELECT "l"."raw__politician__id" AS "raw__politician__id_l", "r"."raw__politician__id" AS "raw__politician__id_r", "l"."politician__first_name" AS "politician__first_name_l", "r"."politician__first_name" AS "politician__first_name_r", "l"."tf_politician__first_name" AS "tf_politician__first_name_l", "r"."tf_politician__first_name" AS "tf_politician__first_name_r", "l"."politician__first_name_sorted" AS "politician__first_name_sorted_l", "r"."politician__first_name_sorted" AS "politician__first_name_sorted_r", "l"."politician__first_name_gender" AS "politician__first_name_gender_l", "r"."politician__first_name_gender" AS "politician__first_name_gender_r", "l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l", "r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r", "l"."politician__last_name" AS "politician__last_name_l", "r"."politician__last_name" AS "politician__last_name_r", "l"."tf_politician__last_name" AS "tf_politician__last_name_l", "r"."tf_politician__last_name" AS "tf_politician__last_name_r", "l"."politician__last_name_sorted" AS "politician__last_name_sorted_l", "r"."politician__last_name_sorted" AS "politician__last_name_sorted_r", "l"."politician__party" AS "politician__party_l", "r"."politician__party" AS "politician__party_r", "l"."politician__state" AS "politician__state_l", "r"."politician__state" AS "politician__state_r", "l"."politician__city" AS "politician__city_l", "r"."politician__city" AS "politician__city_r", "l"."src" AS "src_l", "r"."src" AS "src_r", "l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", '1' AS match_key FROM __splink__df_concat_with_tf AS l INNER JOIN __splink__df_concat_with_tf AS r ON ( ( ( l.src <> r.src ) AND ( l.politician__last_name_sorted = r.politician__last_name_sorted ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ) WHERE l."raw__politician__id" < r."raw__politician__id" AND NOT ( COALESCE( ( ( ( l.src <> r.src ) AND ( l.politician__first_name_sorted = r.politician__first_name_sorted ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ), FALSE ) ) UNION ALL SELECT "l"."raw__politician__id" AS "raw__politician__id_l", "r"."raw__politician__id" AS "raw__politician__id_r", "l"."politician__first_name" AS "politician__first_name_l", "r"."politician__first_name" AS "politician__first_name_r", "l"."tf_politician__first_name" AS "tf_politician__first_name_l", "r"."tf_politician__first_name" AS "tf_politician__first_name_r", "l"."politician__first_name_sorted" AS "politician__first_name_sorted_l", "r"."politician__first_name_sorted" AS "politician__first_name_sorted_r", "l"."politician__first_name_gender" AS "politician__first_name_gender_l", "r"."politician__first_name_gender" AS "politician__first_name_gender_r", "l"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_l", "r"."tf_politician__first_name_gender" AS "tf_politician__first_name_gender_r", "l"."politician__last_name" AS "politician__last_name_l", "r"."politician__last_name" AS "politician__last_name_r", "l"."tf_politician__last_name" AS "tf_politician__last_name_l", "r"."tf_politician__last_name" AS "tf_politician__last_name_r", "l"."politician__last_name_sorted" AS "politician__last_name_sorted_l", "r"."politician__last_name_sorted" AS "politician__last_name_sorted_r", "l"."politician__party" AS "politician__party_l", "r"."politician__party" AS "politician__party_r", "l"."politician__state" AS "politician__state_l", "r"."politician__state" AS "politician__state_r", "l"."politician__city" AS "politician__city_l", "r"."politician__city" AS "politician__city_r", "l"."src" AS "src_l", "r"."src" AS "src_r", "l"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "r"."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" AS "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", '2' AS match_key FROM __splink__df_concat_with_tf AS l INNER JOIN __splink__df_concat_with_tf AS r ON ( ( ( l.src <> r.src ) AND ( l.politician__state = r.politician__state ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ) WHERE l."raw__politician__id" < r."raw__politician__id" AND NOT ( COALESCE( ( ( ( l.src <> r.src ) AND ( l.politician__first_name_sorted = r.politician__first_name_sorted ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ), FALSE ) OR COALESCE( ( ( ( l.src <> r.src ) AND ( l.politician__last_name_sorted = r.politician__last_name_sorted ) ) AND ( l."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" <> r."DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)" ) ), FALSE ) ) ), __splink__df_comparison_vectors AS ( SELECT "raw__politician__id_l", "raw__politician__id_r", "politician__first_name_l", "politician__first_name_r", "politician__first_name_sorted_l", "politician__first_name_sorted_r", "politician__first_name_gender_l", "politician__first_name_gender_r", CASE WHEN "politician__first_name_l" IS NULL OR "politician__first_name_r" IS NULL THEN -1 WHEN politician__first_name_l[1] = politician__first_name_r[1] AND ( LENGTH(politician__first_name_r) = 1 OR LENGTH(politician__first_name_l) = 1 ) THEN 5 WHEN "politician__first_name_l" = "politician__first_name_r" THEN 4 WHEN ARE_ALIASES(politician__first_name_l, politician__first_name_r) THEN 3 WHEN ( DAMERAU_LEVENSHTEIN("politician__first_name_l", "politician__first_name_r") <= 1 ) OR ( ( DAMERAU_LEVENSHTEIN("politician__first_name_l", "politician__first_name_r") <= 2 ) AND ( STRLEN(politician__first_name_l) >= 5 ) ) OR ( ( ( CONTAINS(politician__first_name_l, politician__first_name_r) AND LENGTH(politician__first_name_r) >= 3 ) OR ( CONTAINS(politician__first_name_r, politician__first_name_l) AND LENGTH(politician__first_name_l) >= 3 ) ) ) OR ( "politician__first_name_sorted_l" = "politician__first_name_sorted_r" ) THEN 2 WHEN politician__first_name_gender_l = politician__first_name_gender_r THEN 1 ELSE 0 END AS gamma_politician__first_name, "tf_politician__first_name_l", "tf_politician__first_name_r", "tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r", "politician__last_name_l", "politician__last_name_r", "politician__last_name_sorted_l", "politician__last_name_sorted_r", CASE WHEN "politician__last_name_l" IS NULL OR "politician__last_name_r" IS NULL THEN -1 WHEN "politician__last_name_l" = "politician__last_name_r" THEN 2 WHEN ( DAMERAU_LEVENSHTEIN("politician__last_name_l", "politician__last_name_r") <= 1 ) OR ( ( DAMERAU_LEVENSHTEIN("politician__last_name_l", "politician__last_name_r") <= 2 ) AND ( STRLEN(politician__last_name_l) >= 5 ) ) OR ( ( ( CONTAINS(politician__last_name_l, politician__last_name_r) AND LENGTH(politician__last_name_r) >= 3 ) OR ( CONTAINS(politician__last_name_r, politician__last_name_l) AND LENGTH(politician__last_name_l) >= 3 ) ) ) OR ( "politician__last_name_sorted_l" = "politician__last_name_sorted_r" ) THEN 1 ELSE 0 END AS gamma_politician__last_name, "tf_politician__last_name_l", "tf_politician__last_name_r", "politician__party_l", "politician__party_r", CASE WHEN "politician__party_l" = "politician__party_r" THEN 1 WHEN "politician__party_l" IS NULL OR "politician__party_r" IS NULL THEN -1 ELSE 0 END AS gamma_party, "politician__state_l", "politician__state_r", "politician__city_l", "politician__city_r", CASE WHEN ( "politician__city_l" IS NULL OR "politician__city_r" IS NULL ) AND ( "politician__state_l" IS NULL OR "politician__state_r" IS NULL ) THEN -1 WHEN ( "politician__city_l" = "politician__city_r" ) AND ( "politician__state_l" = "politician__state_r" ) THEN 2 WHEN ( "politician__city_l" = "politician__city_r" ) OR ( "politician__state_l" = "politician__state_r" ) THEN 1 ELSE 0 END AS gamma_city_state, "src_l", "src_r", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", match_key FROM __splink__df_blocked ), __splink__df_match_weight_parts AS ( SELECT "raw__politician__id_l", "raw__politician__id_r", "politician__first_name_l", "politician__first_name_r", "politician__first_name_sorted_l", "politician__first_name_sorted_r", "politician__first_name_gender_l", "politician__first_name_gender_r", gamma_politician__first_name, "tf_politician__first_name_l", "tf_politician__first_name_r", "tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r", CASE WHEN gamma_politician__first_name = -1 THEN CAST(1.0 AS DOUBLE) WHEN gamma_politician__first_name = 5 THEN CAST(1.0633529488606062e-180 AS DOUBLE) WHEN gamma_politician__first_name = 4 THEN CAST(134.4620704350685 AS DOUBLE) WHEN gamma_politician__first_name = 3 THEN CAST(16.68908639059793 AS DOUBLE) WHEN gamma_politician__first_name = 2 THEN CAST(9.78866591115e-313 AS DOUBLE) WHEN gamma_politician__first_name = 1 THEN CAST(2.0336191245597566e-24 AS DOUBLE) WHEN gamma_politician__first_name = 0 THEN CAST(0.31856082028220833 AS DOUBLE) END AS bf_politician__first_name, CASE WHEN gamma_politician__first_name = -1 THEN CAST(1 AS DOUBLE) WHEN gamma_politician__first_name = 5 THEN CAST(1 AS DOUBLE) WHEN gamma_politician__first_name = 4 THEN ( CASE WHEN NOT COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") IS NULL THEN POWER( CAST(0.006170059161530446 AS DOUBLE) / ( CASE WHEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") >= COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l") THEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") ELSE COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l") END ), CAST(1.0 AS DOUBLE) ) ELSE CAST(1 AS DOUBLE) END ) WHEN gamma_politician__first_name = 3 THEN ( CASE WHEN NOT COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") IS NULL THEN POWER( CAST(0.006170059161530446 AS DOUBLE) / ( CASE WHEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") >= COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l") THEN COALESCE("tf_politician__first_name_l", "tf_politician__first_name_r") ELSE COALESCE("tf_politician__first_name_r", "tf_politician__first_name_l") END ), CAST(1.0 AS DOUBLE) ) ELSE CAST(1 AS DOUBLE) END ) WHEN gamma_politician__first_name = 2 THEN CAST(1 AS DOUBLE) WHEN gamma_politician__first_name = 1 THEN ( CASE WHEN NOT COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r") IS NULL THEN POWER( CAST(0.6262998096190533 AS DOUBLE) / ( CASE WHEN COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r") >= COALESCE("tf_politician__first_name_gender_r", "tf_politician__first_name_gender_l") THEN COALESCE("tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r") ELSE COALESCE("tf_politician__first_name_gender_r", "tf_politician__first_name_gender_l") END ), CAST(1.0 AS DOUBLE) ) ELSE CAST(1 AS DOUBLE) END ) WHEN gamma_politician__first_name = 0 THEN CAST(1 AS DOUBLE) END AS bf_tf_adj_politician__first_name, "politician__last_name_l", "politician__last_name_r", "politician__last_name_sorted_l", "politician__last_name_sorted_r", gamma_politician__last_name, "tf_politician__last_name_l", "tf_politician__last_name_r", CASE WHEN gamma_politician__last_name = -1 THEN CAST(1.0 AS DOUBLE) WHEN gamma_politician__last_name = 2 THEN CAST(2072.5200422351572 AS DOUBLE) WHEN gamma_politician__last_name = 1 THEN CAST(22.29208621508213 AS DOUBLE) WHEN gamma_politician__last_name = 0 THEN CAST(1.033614963801202e-59 AS DOUBLE) END AS bf_politician__last_name, CASE WHEN gamma_politician__last_name = -1 THEN CAST(1 AS DOUBLE) WHEN gamma_politician__last_name = 2 THEN ( CASE WHEN NOT COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r") IS NULL THEN POWER( CAST(0.0004647912544256984 AS DOUBLE) / ( CASE WHEN COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r") >= COALESCE("tf_politician__last_name_r", "tf_politician__last_name_l") THEN COALESCE("tf_politician__last_name_l", "tf_politician__last_name_r") ELSE COALESCE("tf_politician__last_name_r", "tf_politician__last_name_l") END ), CAST(1.0 AS DOUBLE) ) ELSE CAST(1 AS DOUBLE) END ) WHEN gamma_politician__last_name = 1 THEN CAST(1 AS DOUBLE) WHEN gamma_politician__last_name = 0 THEN CAST(1 AS DOUBLE) END AS bf_tf_adj_politician__last_name, "politician__party_l", "politician__party_r", gamma_party, CASE WHEN gamma_party = 1 THEN CAST(3.139473469428194 AS DOUBLE) WHEN gamma_party = -1 THEN CAST(1.0 AS DOUBLE) WHEN gamma_party = 0 THEN CAST(0.08290316547026433 AS DOUBLE) END AS bf_party, "politician__state_l", "politician__state_r", "politician__city_l", "politician__city_r", gamma_city_state, CASE WHEN gamma_city_state = -1 THEN CAST(1.0 AS DOUBLE) WHEN gamma_city_state = 2 THEN CAST(583.62425872987 AS DOUBLE) WHEN gamma_city_state = 1 THEN CAST(5.782229188377593 AS DOUBLE) WHEN gamma_city_state = 0 THEN CAST(0.018929761894094878 AS DOUBLE) END AS bf_city_state, "src_l", "src_r", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", match_key FROM __splink__df_comparison_vectors ) SELECT LOG2( CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name * bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state ) AS match_weight, CASE WHEN bf_politician__first_name = CAST('infinity' AS DOUBLE) OR bf_tf_adj_politician__first_name = CAST('infinity' AS DOUBLE) OR bf_politician__last_name = CAST('infinity' AS DOUBLE) OR bf_tf_adj_politician__last_name = CAST('infinity' AS DOUBLE) OR bf_party = CAST('infinity' AS DOUBLE) OR bf_city_state = CAST('infinity' AS DOUBLE) THEN 1.0 ELSE ( CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name * bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state ) / ( 1 + ( CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name * bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state ) ) END AS match_probability, "raw__politician__id_l", "raw__politician__id_r", "politician__first_name_l", "politician__first_name_r", "politician__first_name_sorted_l", "politician__first_name_sorted_r", "politician__first_name_gender_l", "politician__first_name_gender_r", gamma_politician__first_name, "tf_politician__first_name_l", "tf_politician__first_name_r", "tf_politician__first_name_gender_l", "tf_politician__first_name_gender_r", bf_politician__first_name, bf_tf_adj_politician__first_name, "politician__last_name_l", "politician__last_name_r", "politician__last_name_sorted_l", "politician__last_name_sorted_r", gamma_politician__last_name, "tf_politician__last_name_l", "tf_politician__last_name_r", bf_politician__last_name, bf_tf_adj_politician__last_name, "politician__party_l", "politician__party_r", gamma_party, bf_party, "politician__state_l", "politician__state_r", "politician__city_l", "politician__city_r", gamma_city_state, bf_city_state, "src_l", "src_r", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_l", "DefiniteMatchKey(politician__first_name:politician__last_name:politician__state)_r", match_key FROM __splink__df_match_weight_parts WHERE LOG2( CAST(0.00010001000100010001 AS DOUBLE) * bf_politician__first_name * bf_tf_adj_politician__first_name * bf_politician__last_name * bf_tf_adj_politician__last_name * bf_party * bf_city_state ) >= -26.57542474467195 ORDER BY 1 NULLS LAST ) Error was: Out of Range Error: cannot take logarithm of zero ```

To Reproduce

sorry, if you really want I can come up with something, but I think you might be better able to come up with it than me.

OS:

duckdb

Splink version:

4.9.11

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

RobinL commented 9 months ago

Yeah, I agree. I think (but not sure, haven't thought that hard) that this is the same as the infinity protection. Which would lead me to think the case statement would be the best option (just for symmetry)

I have wondered before whether there's a third option (which also could potentially deal with the infinity case statements) of having this logic on the python side rather than in the sql.

e.g. could we clamp the value of the bf_ to between some values (e.g. between 1e-100, 1e100 or something), and issue a warning

Another thing I've wondered a bit about is whether we should move to the bf_ being match weights in the sql, which would then be additive. That would at least avoid (possibly?) the floating point issue.

vfrank66 commented 7 months ago

Hey attempting to test out this package and I am receiving this error. I can see the columns that make up this function call failure in sql, but what exactly would be the way to research and address this problem?

This is occurring on my first test dataset, so I do not know how to address it or if it is incorrect column profiling.

JohnHenningsen commented 3 months ago

Hi there, I am also running into this issue, using the duckdb backend. Does anyone know a workaround to fix this issue, at least for the moment? I understand where the issue comes from based on the discussion above, but do not know where to start with the suggested solutions, e.g. adding a tiny delta to prevent log(0).

vfrank66 commented 3 months ago

A simple workaround fix is duplicate a single row with a new unique id. Just one row that is an exact match to another. For me that was enough to fix it.

RobinL commented 3 months ago

It'd be great if someone could find a reprex for this issue. I've not actually encountered it myself. not doubting it exists - I suspect it happens with data of a certain type that we don't usually encounter, possibly such as certain values having no dupes (as vfrank66 alludes to).

If not a reprex, @JohnHenningsen are you able to post a screenshot of the match weight charts - it's possible that provides some insights...

In any case, we should hopefully be able to get round to fixing fairly soon, once Splink 4 is released (which has been absorbing most of our time for some months now)

NickCrews commented 3 months ago

@RobinL Im just re-reading your original response, and yes I think we should switch to combining match weights additively, otherwise Im pretty sure we will run into floating point errors. So that might make this whole thing moot?

RobinL commented 3 months ago

Yeah, agree, it's definitely the right solution. The problem is it's quite a big job because all the visualisations and dashboards expect data in the current format

JohnHenningsen commented 3 months ago

Thanks for the helpful suggestions everyone! Unfortunately our cluster is down at the moment but I will try the simple workaround and help reproduce this issue as soon as possible.

To give a bit of context, aside from a few columns of categorical data we are relying on a product description column to match records. That column contains a 3-10 word string, and we came up with some custom comparisons based on array_intersect. It is quite likely that there are no exact matches, as we have high variance in the data entry for this column.