moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.36k stars 148 forks source link

Tf tables not being correctly referenced in 'estimate_probability_two_random_records_match' #947

Closed RobinL closed 3 months ago

RobinL commented 1 year ago
  Hello, 

I'm getting an error that I haven't seen while trying to run estimate_probability_two_random_records_match(): Screenshot 2022-12-13 at 4 44 44 PM Screenshot 2022-12-13 at 4 44 54 PM

Here's the code I'm trying to run:

comparison_fname = {
        "output_column_name": "fname",
        "comparison_description": "First name exact match, jaro winkler similarity >= 0.8 or >=0.65, and else levels",
        "comparison_levels": [
            null_level("fname"),
            exact_match_level("fname", term_frequency_adjustments=True),
            columns_reversed_level("fname", "lname"),
            levenshtein_level("fname", 2),
            jaro_winkler_level("fname", 0.8),
            jaro_winkler_level("fname", 0.65),
            else_level(),
        ],
    }

    comparison_lname = {
        "output_column_name": "lname",
        "comparison_description": "Last name exact match, jaro-winkler similarity >= 0.8, or else levels",
        "comparison_levels": [
            null_level("lname"),
            exact_match_level("lname", term_frequency_adjustments=True),
            levenshtein_level("lname", 2),
            jaro_winkler_level("lname", 0.8),
            else_level(),
        ],
    }

    comparison_mname = {
        "output_column_name": "mname",
        "comparison_description": "Middle name exact match, null, or else levels",
        "comparison_levels": [
            null_level("mname"),
            exact_match_level("mname"),
            else_level(),
        ],
    }

    comparison_dob = {
        "output_column_name": "dob",
        "comparison_description": "Date of birth null level, levenshtein <= 1 and 2, exact match, or else levels",
        "comparison_levels": [
            null_level("dob"),
            exact_match_level("dob"),
            {
                "sql_condition": "substring(dob_l,1,4) = substring(dob_r,1,4)",
                "label_for_charts": "Equal_Year",
            },
            levenshtein_level("dob", 1),
            # catch if two numbers are switched (1 transposition)
            jaro_winkler_level("dob", 0.95),
            {
                "sql_condition": "(abs(cast(substring(dob_l,1,4) as int) - cast(substring(dob_r,1,4)as int)) >= 16)",
                "label_for_charts": "more than 16 yrs apart",
            },
            else_level(),
        ],
    }

    comparison_address = {
        "output_column_name": "address",
        "comparison_description": "Address null level, exact match, jaro-winkler similarity >= 0.8, jaccard >= 0.9, levenshtein <= 2 or else levels",
        "comparison_levels": [
            exact_match_level("address", term_frequency_adjustments=True),
            null_level("address"),
            jaro_winkler_level("address", 0.7),
            else_level(),
        ],
    }
   comparison_phone = {
            "output_column_name": "phone",
            "comparison_description": "phone null level, exact match",
            "comparison_levels": [
                null_level("phone"),
                exact_match_level("phone"),
                {
                   "sql_condition": "(phone_l = phone2_r) OR (phone2_l = phone2_r)",
                   "label_for_charts": "phone2 matches",
                },
                else_level(),
            ],
        }

    comparison_zip = {
        "output_column_name": "zip",
        "comparison_description": "Zip exact match no null level",
        "comparison_levels": [exact_match_level("zip"), else_level()],
    }
 
    custom_comparisons = [
            comparison_fname,
            comparison_lname,
            comparison_mname,
            comparison_dob,
            comparison_address,
            comparison_phone,
            exact_match("aptnbr"),
            comparison_zip]
    # initialize linker
    settings = {
        "link_type": link_type,
        "blocking_rules_to_generate_predictions": blocking_rules,
        "comparisons": custom_comparisons,
        "unique_id_column_name": "pid",
        "additional_columns_to_retain": ["gender","cntycd"],
    }

    linker = SparkLinker(
            df,
            settings,
            input_table_aliases=["infutor_df", "alesco_df"],
            break_lineage_method="parquet",
            num_partitions_on_repartition=10,
        )
        # get estimate for number of linked records
        deterministic_rules = ["l.fname = r.fname and l.lname = r.lname and l.address = r.address"]
        linker.estimate_probability_two_random_records_match(deterministic_rules, 0.5)

Can you send anything in my code that'd be causing this issue?

Originally posted by @mmagoffin-sd in https://github.com/moj-analytical-services/splink/discussions/940

RobinL commented 3 months ago

Closing now that splink 4 has been released. If error still occurs in Splink 4 we will reopen