moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Unable to `profile_columns` if a column is entirely NULL #856

Open samnlindsay opened 2 years ago

samnlindsay commented 2 years ago

What happens?

Null columns break the linker.profile_columns method with the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-eb3fbd260dd8> in <module>
----> 1 linker.profile_columns(["null_col"], top_n=10, bottom_n=5)

~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/linker.py in profile_columns(self, column_expressions, top_n, bottom_n)
   1226     ):
   1227 
-> 1228         return profile_columns(self, column_expressions, top_n=top_n, bottom_n=bottom_n)
   1229 
   1230     def estimate_m_from_pairwise_labels(self, table_name):

~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/profile_data.py in profile_columns(linker, column_expressions, top_n, bottom_n)
    207             p for p in percentile_rows_all if p["group_name"] == _group_name(expression)
    208         ]
--> 209         percentile_rows = _add_100_percentile_to_df_percentiles(percentile_rows)
    210         top_n_rows = [
    211             p for p in top_n_rows_all if p["group_name"] == _group_name(expression)

~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/profile_data.py in _add_100_percentile_to_df_percentiles(percentile_rows)
    159 def _add_100_percentile_to_df_percentiles(percentile_rows):
    160 
--> 161     r = percentile_rows[0]
    162     if r["percentile_ex_nulls"] != 1.0:
    163         first_row = deepcopy(r)

IndexError: list index out of range

The code assumes that percentile_rows is non-empty, but an entirely NULL column yields an empty list. It would be easy enough to check that percentile_rows is non-empty before calling _add_100_percentile_to_df_percentiles(percentile_rows), but I'm not sure what the desired behaviour would be: a warning, or just silently not creating a chart for that column?
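A minimal sketch of the kind of emptiness check meant here, assuming percentile rows are plain dicts as the traceback suggests. `safe_first_row` is a hypothetical helper written for illustration and is not part of Splink:

```python
def safe_first_row(percentile_rows):
    """Return the first percentile row, or None when there are no rows.

    Hypothetical helper for illustration only; not part of Splink.
    """
    if not percentile_rows:
        # An entirely NULL column appears to produce no percentile rows,
        # so indexing with [0] would raise IndexError.
        return None
    return percentile_rows[0]


rows_for_null_col = []  # assumed result for an all-NULL column
rows_for_normal_col = [{"percentile_ex_nulls": 0.9}]  # made-up example row

first_null = safe_first_row(rows_for_null_col)
first_normal = safe_first_row(rows_for_normal_col)
```

Returning None leaves the caller free to decide between warning and silently skipping the chart.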

To Reproduce

Add a null column to any input df and then profile only that column:

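# Import path as of Splink 3.x
from splink.duckdb.duckdb_linker import DuckDBLinker

# df is any existing input DataFrame
df["null_col"] = None
linker = DuckDBLinker(df)
linker.profile_columns(["null_col"], top_n=10, bottom_n=5)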

OS:

All environments

Splink version:

3.3.3

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

RossKen commented 1 year ago

To fix: emit a warning message listing the NULL column(s), but allow the rest of the method to run.
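One way this could look, as a rough sketch rather than the actual Splink change. It assumes the percentile rows have already been grouped per column expression into a dict; `drop_empty_profiles`, that dict shape, and the `first_name` column are all hypothetical names used for illustration:

```python
import warnings


def drop_empty_profiles(rows_by_column):
    """Warn about columns with no percentile rows and return the rest.

    `rows_by_column` maps a column expression to its list of percentile
    rows. The function name and input shape are assumptions for this
    sketch, not Splink's internal API.
    """
    empty = [col for col, rows in rows_by_column.items() if not rows]
    if empty:
        warnings.warn(
            "Skipping columns that contain only NULL values: "
            + ", ".join(empty)
        )
    # Keep only the columns that have something to chart.
    return {col: rows for col, rows in rows_by_column.items() if rows}


profiles = {
    "null_col": [],  # entirely NULL column, so no rows
    "first_name": [{"percentile_ex_nulls": 1.0}],  # made-up example row
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_empty_profiles(profiles)
```

Filtering before the per-column loop means the remaining columns are still charted, and the user sees a single warning naming every skipped column.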