Null columns break the linker.profile_columns method with the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-eb3fbd260dd8> in <module>
----> 1 linker.profile_columns(["null_col"], top_n=10, bottom_n=5)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/linker.py in profile_columns(self, column_expressions, top_n, bottom_n)
1226 ):
1227
-> 1228 return profile_columns(self, column_expressions, top_n=top_n, bottom_n=bottom_n)
1229
1230 def estimate_m_from_pairwise_labels(self, table_name):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/profile_data.py in profile_columns(linker, column_expressions, top_n, bottom_n)
207 p for p in percentile_rows_all if p["group_name"] == _group_name(expression)
208 ]
--> 209 percentile_rows = _add_100_percentile_to_df_percentiles(percentile_rows)
210 top_n_rows = [
211 p for p in top_n_rows_all if p["group_name"] == _group_name(expression)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/splink/profile_data.py in _add_100_percentile_to_df_percentiles(percentile_rows)
159 def _add_100_percentile_to_df_percentiles(percentile_rows):
160
--> 161 r = percentile_rows[0]
162 if r["percentile_ex_nulls"] != 1.0:
163 first_row = deepcopy(r)
IndexError: list index out of range
The code assumes that percentile_rows is non-empty, but for an all-null column the list is empty, so percentile_rows[0] raises the IndexError. It would be easy enough to check that percentile_rows is non-empty before calling _add_100_percentile_to_df_percentiles(percentile_rows), but I'm not sure what the desired behaviour would be: a warning, or silently skipping the chart for that column?
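A minimal sketch of the check, shown here with the warning option. This is not Splink's actual code: `add_100_percentile` is a stand-in whose first three lines mirror the traceback and whose remainder is guessed, and `percentiles_or_none` and its `column_name` parameter are hypothetical names used only for illustration.

```python
import logging
from copy import deepcopy

logger = logging.getLogger(__name__)


def add_100_percentile(percentile_rows):
    """Stand-in for splink's _add_100_percentile_to_df_percentiles.

    Only the first three statements are visible in the traceback;
    everything after deepcopy(r) is an assumption for illustration.
    """
    r = percentile_rows[0]  # raises IndexError when the list is empty
    if r["percentile_ex_nulls"] != 1.0:
        first_row = deepcopy(r)
        first_row["percentile_ex_nulls"] = 1.0  # assumed behaviour
        percentile_rows = [first_row] + percentile_rows
    return percentile_rows


def percentiles_or_none(percentile_rows, column_name):
    """Hypothetical guard: skip columns that produced no percentile rows."""
    if not percentile_rows:
        # An all-null column yields no rows, so warn and skip its chart
        logger.warning(
            "Column %r has no non-null values; skipping its chart", column_name
        )
        return None
    return add_100_percentile(percentile_rows)
```

The caller would then leave out the chart for any column where the guard returns None. Dropping the logger.warning call gives the silent variant.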
To Reproduce
Add a null column to any input df and then profile only that column:
OS:
All environments
Splink version:
3.3.3