skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

TableReport fails for tables with identical column names #1117

Closed Vincent-Maladiere closed 1 month ago

Vincent-Maladiere commented 1 month ago

Describe the bug

When working with TableReport on a subset of a dataframe defined with a list of columns, it's easy to duplicate a column name by mistake.

Currently, the error raised by TableReport is not informative. Instead of failing, we could select the first occurrence of the duplicated column in the dataframe, or raise a clearer error if we foresee some edge cases.
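To make the two options concrete, here is a minimal sketch in plain pandas. The helper name check_unique_columns and its on_duplicate parameter are hypothetical and are not part of skrub; the sketch only illustrates the two behaviors discussed above (raise a clearer error, or keep the first occurrence of each name).

```python
import pandas as pd


def check_unique_columns(df, on_duplicate="raise"):
    """Hypothetical helper (not part of skrub) for pandas dataframes.

    on_duplicate="raise": raise a ValueError that names the duplicated columns.
    on_duplicate="first": keep only the first occurrence of each column name.
    """
    # Boolean mask that is True for every repeated occurrence of a name.
    duplicated = df.columns.duplicated()
    if not duplicated.any():
        return df
    if on_duplicate == "raise":
        names = list(df.columns[duplicated].unique())
        raise ValueError(f"Column names must be unique, found duplicates: {names}")
    if on_duplicate == "first":
        return df.loc[:, ~duplicated]
    raise ValueError(f"Unknown on_duplicate option: {on_duplicate!r}")


df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})[["a", "b", "a"]]

# Option 1: keep the first occurrence of each duplicated name.
deduplicated = check_unique_columns(df, on_duplicate="first")
print(list(deduplicated.columns))

# Option 2: fail early with a message that names the offending columns.
try:
    check_unique_columns(df)
except ValueError as error:
    message = str(error)
    print(message)
```

Silently keeping the first occurrence is convenient but could hide a real mistake when the duplicated columns hold different values, which may be one of the edge cases that argue for the clearer error.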

Steps/Code to Reproduce

import pandas as pd
from skrub import TableReport

df = pd.DataFrame({"a": [1, 3]})
TableReport(df[["a", "a"]]).open()

Expected Results

No error

Actual Results

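It fails because the column passed to sbd.name(column) is a dataframe rather than a single column: selecting a duplicated name returns every matching column, and sbd.name has no implementation for dataframes.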
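The underlying pandas behavior can be checked without skrub. This is only a sketch of the root cause, assuming that sbd.col ends up doing a plain label lookup on a pandas dataframe (not verified against the skrub source here).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3]})
subset = df[["a", "a"]]

# With duplicated names, a label lookup returns a DataFrame holding
# every matching column instead of a single Series.
column = subset["a"]
print(type(column).__name__)
print(column.shape)

# With unique names, the same lookup returns a Series.
unique_column = df["a"]
print(type(unique_column).__name__)
```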

traceback

```
/Users/vincentmaladiere/dev/inria/skrub/skrub/_reporting/_utils.py:24: UserWarning: DataFrame columns are not unique, some columns will be omitted.
  return df.to_dict(orient="list")
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[2], line 5
      2 from skrub import TableReport
      4 df = pd.DataFrame({"a": [1, 3]})
----> 5 TableReport(df[["a", "a"]]).open()

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:194, in TableReport.open(self)
    192 def open(self):
    193     """Open the HTML report in a web browser."""
--> 194     open_in_browser(self.html())

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:152, in TableReport.html(self)
    143 def html(self):
    144     """Get the report as a full HTML page.
    145
    146     Returns
   (...)
    149         The HTML page.
    150     """
    151     return to_html(
--> 152         self._summary_with_plots,
    153         standalone=True,
    154         column_filters=self.column_filters,
    155     )

File ~/miniforge3/envs/skrub/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
    999 val = cache.get(self.attrname, _NOT_FOUND)
   1000 if val is _NOT_FOUND:
-> 1001     val = self.func(instance)
   1002     try:
   1003         cache[self.attrname] = val

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:127, in TableReport._summary_with_plots(self)
    125 @functools.cached_property
    126 def _summary_with_plots(self):
--> 127     return summarize_dataframe(
    128         self.dataframe, with_plots=True, title=self.title, **self._summary_kwargs
    129     )

File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:78, in summarize_dataframe(df, order_by, with_plots, title, max_top_slice_size, max_bottom_slice_size)
     73 for position, column_name in enumerate(sbd.column_names(df)):
     74     print(
     75         f"Processing column {position + 1: >3} / {n_columns}", end="\r", flush=True
     76     )
     77     summary["columns"].append(
---> 78         _summarize_column(
     79             sbd.col(df, column_name),
     80             position,
     81             dataframe_summary=summary,
     82             with_plots=with_plots,
     83             order_by_column=None if order_by is None else sbd.col(df, order_by),
     84         )
     85     )
     86 print(flush=True)
     87 summary["n_constant_columns"] = sum(
     88     c["value_is_constant"] for c in summary["columns"]
     89 )

File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:109, in _summarize_column(column, position, dataframe_summary, with_plots, order_by_column)
    103 def _summarize_column(
    104     column, position, dataframe_summary, *, with_plots, order_by_column
    105 ):
    106     summary = {
    107         "position": position,
    108         "idx": position,
--> 109         "name": sbd.name(column),
    110         "dtype": _utils.get_dtype_name(column),
    111         "value_is_constant": False,
    112     }
    113     _add_nulls_summary(summary, column, dataframe_summary=dataframe_summary)
    114     if summary["null_count"] == dataframe_summary["n_rows"]:

File ~/miniforge3/envs/skrub/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File ~/dev/inria/skrub/skrub/_dataframe/_common.py:407, in name(col)
    405 @dispatch
    406 def name(col):
--> 407     raise NotImplementedError()

NotImplementedError:
```

Versions

0.4.dev0