sdv-dev / RDT

A library of Reversible Data Transforms

Fitting with `numerical` column names fails #328

Open pvk-developer opened 2 years ago

pvk-developer commented 2 years ago


Error Description

Fitting any transformer on a pd.DataFrame whose column names are a RangeIndex, or any other numerical values, fails.

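This bug can produce two different errors, depending on how the columns are passed:

  1. Multiple columns: TypeError: unhashable type: 'RangeIndex'
  2. Single column: TypeError: sequence item 0: expected str instance, int found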

Steps to reproduce

Multiple columns

import pandas as pd
from rdt.transformers import OneHotEncoder

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

ohe = OneHotEncoder()
ohe.fit(data, data.columns)

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-9be2b41b4858> in <module>
----> 1 ohe.fit(data, data.columns)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
    163                 Column names. Must be present in the data.
    164         """
--> 165         self._store_columns(columns, data)
    166 
    167         columns_data = self._get_columns_data(data, self.columns)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _store_columns(self, columns, data)
    112             columns = [columns]
    113 
--> 114         missing = set(columns) - set(data.columns)
    115         if missing:
    116             raise KeyError(f'Columns {missing} were not present in the data.')

~/.virtualenvs/RDT/lib/python3.9/site-packages/pandas/core/indexes/base.py in __hash__(self)
   4076 
   4077     def __hash__(self):
-> 4078         raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
   4079 
   4080     def __setitem__(self, key, value):

TypeError: unhashable type: 'RangeIndex'

Single column

ohe.fit(data, 0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-5d4d0160e7be> in <module>
----> 1 ohe.fit(data, 0)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
    168         self._fit(columns_data)
    169 
--> 170         self._build_output_columns(data)
    171 
    172     def _transform(self, columns_data):

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _build_output_columns(self, data)
    136 
    137     def _build_output_columns(self, data):
--> 138         self.column_prefix = '#'.join(self.columns)
    139         self.output_columns = list(self.get_output_types().keys())
    140 

TypeError: sequence item 0: expected str instance, int found

Notes

These errors are raised in _store_columns when multiple columns are passed and in _build_output_columns when a single column is passed.
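
Both failures can be reproduced with pandas and plain Python, without RDT. The sketch below mirrors the two statements shown in the tracebacks. It assumes that a non-list `columns` argument is wrapped as `[columns]` (line 112 of base.py in the traceback), so that `set(columns)` tries to hash the RangeIndex itself; the exact condition guarding that line is not visible in the traceback. The variable names `multi_error` and `single_error` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

# Multiple columns: data.columns is a RangeIndex, not a list.
# Assumption: _store_columns wraps it as [columns] before calling set(),
# so the set tries to hash the RangeIndex object itself.
columns = [data.columns]
multi_error = None
try:
    set(columns) - set(data.columns)
except TypeError as error:
    multi_error = error

# Single column: the stored column list is [0], and str.join only
# accepts strings, so an integer column name cannot be joined.
single_error = None
try:
    '#'.join([0])
except TypeError as error:
    single_error = error

print(repr(multi_error))
print(repr(single_error))
```

If that assumption holds, the first error comes from hashing the index object rather than from the integer labels, while the second comes directly from the integer labels.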

npatki commented 2 years ago

I can confirm that this issue still persists in RDT 1.0:

import pandas as pd
from rdt import HyperTransformer

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

ht = HyperTransformer()
ht.detect_initial_config(data)
ht.fit_transform(data)

Output:

TypeError: sequence item 0: expected str instance, int found
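
A possible workaround, not confirmed anywhere in this thread, is to cast the column names to strings before fitting and to pass them as a plain list. The sketch below uses only pandas and checks the two statements that fail in the tracebacks above. Whether the rest of the RDT fitting code then succeeds is not verified here, and the `renamed` variable is just an illustrative name.

```python
import pandas as pd

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

# Cast the integer labels 0, 1, 2 to the strings '0', '1', '2'.
renamed = data.rename(columns=str)

# Pass a plain list of names instead of the Index object.
columns = list(renamed.columns)

# The two statements from the tracebacks now run without a TypeError.
missing = set(columns) - set(renamed.columns)
column_prefix = '#'.join(columns)

print(missing)
print(column_prefix)
```

The transformed output would then carry string column names, so they may need to be mapped back to integers afterwards if the original labels matter.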