scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

BaseNEncoder.inverse_transform fails when column contains regex metacharacters #392

Closed pimlock closed 1 year ago

pimlock commented 1 year ago

Expected Behavior

BaseNEncoder.inverse_transform() should work correctly with column names that contain regex metacharacters. For example, in column names such as my_column (test) or test [123], the characters ()[] are currently interpreted as a regex capturing group and character class, but they should be treated as literals.

See: https://github.com/scikit-learn-contrib/category_encoders/blob/1def42827df4a9404553f41255878c45d754b1a0/category_encoders/basen.py#L269

Actual Behavior

Calling inverse_transform() when the input column name contains a regex metacharacter (e.g. ()) raises an exception:

Traceback (most recent call last):
  File "site-packages/IPython/core/interactiveshell.py", line 3397, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-92-c30af6a1928b>", line 10, in <cell line: 10>
    inversed = enc.inverse_transform(transformed)
  File "site-packages/category_encoders/basen.py", line 268, in inverse_transform
    X = self.basen_to_integer(X, self.cols, self.base)
  File "site-packages/category_encoders/basen.py", line 358, in basen_to_integer
    insert_at = out_cols.index(col_list[0])
IndexError: list index out of range
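
The failing expression in the traceback, col_list[0], suggests that col_list came back empty. Below is a minimal sketch of how that can happen, assuming the encoded columns are located by matching an unescaped "<column>_<digits>" pattern against the output column names. The helper matching_columns and the encoded column names are hypothetical illustrations, not the library's actual code; re.escape is shown as one possible way to treat the column name literally.

```python
import re

# Hypothetical encoded column names, assuming a "_<digits>" suffix per output column.
out_cols = ["A (test)_0", "A (test)_1", "A (test)_2"]
col = "A (test)"


def matching_columns(col, out_cols, escape):
    """Return the encoded columns whose names start with `col` plus "_<digits>".

    With escape=False the column name is used as a raw regex, so "(test)"
    becomes a capturing group instead of literal parentheses.
    """
    prefix = re.escape(col) if escape else col
    pattern = prefix + r"_\d+"
    return [c for c in out_cols if re.match(pattern, c)]


unescaped = matching_columns(col, out_cols, escape=False)
escaped = matching_columns(col, out_cols, escape=True)

print(unescaped)  # no column matches, so indexing [0] would raise IndexError
print(escaped)    # all three encoded columns match
```

With the unescaped pattern, the regex looks for the text "A test_0" (the parentheses only group), so none of the real column names match and the resulting list is empty.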

Steps to Reproduce the Problem

from category_encoders import BaseNEncoder
import pandas as pd

col_name = "A (test)"
X = pd.DataFrame(data={col_name: ["A", "B", "A", "C"]})

enc = BaseNEncoder(cols=[col_name]).fit(X)

transformed = enc.transform(X)

# fails with `IndexError: list index out of range` (see traceback above)
inversed = enc.inverse_transform(transformed)

Specifications