scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Returning a numpy array in one hot encoder #442

Open adriencrtr opened 2 months ago

adriencrtr commented 2 months ago

Expected Behavior

Even if category_encoders.one_hot.OneHotEncoder doesn't encode any features, we would expect it to convert a pd.DataFrame into a numpy.ndarray when the parameter return_df=False is set.

Actual Behavior

When category_encoders.one_hot.OneHotEncoder is given a dataframe with only numerical features, so that the parameter cols is empty, and return_df=False is set, the fit_transform method still returns a pd.DataFrame object.

Steps to Reproduce the Problem

import numpy as np
import pandas as pd

from category_encoders.one_hot import OneHotEncoder

rng = np.random.RandomState(42)

This works

n_rows = 100

col1 = rng.rand(n_rows) * 100
col2 = rng.randint(1, 100, n_rows)
col3 = rng.choice([True, False], n_rows)
modalities = ['A', 'B', 'C', 'D']
col4 = rng.choice(modalities, n_rows)

df = pd.DataFrame({
    'Numeric1': col1,
    'Numeric2': col2,
    'Boolean': col3,
    'Object': col4
})

encoder = OneHotEncoder(
    cols=df.select_dtypes(include=["object", "bool"]).columns,
    return_df=False,
    handle_missing='return_nan'
)
X = encoder.fit_transform(df)
type(X)

Out: numpy.ndarray

This is the unexpected behavior

data = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=200)
df = pd.DataFrame(data=data, columns=["Column 1", "Column 2"])

encoder = OneHotEncoder(
    cols=df.select_dtypes(include=["object", "bool"]).columns,
    return_df=False,
    handle_missing='return_nan'
)
X = encoder.fit_transform(df)
type(X)

Out: pandas.core.frame.DataFrame

Specifications

PaulWestenthanner commented 1 month ago

Hi @adriencrtr, the issue is actually that cols must be a list rather than a pandas columns object. Columns objects should be supported in the future, though; that would be a useful addition. I'm leaving the issue open to remind myself to add support for them.
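
For anyone landing here, the list conversion mentioned above can be sketched with pandas alone. The small frame below is made up for illustration; `.columns` on the result of `select_dtypes` is a `pd.Index`, and `.tolist()` turns it into a plain Python list that can then be passed as cols.

```python
import pandas as pd

# Illustrative frame, not the one from the report above.
df = pd.DataFrame({
    "Numeric1": [1.5, 2.5],
    "Boolean": [True, False],
    "Object": pd.Series(["A", "B"], dtype=object),
})

# .columns is a pandas Index, not a list.
selected = df.select_dtypes(include=["object", "bool"]).columns

# Convert to a plain list before handing it to the encoder's cols parameter.
cols = selected.tolist()
```

The resulting `cols` would then be used as `OneHotEncoder(cols=cols, return_df=False, handle_missing='return_nan')`.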

Also, in the case at hand there are no columns of type object or bool. Hence the input is returned as is: https://github.com/scikit-learn-contrib/category_encoders/blob/11fbba6520341e9b960d35dafd44704a67b5bafe/category_encoders/utils.py#L500-L501
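
As a possible caller-side workaround, the result can be coerced to an array after the call. This is only a sketch: `fit_transform_as_array` is a hypothetical helper, not part of category_encoders, and `PassThroughEncoder` is a stand-in that mimics an encoder returning its input when there is nothing to encode.

```python
import numpy as np
import pandas as pd


def fit_transform_as_array(encoder, X):
    """Hypothetical helper: run fit_transform and always return a numpy.ndarray."""
    out = encoder.fit_transform(X)
    # If the encoder handed a DataFrame back, convert it here.
    if isinstance(out, pd.DataFrame):
        return out.to_numpy()
    return np.asarray(out)


class PassThroughEncoder:
    """Stand-in (assumption): behaves like an encoder with no columns to encode."""

    def fit_transform(self, X, y=None):
        return X


df = pd.DataFrame({"Column 1": [0.1, 0.2], "Column 2": [0.3, 0.4]})
X = fit_transform_as_array(PassThroughEncoder(), df)
```

With the real encoder, the same helper would be called as `fit_transform_as_array(encoder, df)`, so the caller gets an array whether or not any column was encoded.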