kearnz / autoimpute

Python package for Imputation Methods
MIT License

Categorical columns are not imputed by default #81

Open AnotherSamWilson opened 2 years ago

AnotherSamWilson commented 2 years ago

See the following example:

from autoimpute.imputations import MultipleImputer, SingleImputer

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
from datetime import datetime as dt

def amp_data(dat, rows):
    amputed_data = dat.copy()
    data_shape = dat.shape
    for col in np.arange(data_shape[1]):
        na_ind = random_state.choice(
            np.arange(data_shape[0]), replace=False, size=rows
        )
        amputed_data.iloc[na_ind, col] = np.nan  # np.NaN alias was removed in NumPy 2.0

    return amputed_data

if __name__ == "__main__":
    random_state = np.random.RandomState(5)
    iris = pd.concat(load_iris(return_X_y=True, as_frame=True), axis=1)
    iris["target"] = iris["target"].astype("category")
    iris.columns = [c.replace(" ", "") for c in iris.columns]

    amputed_data = amp_data(iris, 30)

    start = dt.now()
    mi = SingleImputer()
    mi.fit(amputed_data)
    mit = mi.transform(amputed_data)

    print(mit.isnull().sum())

The target column is not imputed by the transform. I see that the default value for strategy is the "predictive default" imputer, which ends up being PMM for numeric columns and multinomial logistic for categorical columns, so I would expect categorical columns to be imputed by default. Is this a bug, or is there some setting I am not aware of?

AnotherSamWilson commented 2 years ago

After some experimenting, I see that setting the column's dtype to "object" fixes this. Still, it's probably a good idea for columns with a categorical dtype to be picked up by whatever categorical imputer the user selects.
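
For anyone hitting the same thing, here is a minimal sketch of that workaround. It assumes only pandas and NumPy; the helper name `categoricals_to_object` is hypothetical and not part of autoimpute. It casts every category-dtype column to object and leaves the missing values in place, so the result can be passed to the imputer.

```python
import numpy as np
import pandas as pd


def categoricals_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every category-dtype column cast to object."""
    out = df.copy()
    cat_cols = out.select_dtypes(include="category").columns
    for col in cat_cols:
        # Missing entries stay missing; only the dtype changes.
        out[col] = out[col].astype(object)
    return out


# Small stand-in for the amputed iris frame from the example above.
demo = pd.DataFrame(
    {
        "sepal_length": [5.1, np.nan, 6.3],
        "target": pd.Categorical([0, None, 2]),
    }
)
converted = categoricals_to_object(demo)
print(converted.dtypes)
```

In the original example this would mean fitting and transforming `categoricals_to_object(amputed_data)` instead of `amputed_data`.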

kearnz commented 2 years ago

Agreed that pd.Categorical should be handled. I'd assume in this case we'd treat them the same as objects, for which we implicitly assume categories (generally the objects are str). Let me know if you see any issue with that or if you'd expect the categorical type to be handled differently.
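
To make the proposal concrete, here is a rough sketch of a dtype check that treats pd.Categorical columns the same as object columns. It is not autoimpute's actual internal logic; the function name `is_implicitly_categorical` is made up for illustration, and only public pandas calls are used.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def is_implicitly_categorical(series: pd.Series) -> bool:
    """True for object columns and for explicit pandas categoricals."""
    return is_object_dtype(series) or isinstance(series.dtype, pd.CategoricalDtype)


frame = pd.DataFrame(
    {
        "num": [1.0, 2.0, None],
        "label": pd.Series(["a", None, "b"], dtype=object),
        "target": pd.Categorical([0, None, 2]),
    }
)

# Columns that a categorical imputer would be asked to handle under this rule.
categorical_columns = [
    col for col in frame.columns if is_implicitly_categorical(frame[col])
]
print(categorical_columns)
```

With a check like this, both `label` and `target` would be routed to the categorical strategy, while `num` stays with the numeric one.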