kearnz / autoimpute

Python package for Imputation Methods
MIT License

Bug in MissingnessClassifier -> TypeError: data type 'k' not understood #56

Closed: duncanjjansen closed this issue 3 years ago

duncanjjansen commented 3 years ago

The following works:

import pandas as pd
from autoimpute.imputations.mis_classifier import MissingnessClassifier

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

   a   b     c
0  1   2   NaN
1  5  10  20.0

MissingnessClassifier().fit(df)

MissingnessClassifier(classifier=XGBClassifier(base_score=None, booster=None,
                                               colsample_bylevel=None,
                                               colsample_bynode=None,
                                               colsample_bytree=None,
                                               gamma=None, gpu_id=None,
                                               importance_type='gain',
                                               interaction_constraints=None,
                                               learning_rate=None,
                                               max_delta_step=None,
                                               max_depth=None,
                                               min_child_weight=None,
                                               missing=nan,
                                               monotone_constraints=None,
                                               n_estimators=100, n_jobs=None,
                                               num_parallel_tree=None,
                                               random_state=None,
                                               reg_alpha=None, reg_lambda=None,
                                               scale_pos_weight=None,
                                               subsample=None, tree_method=None,
                                               validate_parameters=None,
                                               verbosity=None))

However, when I rename column 'a' to 'k', the same call to fit raises a TypeError:

data_test = [{'k': 1, 'b': 2},{'k': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data_test)
print(df)

   k   b     c
0  1   2   NaN
1  5  10  20.0

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-125-8fdb2eb9e304> in <module>
----> 1 MissingnessClassifier().fit(df)

C:\ProgramData\Miniconda3\envs\project_prorail\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

C:\ProgramData\Miniconda3\envs\project_prorail\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

C:\ProgramData\Miniconda3\envs\project_prorail\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

C:\ProgramData\Miniconda3\envs\project_prorail\lib\site-packages\autoimpute\imputations\mis_classifier.py in fit(self, X, **kwargs)
    135         for column in self.data_mi:
    136             # only fit non time-based columns...
--> 137             if not np.issubdtype(column, np.datetime64):
    138                 y = self.data_mi[column]
    139                 preds = self._preds[column]

C:\ProgramData\Miniconda3\envs\project_prorail\lib\site-packages\numpy\core\numerictypes.py in issubdtype(arg1, arg2)
    386     """
    387     if not issubclass_(arg1, generic):
--> 388         arg1 = dtype(arg1).type
    389     if not issubclass_(arg2, generic):
    390         arg2 = dtype(arg2).type

TypeError: data type 'k' not understood

Does anyone have a clue why this would happen and how to fix it?

Python version: 3.8.3

kearnz commented 3 years ago

@duncanjjansen The exception is raised by numpy, but the bug is in autoimpute: the code for the MissingnessClassifier uses an underlying numpy function incorrectly.

What the MissingnessClassifier should do is check the dtype of each column and ensure that it is not a date (autoimpute does not currently support imputation for dates). In the code, the MissingnessClassifier passes each column name to np.issubdtype, but it should pass the dtype of the column instead. You'll see in the numpy docs that the np.issubdtype function takes a dtype or a string representing a typecode, not a column name (obviously)!
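For illustration, here is a minimal sketch of the kind of check described above. It is not the actual patch; the helper name non_datetime_columns and the sample frame are made up for this example. The only point is that np.issubdtype receives the column's dtype rather than its name.

```python
import numpy as np
import pandas as pd


def non_datetime_columns(df):
    """Return the names of columns whose dtype is not datetime64.

    Hypothetical helper: it passes df[col].dtype to np.issubdtype,
    never the column name itself.
    """
    return [
        col for col in df.columns
        if not np.issubdtype(df[col].dtype, np.datetime64)
    ]


# Illustrative data only: 'k' is the column name that triggered the bug.
df = pd.DataFrame({
    "k": [1, 5],
    "b": [2, 10],
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

print(non_datetime_columns(df))
```

This sketch assumes every column has a plain numpy dtype. For frames that may contain pandas extension dtypes, pd.api.types.is_datetime64_any_dtype is probably the safer check.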

The confusion here stems from the fact that some column names are reserved strings that represent dtype codes! a, b, c are strings representing dtype codes, while k is not. If you change k to S, the code will erroneously run, but if you change k to s you'll get the same error as with k.
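To see this for yourself, a small sketch along these lines should reproduce the difference. The helper parses_as_dtype is a made-up name for this example, and the exact set of accepted one-letter codes can vary between numpy versions, so only a few names are tried here.

```python
import numpy as np


def parses_as_dtype(name):
    """Return True if numpy accepts the string as a dtype specifier."""
    try:
        np.dtype(name)
        return True
    except TypeError:
        return False


# 'b' and 'S' are dtype codes; 'k' is not, so np.issubdtype('k', ...) raises.
for name in ["b", "S", "k"]:
    print(name, parses_as_dtype(name))
```

Any column name that fails this check would hit the same TypeError inside np.issubdtype. Names that pass it are silently treated as dtype codes instead of column names.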

I'll have a patch ready for this shortly. Thanks for catching this!

duncanjjansen commented 3 years ago

@kearnz Thanks for the quick reply. Makes sense, had a feeling it was something like this. I'll be waiting for that patch :)

kearnz commented 3 years ago

@duncanjjansen I just released version 0.12.1. This should fix the bug you identified. Let me know if you're having any other problems!