kearnz / autoimpute

Python package for Imputation Methods
MIT License

SingleImputer training on complete cases only #68

Open stimpfli opened 3 years ago

stimpfli commented 3 years ago

It seems to me that your MiceImputer only uses complete cases to train the SingleImputers, but from what I have read about MICE imputation, that should not be the case:

Step 1: A simple imputation, such as imputing the mean, is performed for every missing value in the dataset. These mean imputations can be thought of as “place holders.”
Step 2: The “place holder” mean imputations for one variable (“var”) are set back to missing.
Step 3: The observed values from the variable “var” in Step 2 are regressed on the other variables in the imputation model, which may or may not consist of all of the variables in the dataset. In other words, “var” is the dependent variable in a regression model and all the other variables are independent variables in the regression model. These regression models operate under the same assumptions that one would make when performing linear, logistic, or Poisson regression models outside of the context of imputing missing data.

See for citation: Azur, Melissa J., et al. "Multiple imputation by chained equations: what is it and how does it work?." International journal of methods in psychiatric research 20.1 (2011): 40-49. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
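The three quoted steps can be sketched for a single numeric variable as follows. This is a minimal illustration, not autoimpute's implementation: the helper name `mice_step`, the toy data, and the use of ordinary least squares via `numpy.linalg.lstsq` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def mice_step(df, var):
    """One chained-equation update for column `var` (hypothetical helper)."""
    missing = df[var].isna()

    # Step 1: fill every missing value with its column mean (placeholders).
    filled = df.fillna(df.mean())

    # Step 2: discard the placeholders of `var`, i.e. fit only on the rows
    # where `var` was actually observed.
    predictors = filled.drop(columns=var)
    X_obs = predictors[~missing].to_numpy()
    y_obs = df.loc[~missing, var].to_numpy()

    # Step 3: regress observed `var` on the other, placeholder-filled columns.
    A_obs = np.column_stack([np.ones(len(X_obs)), X_obs])
    coef, *_ = np.linalg.lstsq(A_obs, y_obs, rcond=None)

    # Predict the missing entries of `var` from the fitted model.
    X_mis = predictors[missing].to_numpy()
    A_mis = np.column_stack([np.ones(len(X_mis)), X_mis])
    out = filled.copy()
    out.loc[missing, var] = A_mis @ coef
    return out


# Toy data (assumed): every row has one missing value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, np.nan],
    "c": [np.nan, 1.0, 0.0, np.nan, 0.0],
})
updated = mice_step(toy, "a")
```

Note that the regression in Step 3 still has four rows to fit on, even though no row of `toy` is a complete case, because the predictors keep their placeholder values.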

As a consequence, when every row contains a missing value, a ValueError is thrown, similar to the one described in #65. In my opinion this happens because no sample survives this line: https://github.com/kearnz/autoimpute/blob/a214e7ad2c664cd6c57843934ebf159067d6261f/autoimpute/imputations/dataframe/single_imputer.py#L196 which returns an empty x_ and y_. This could be solved by selecting all samples with an observed y and filling the missing values in x_ with a mean/mode or random imputer, instead of keeping only complete cases in listwise_delete.
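To make the difference concrete, the sketch below contrasts complete-case selection with the alternative proposed above, on a toy frame where every row has one missing value. It is an illustration only: `complete_cases` and `observed_target_cases` are hypothetical helpers, not functions from autoimpute, and mean filling stands in for whichever placeholder strategy would be chosen.

```python
import numpy as np
import pandas as pd


def complete_cases(df, target):
    """Keep only rows with no missing value at all (listwise deletion)."""
    kept = df.dropna()
    return kept.drop(columns=target), kept[target]


def observed_target_cases(df, target):
    """Keep rows where `target` is observed; mean-fill the predictors."""
    kept = df[df[target].notna()]
    x = kept.drop(columns=target)
    # Placeholder fill uses the means of the full predictor columns.
    x = x.fillna(df.drop(columns=target).mean())
    return x, kept[target]


# Toy data (assumed): each row has exactly one missing value,
# but no row or column is entirely missing.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [np.nan, 151.55, 151.55, np.nan],
    "sibsp": [0.0, 1.0, np.nan, 1.0],
})

x_cc, y_cc = complete_cases(toy, "age")
x_ot, y_ot = observed_target_cases(toy, "age")
print(x_cc.shape, x_ot.shape)  # (0, 2) (3, 2)
```

Complete-case selection leaves zero rows, which is the kind of empty input that triggers the error reported in this issue, while selecting on the observed target keeps three usable rows.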

I do not think this issue is related to the number of columns used (as in #65), since I could reproduce it by slightly modifying the titanic dataset. Here is the code illustrating the issue. I deliberately introduce a missing value in each row, but no row or column is completely missing, so imputation should still be possible.

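import numpy as np
from autoimpute.imputations import MiceImputer

# `titanic` is assumed to be a DataFrame loaded beforehand (loading not shown)
del_cols = ["name", "ticket", "cabin", "survived"]
binary_var = ["sex"]
categorical_var = ["embarked"]
categ_vars = binary_var + categorical_var

startegy_dict = {}

# iterate over the columns kept for imputation (titanic_miss is only built below)
for var in titanic.drop(del_cols, axis=1).columns:
    if var in categ_vars:
        # encode categories as float codes; note that cat.codes maps
        # values that are already missing to -1, not NaN
        titanic[var] = titanic[var].astype("category").cat.codes.astype(float)
    if var in binary_var:
        startegy_dict[var] = "binary logistic"
    elif var in categ_vars:
        startegy_dict[var] = "multinomial logistic"
    else:
        startegy_dict[var] = "least squares"

# ensure at least one missing value in each row
def add_miss(x):
    if x.isnull().any():
        return x
    else:
        # blank out one randomly chosen entry of the row
        x.loc[x.sample().index] = np.nan
        return x

np.random.seed(42)
titanic_miss = titanic.drop(del_cols, axis=1).apply(add_miss, axis=1)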

titanic_miss
sex      age  sibsp  parch      fare  embarked
NaN  29.0000    0.0    0.0  211.3375       2.0
1.0   0.9167    1.0    NaN  151.5500       2.0
0.0   2.0000    1.0    2.0       NaN       2.0
NaN  30.0000    1.0    2.0  151.5500       2.0
0.0  25.0000    NaN    2.0  151.5500       2.0
...      ...    ...    ...       ...       ...
0.0  14.5000    1.0    NaN   14.4542       0.0
0.0      NaN    1.0    0.0   14.4542       0.0
1.0  26.5000    0.0    NaN    7.2250       0.0
1.0  27.0000    0.0    0.0    7.2250       NaN
1.0      NaN    0.0    0.0    7.8750       2.0

imp = MiceImputer(strategy=startegy_dict, return_list=True)

imp.fit_transform(titanic_miss)

------------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-111-36cd6124a8eb> in <module>
      1 imp = MiceImputer(strategy=startegy_dict, return_list=True)
      2 
----> 3 imp.fit_transform(titanic_miss)

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py in fit_transform(self, X, y, **trans_kwargs)
    240     def fit_transform(self, X, y=None, **trans_kwargs):
    241         """Convenience method to fit then transform the same dataset."""
--> 242         return self.fit(X, y).transform(X, **trans_kwargs)

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py in fit(self, X, y)
    196                 visit=self.visit
    197             )
--> 198             imputer.fit(X)
    199             self.statistics_[i] = imputer
    200 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\single_imputer.py in fit(self, X, y, imp_ixs)
    199                 x_ = _one_hot_encode(x_)
    200 
--> 201                 imputer.fit(x_, y_)
    202 
    203             # finally, store imputer for each column as statistics

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\series\logistic_regression.py in fit(self, X, y)
     61             err = "Binary requires 2 categories. Use multinomial instead."
     62             raise ValueError(err)
---> 63         self.glm.fit(X, y.codes)
     64         self.statistics_ = {"param": y.categories, "strategy": self.strategy}
     65         return self

~\anaconda3\envs\silversight\lib\site-packages\sklearn\linear_model\_logistic.py in fit(self, X, y, sample_weight)
   1340             _dtype = [np.float64, np.float32]
   1341 
-> 1342         X, y = self._validate_data(X, y, accept_sparse='csr', dtype=_dtype,
   1343                                    order="C",
   1344                                    accept_large_sparse=solver != 'liblinear')

~\anaconda3\envs\silversight\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    793         raise ValueError("y cannot be None")
    794 
--> 795     X = check_array(X, accept_sparse=accept_sparse,
    796                     accept_large_sparse=accept_large_sparse,
    797                     dtype=dtype, order=order, copy=copy,

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    648         n_samples = _num_samples(array)
    649         if n_samples < ensure_min_samples:
--> 650             raise ValueError("Found array with %d sample(s) (shape=%s) while a"
    651                              " minimum of %d is required%s."
    652                              % (n_samples, array.shape, ensure_min_samples,

ValueError: Found array with 0 sample(s) (shape=(0, 5)) while a minimum of 1 is required.

Finally, could you tell me more about what this line is doing: https://github.com/kearnz/autoimpute/blob/a214e7ad2c664cd6c57843934ebf159067d6261f/autoimpute/imputations/dataframe/mice_imputer.py#L136 As far as I can see, no fit is called before the transform.

I hope there are enough details for you to help me solve this issue. Thank you for the nice work on the module.

Victor

kearnz commented 3 years ago

@stimpfli

Thanks for the detailed write-up! It makes sense to me, and I'm familiar with that source. We actually had placeholder values in the earlier days of the package. Let me check my original code, but I'll leave this issue open and get it on the roadmap. I don't have timelines at the moment, but I'm looking to tackle a number of issues in the coming weeks, so it should be soon.

Regarding your second question, the MiceImputer is a subclass of the MultipleImputer, so it uses the same fit and fit_transform methods but overrides the transform method itself in order to perform k column updates. In performing k column updates, it calls the underlying imputer's transform method k times. That transform method ensures fit has already been called, or it throws an error. Therefore, the transform call you reference will only work if the underlying imputer has already been fit.
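The relationship described above can be sketched with toy classes. These are hypothetical stand-ins written for illustration, not autoimpute's actual classes or signatures: the subclass inherits `fit` and `fit_transform`, overrides `transform`, and relies on the underlying imputer's own fitted check.

```python
class ToySingleImputer:
    """Stand-in for an underlying imputer with a fitted check in transform."""

    def __init__(self):
        self.fitted_ = False
        self.calls = 0

    def fit(self, X):
        self.fitted_ = True
        return self

    def transform(self, X):
        if not self.fitted_:
            raise RuntimeError("transform called before fit")
        self.calls += 1
        return X


class ToyMultipleImputer:
    """Parent class: owns fit, transform, and fit_transform."""

    def __init__(self):
        self.imputer = ToySingleImputer()

    def fit(self, X):
        self.imputer.fit(X)
        return self

    def transform(self, X):
        return self.imputer.transform(X)

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class ToyMiceImputer(ToyMultipleImputer):
    """Subclass: reuses fit / fit_transform, overrides transform only."""

    def __init__(self, k=3):
        super().__init__()
        self.k = k

    def transform(self, X):
        # k column updates, each delegating to the already-fitted imputer.
        for _ in range(self.k):
            X = self.imputer.transform(X)
        return X


mice = ToyMiceImputer(k=3)
mice.fit_transform([[1.0, 2.0]])  # inherited fit runs first, then 3 transforms
```

Calling `transform` on a fresh `ToyMiceImputer` without `fit` raises the `RuntimeError` from the underlying imputer, which mirrors the guard described in the reply.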

Let me know if you have any other questions!