SingleImputer training on complete cases only

It seems to me that your MiceImputer only uses complete cases to train the SingleImputers, but from what I have read about MICE imputation this should not be the case:

Step 1: A simple imputation, such as imputing the mean, is performed for every missing value in the dataset. These mean imputations can be thought of as “place holders.”
Step 2: The “place holder” mean imputations for one variable (“var”) are set back to missing.
Step 3: The observed values from the variable “var” in Step 2 are regressed on the other variables in the imputation model, which may or may not consist of all of the variables in the dataset. In other words, “var” is the dependent variable in a regression model and all the other variables are independent variables in the regression model. These regression models operate under the same assumptions that one would make when performing linear, logistic, or Poisson regression models outside of the context of imputing missing data.

Citation: Azur, Melissa J., et al. "Multiple imputation by chained equations: what is it and how does it work?." International journal of methods in psychiatric research 20.1 (2011): 40-49. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
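To make the quoted steps concrete, here is a minimal sketch of how I understand the procedure, written with NumPy only. It is not autoimpute's implementation: the function name `mice_cycle` and the parameter `n_cycles` are my own, it handles numeric columns with ordinary least squares only, and it fills missing entries with the fitted prediction instead of drawing from a predictive distribution as a full MICE would.

```python
import numpy as np

def mice_cycle(X, n_cycles=5):
    """Simplified chained-equations imputation for numeric data (illustrative only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: placeholder imputation with the column means.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j
            # Step 2: the placeholders of column j are ignored, so only rows
            # where column j was actually observed serve as regression targets.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Step 3: regress the observed values of column j on the other
            # columns, which keep their current (possibly imputed) values.
            beta, *_ = np.linalg.lstsq(design[obs_j], filled[obs_j, j], rcond=None)
            # Replace the missing entries of column j with the predictions.
            filled[miss_j, j] = design[miss_j] @ beta
    return filled
```

The point relevant to this issue is in Step 3: the regression for a column is fitted on every row where that column is observed, even when the predictors in that row were themselves imputed.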

As a consequence, when there is a missing value in every row, a ValueError is thrown, similar to the one described in #65. In my opinion this happens because none of the samples survive this line: https://github.com/kearnz/autoimpute/blob/a214e7ad2c664cd6c57843934ebf159067d6261f/autoimpute/imputations/dataframe/single_imputer.py#L196 which returns an empty x_ and y_. This could be solved by selecting all samples with an observed y and filling all missing values in x_ with a mean/mode or random imputer, instead of selecting only complete cases in listwise_delete.
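A small pandas sketch of the difference between the two row-selection rules, under my own assumptions: the helper names `complete_case_rows` and `observed_target_rows` are hypothetical and are not part of autoimpute, and the toy frame only mimics the shape of the problem.

```python
import numpy as np
import pandas as pd

def complete_case_rows(df, target):
    """Listwise deletion: keep only rows without any missing value."""
    kept = df.dropna()
    return kept.drop(columns=target), kept[target]

def observed_target_rows(df, target):
    """Keep every row where the target is observed; mean-fill the predictors."""
    kept = df[df[target].notna()]
    predictors = kept.drop(columns=target)
    # Fill predictor gaps with the column means computed on the full frame.
    predictors = predictors.fillna(df.drop(columns=target).mean())
    return predictors, kept[target]

# Toy data: every row has at least one missing value.
df = pd.DataFrame({
    "sex": [np.nan, 1.0, 0.0, 0.0],
    "age": [29.0, np.nan, 2.0, 25.0],
    "fare": [211.3, 151.5, np.nan, np.nan],
})

x_cc, y_cc = complete_case_rows(df, "sex")
x_obs, y_obs = observed_target_rows(df, "sex")
print(len(x_cc), len(x_obs))  # 0 3
```

With listwise deletion nothing is left to fit on, which matches the empty array in the error below, while the second rule keeps the three rows where the target is observed.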

I do not think this issue is related to the number of columns used (as in #65), since I could replicate it by slightly modifying the titanic dataset. Here is the code illustrating the issue. I deliberately introduce a missing value in each row, but no row or column is completely missing, so imputation should still be possible.

import numpy as np
from autoimpute.imputations import MiceImputer

# `titanic` is assumed to be a pandas DataFrame already loaded in the session.
del_cols = ["name", "ticket", "cabin", "survived"]
binary_var = ["sex"]
categorical_var = ["embarked"]
categ_vars = binary_var + categorical_var

startegy_dict = {}

# Encode categorical columns as float codes and pick a strategy per kept column.
for var in titanic.columns.drop(del_cols):
    if var in categ_vars:
        titanic[var] = titanic[var].astype("category").cat.codes.astype(float)
    if var in binary_var:
        startegy_dict[var] = "binary logistic"
    elif var in categ_vars:
        startegy_dict[var] = "multinomial logistic"
    else:
        startegy_dict[var] = "least squares"

# Generate at least one missing value in each row.
def add_miss(x):
    if x.isnull().any():
        return x
    x.loc[x.sample().index] = np.nan
    return x

np.random.seed(42)
titanic_miss = titanic.drop(del_cols, axis=1).apply(add_miss, axis=1)

titanic_miss

sex	age	sibsp	parch	fare	embarked
NaN	29.0000	0.0	0.0	211.3375	2.0
1.0	0.9167	1.0	NaN	151.5500	2.0
0.0	2.0000	1.0	2.0	NaN	2.0
NaN	30.0000	1.0	2.0	151.5500	2.0
0.0	25.0000	NaN	2.0	151.5500	2.0
...	...	...	...	...	...
0.0	14.5000	1.0	NaN	14.4542	0.0
0.0	NaN	1.0	0.0	14.4542	0.0
1.0	26.5000	0.0	NaN	7.2250	0.0
1.0	27.0000	0.0	0.0	7.2250	NaN
1.0	NaN	0.0	0.0	7.8750	2.0

imp = MiceImputer(strategy=startegy_dict, return_list=True)

imp.fit_transform(titanic_miss)

------------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-111-36cd6124a8eb> in <module>
      1 imp = MiceImputer(strategy=startegy_dict, return_list=True)
      2 
----> 3 imp.fit_transform(titanic_miss)

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py in fit_transform(self, X, y, **trans_kwargs)
    240     def fit_transform(self, X, y=None, **trans_kwargs):
    241         """Convenience method to fit then transform the same dataset."""
--> 242         return self.fit(X, y).transform(X, **trans_kwargs)

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py in fit(self, X, y)
    196                 visit=self.visit
    197             )
--> 198             imputer.fit(X)
    199             self.statistics_[i] = imputer
    200 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\utils\checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\dataframe\single_imputer.py in fit(self, X, y, imp_ixs)
    199                 x_ = _one_hot_encode(x_)
    200 
--> 201                 imputer.fit(x_, y_)
    202 
    203             # finally, store imputer for each column as statistics

~\anaconda3\envs\silversight\lib\site-packages\autoimpute\imputations\series\logistic_regression.py in fit(self, X, y)
     61             err = "Binary requires 2 categories. Use multinomial instead."
     62             raise ValueError(err)
---> 63         self.glm.fit(X, y.codes)
     64         self.statistics_ = {"param": y.categories, "strategy": self.strategy}
     65         return self

~\anaconda3\envs\silversight\lib\site-packages\sklearn\linear_model\_logistic.py in fit(self, X, y, sample_weight)
   1340             _dtype = [np.float64, np.float32]
   1341 
-> 1342         X, y = self._validate_data(X, y, accept_sparse='csr', dtype=_dtype,
   1343                                    order="C",
   1344                                    accept_large_sparse=solver != 'liblinear')

~\anaconda3\envs\silversight\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    793         raise ValueError("y cannot be None")
    794 
--> 795     X = check_array(X, accept_sparse=accept_sparse,
    796                     accept_large_sparse=accept_large_sparse,
    797                     dtype=dtype, order=order, copy=copy,

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\silversight\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    648         n_samples = _num_samples(array)
    649         if n_samples < ensure_min_samples:
--> 650             raise ValueError("Found array with %d sample(s) (shape=%s) while a"
    651                              " minimum of %d is required%s."
    652                              % (n_samples, array.shape, ensure_min_samples,

ValueError: Found array with 0 sample(s) (shape=(0, 5)) while a minimum of 1 is required.

Finally, could you tell me more about what this line is doing: https://github.com/kearnz/autoimpute/blob/a214e7ad2c664cd6c57843934ebf159067d6261f/autoimpute/imputations/dataframe/mice_imputer.py#L136 since no fit is called before the transform?

I hope there are enough details for you to help me solve this issue. Thank you for the nice work on the module.

Victor

SingleImputer training on complete cases only #68