kearnz / autoimpute

Python package for Imputation Methods
MIT License

dimension mismatch error on fit_transform for MultipleImputer #41

Closed gargashish11 closed 1 year ago

gargashish11 commented 4 years ago

I'm trying to impute missing values in the Titanic test data set using the Python autoimpute package. However, the module throws a dimension mismatch error on the test data set.

kaggle titanic test data

import pandas as pd
from autoimpute.imputations import MultipleImputer

## titanic database test csv
X_test = pd.read_csv('test.csv') ## response is missing
X_test = X_test.drop(labels=['PassengerId','Name','Ticket','Cabin'], axis=1)

test_imp = MultipleImputer(seed = 1, return_list=True,
                     strategy={"Age": "pmm", "Fare": "pmm"},
                     imp_kwgs={"pmm": {"normalize": True, "n_jobs" : -1}},
                     )
print(X_test.isnull().sum())
X_test = test_imp.fit_transform(X_test)[0][1]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-322f4b0b8702> in <module>
     13                      )
     14 print(X_test.isnull().sum())
---> 15 X_test = test_imp.fit_transform(X_test)[0][1]

~/intelpython3/lib/python3.7/site-packages/autoimpute/imputations/dataframe/multiple_imputer.py in fit_transform(self, X, y)
    229     def fit_transform(self, X, y=None):
    230         """Convenience method to fit then transform the same dataset."""
--> 231         return self.fit(X, y).transform(X)

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~/intelpython3/lib/python3.7/site-packages/autoimpute/imputations/dataframe/multiple_imputer.py in transform(self, X)
    224                    for i in self.statistics_.items())
    225         if self.return_list:
--> 226             imputed = list(imputed)
    227         return imputed
    228 

~/intelpython3/lib/python3.7/site-packages/autoimpute/imputations/dataframe/multiple_imputer.py in <genexpr>(.0)
    222         # sequential only for now
    223         imputed = ((i[0], i[1].transform(X))
--> 224                    for i in self.statistics_.items())
    225         if self.return_list:
    226             imputed = list(imputed)

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~/intelpython3/lib/python3.7/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~/intelpython3/lib/python3.7/site-packages/autoimpute/imputations/dataframe/single_imputer.py in transform(self, X)
    261 
    262             # perform imputation given the specified imputer and value for x_
--> 263             X.loc[imp_ix, column] = imputer.impute(x_)
    264         return X
    265 

~/intelpython3/lib/python3.7/site-packages/autoimpute/imputations/series/pmm.py in impute(self, X)
    190         print(X.T.shape)
    191         print(X.T)
--> 192         y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T)
    193         n_ = self.neighbors
    194         if X.columns.size == 1:

ValueError: shapes (7,) and (4,1) not aligned: 7 (dim 0) != 4 (dim 0)

However, the train dataset goes through imputation without a problem. The error also persists on the latest Anaconda distribution.

Also, imputing the test dataset works fine without the "Fare" part of the strategy. However, a second step that passes the "completed" dataset from the first step to a new MultipleImputer instance with "Fare" in the strategy still produces the same error.
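For reference, the final ValueError in the traceback is NumPy's standard complaint when `dot` is called on operands whose inner dimensions differ. The sketch below reproduces only that shape mismatch in isolation; the arrays `beta` and `X` are hypothetical stand-ins chosen to match the reported shapes, not the values autoimpute actually computes.

```python
import numpy as np

# Hypothetical stand-ins matching the shapes in the reported error:
# a coefficient vector with 7 entries and a single row with 4 predictors.
beta = np.ones(7)
X = np.ones((1, 4))

try:
    # X.T has shape (4, 1), so the inner dimensions (7 vs 4) do not agree.
    beta.dot(X.T)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

The mismatch suggests that the number of fitted coefficients differs from the number of predictor columns passed at transform time, though this sketch does not show why that happens inside the package.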

Thanks a lot.


varshithvvs commented 4 years ago

@gargashish11 I have been facing the same issue. Are there any updates or solutions to the problem?

gargashish11 commented 4 years ago

@varshithvvs I haven't used this repo since running into the issue.

kearnz commented 4 years ago

@gargashish11 @varshithvvs apologies, I missed the original issue.

Are you getting this error using solely titanic test data and the code above? If so, I can try to recreate.

varshithvvs commented 4 years ago

@kearnz I have only tested on the Titanic data set. The issue also vanishes after I reset my index, but I still couldn't figure out a logical reason why that would happen.
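A minimal sketch of the index reset mentioned above, using only pandas. The small frame here is a hypothetical stand-in for the Titanic test data, and it assumes the frame carries a non-default index (for example after filtering or splitting); whether this addresses the root cause in autoimpute is not established in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test frame, with a non-default index.
X_test = pd.DataFrame(
    {"Age": [22.0, np.nan, 35.0], "Fare": [7.25, 8.05, np.nan]},
    index=[10, 42, 7],
)

# Replace the index with a fresh 0..n-1 RangeIndex.
# drop=True discards the old labels instead of adding them as a column.
X_test = X_test.reset_index(drop=True)

print(X_test.index)
```

The reset frame can then be passed to `fit_transform` as in the original snippet.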

kearnz commented 4 years ago

@varshithvvs I'll take a look as soon as I can, likely later this week or over the weekend. Thanks for your patience.

kearnz commented 1 year ago

closing this, see #68