iskandr / fancyimpute

Multivariate imputation and matrix completion algorithms implemented in Python
Apache License 2.0
1.25k stars 178 forks

BiScaler doesn't conform to scikit-learn API #6

Closed sergeyf closed 8 years ago

sergeyf commented 8 years ago

I tried to stick BiScaler into a scikit-learn pipeline as follows:

from fancyimpute import BiScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

biscaler = BiScaler()
model = Ridge()
pipeline = Pipeline([('biscaler', biscaler), ('model', model)])

But it doesn't seem to work. I get the following error:

Traceback (most recent call last):

  File "<ipython-input-18-ecdf2c799537>", line 5, in <module>
    pipeline.fit(X[train], y[train])

  File "D:\Anaconda2\lib\site-packages\sklearn\pipeline.py", line 164, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)

  File "D:\Anaconda2\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])

TypeError: fit_transform() takes exactly 2 arguments (3 given)
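The traceback points at a signature mismatch: scikit-learn's Pipeline calls fit_transform(X, y, **fit_params) on every intermediate step, while BiScaler.fit_transform accepts only X. A minimal sketch of the signature a Pipeline step needs (the class name and the column-mean body are hypothetical stand-ins, not fancyimpute's actual scaler):

```python
import numpy as np

class YIgnoringTransformer:
    """Sketch of the interface scikit-learn's Pipeline expects.

    Pipeline calls fit_transform(X, y, **fit_params) on intermediate
    steps, so a transformer whose fit_transform takes only X raises
    the TypeError shown above. Accepting (and ignoring) y fixes the
    call signature.
    """

    def fit(self, X, y=None):
        # hypothetical stand-in for BiScaler's fitting step:
        # here we only learn column means
        self.col_means_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.col_means_

    def fit_transform(self, X, y=None, **fit_params):
        # y is accepted for API compatibility but unused
        return self.fit(X, y).transform(X)
```

With this signature, `Pipeline([('scaler', YIgnoringTransformer()), ('model', Ridge())])` would no longer raise the TypeError, though the transductive issue discussed below remains.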

when trying this code:

import numpy as np
from sklearn.cross_validation import KFold  # pre-0.18 API, matching the traceback

n, d = X.shape
kfold = KFold(n=n, n_folds=5, shuffle=True)
errors = []
for k, (train, test) in enumerate(kfold):
    pipeline.fit(X[train], y[train])
    y_est = pipeline.predict(X[test])
    errors.append(np.mean(np.abs(y_est - y[test])))

sergeyf commented 8 years ago

Erm, maybe I'm being dumb here. BiScaler learns row means, but the test set has totally different rows. Is this thing really not applicable to non-transductive cases? Do you just have to apply it separately to X_train and X_test? That is, do fit_transform twice? That seems... wrong?

iskandr commented 8 years ago

I think that the only way to apply it non-transductively is to biscale the training set, save the per-column parameters (mean/std), and then row-scale the test set.
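A rough sketch of that procedure (the function names are hypothetical, not fancyimpute's API): column statistics come from the training set only, and each new row is centered with its own mean since test rows were never seen at fit time.

```python
import numpy as np

def fit_column_params(X_train):
    """Learn per-column mean/std from the training data."""
    mu = np.nanmean(X_train, axis=0)
    sigma = np.nanstd(X_train, axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return mu, sigma

def row_scale(X, mu, sigma):
    """Scale columns with saved training parameters, then center rows."""
    # columns: use the parameters learned on the training set
    Z = (X - mu) / sigma
    # rows: center each sample with its own mean, since row parameters
    # cannot be learned ahead of time for unseen rows
    return Z - np.nanmean(Z, axis=1, keepdims=True)
```

This mirrors the asymmetry in the comment above: column parameters transfer across datasets, row parameters do not.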

Should that be the default behavior of transform? This might make inverse_transform particularly confusing, since it could only be used on the most recent dataset on which transform was called.

sergeyf commented 8 years ago

I wonder if this kind of non-transductive application would totally ruin predictive power... Best to just leave it alone, I think.