scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Allow NaNs for the target values in TransformedTargetRegressor #11339

Open vahidbas opened 6 years ago

vahidbas commented 6 years ago

Description

One potential use case for TransformedTargetRegressor is to get rid of missing values in the target, but currently the initial check in the fit method does not allow such an array.

Steps/Code to Reproduce

Example:

import numpy as np

from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

X, y = datasets.load_linnerud(return_X_y=True)

## put some NaN in y
y[5, 1] = np.nan

estimator = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(),
    func=lambda _y: SimpleImputer().fit_transform(_y),  # because SimpleImputer doesn't have an inverse
    inverse_func=lambda _y: _y,
    check_inverse=False,
)

estimator.fit(X, y)

This raises: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
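Until the validation is relaxed, one workaround (a sketch, assuming mean-imputation is acceptable for the use case, which the discussion below questions) is to impute the target before it ever reaches the estimator:

```python
# Workaround sketch: impute y outside the estimator so the fit-time
# validation never sees NaNs. SimpleImputer is for illustration only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

X, y = datasets.load_linnerud(return_X_y=True)
y[5, 1] = np.nan

y_imputed = SimpleImputer().fit_transform(y)  # column-mean imputation
estimator = DecisionTreeRegressor().fit(X, y_imputed)
print(estimator.predict(X[:1]).shape)  # one sample, three targets
```

This sidesteps TransformedTargetRegressor entirely, which also means the imputation is not part of the fitted pipeline.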

jnothman commented 6 years ago

Perhaps you're right... but when is it a good idea to mean-impute the regression target?

vahidbas commented 6 years ago

@jnothman imputation is only for illustration. I have a relatively high-dimensional, noisy target with occasional missing values. I would like to project it to some low-dimensional space before using it as the predictor's target. The projection is of course lossy, but it helps a lot with the accuracy of the predictive model and resolves the missing-value issue. I have a custom class doing this transformation and its inverse.
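The projection idea above can be sketched without missing values, with PCA standing in for the custom transformer (a hypothetical choice for illustration, not the actual class):

```python
# Sketch: project a multi-output target into a lower-dimensional space
# and let TransformedTargetRegressor invert the projection at predict time.
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

X, y = datasets.load_linnerud(return_X_y=True)  # y has 3 output columns

estimator = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(),
    transformer=PCA(n_components=2),  # lossy low-dimensional projection
    check_inverse=False,              # projection is lossy by design
)
estimator.fit(X, y)
print(estimator.predict(X[:1]).shape)  # predictions back in the 3-D target space
```

With NaNs in y, this still fails at the same validation step, which is the point of the issue.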

glemaitre commented 6 years ago

I am quite concerned about doing something like that. How do you ensure you compute a proper score between a missing target and an imputed target? It seems like the wrong thing to do, doesn't it?

jnothman commented 6 years ago

I suppose it's a way of doing semi-supervised learning...?

vahidbas commented 6 years ago

@jnothman it is a bit more relaxed than semi-supervised learning, as the target might be only partially missing for a sample, while in semi-supervised learning the target is fully missing for a sample.

@glemaitre Missing values will be ignored in the computation of the score; what can go wrong? Example:


import numpy as np

def r2_score_with_nan(y_true, y_pred):
    numerator = np.nansum((y_true - y_pred) ** 2, axis=0, dtype=np.float64)
    denominator = np.nansum((y_true - np.nanmean(y_true, axis=0)) ** 2, axis=0, dtype=np.float64)
    return np.mean(1 - numerator / denominator)

y_pred = np.random.randn(10, 3)
y_true = y_pred + np.random.randn(10, 3) * 0.1
y_true[5, 1] = np.nan

r2_score_with_nan(y_true, y_pred)

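As a sanity check on the score above: in the absence of NaNs it should agree with sklearn's r2_score under the default uniform averaging over outputs (a sketch with a fixed random seed):

```python
# Sanity check: with no NaNs, the NaN-aware R^2 above reduces to
# sklearn's r2_score with multioutput='uniform_average' (the default).
import numpy as np
from sklearn.metrics import r2_score

def r2_score_with_nan(y_true, y_pred):
    numerator = np.nansum((y_true - y_pred) ** 2, axis=0, dtype=np.float64)
    denominator = np.nansum((y_true - np.nanmean(y_true, axis=0)) ** 2, axis=0, dtype=np.float64)
    return np.mean(1 - numerator / denominator)

rng = np.random.RandomState(0)
y_pred = rng.randn(10, 3)
y_true = y_pred + rng.randn(10, 3) * 0.1  # no NaNs in this check

print(np.isclose(r2_score_with_nan(y_true, y_pred),
                 r2_score(y_true, y_pred)))  # True
```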
jnothman commented 6 years ago

Okay, let's minimise validation.

PR welcome.

shreyasramachandran commented 6 years ago

The problem exists in two places: we need to change force_all_finite to False, and the other place is at _function_transformer.py line 103. I just checked the commit you mentioned; it corrects the first part, but even with the first part corrected the issue will not be resolved, because line 179 in _target.py will then produce an error that can be traced back to the other place I mentioned above.
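For context, sklearn's check_array also accepts force_all_finite='allow-nan', which lets NaNs through while still rejecting infinities. A toy version of such a relaxed target check (a hypothetical helper, not the actual patch) might look like:

```python
# Sketch of a relaxed target check: allow NaN, still reject inf.
# Hypothetical helper for illustration, not sklearn's implementation.
import numpy as np

def check_target_allow_nan(y):
    """Validate a regression target, allowing NaN but not inf."""
    y = np.asarray(y, dtype=np.float64)
    if np.isinf(y).any():
        raise ValueError("Input contains infinity.")
    return y

y = np.array([[1.0, np.nan], [3.0, 4.0]])
print(check_target_allow_nan(y).shape)  # (2, 2) -- NaN passes through
```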

coderop2 commented 5 years ago

If nobody is handling the issue, may I have a look at it?

jnothman commented 5 years ago

This is fixed in #11349, which requires a second review.

stelios357 commented 4 years ago

Is this issue still open?

jnothman commented 4 years ago

Yes, it seems so :|

boricles commented 4 years ago

Hi, I am new here. Is there someone working on this? If not, I can give it a try and take this one for the weekend.