scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add `return_predictions` option to the model_selection.cross_validate() API #13478

Open herrlich10 opened 5 years ago

herrlich10 commented 5 years ago

The cross_validate(), cross_val_predict(), and cross_val_score() functions from the model_selection module provide a highly compact and convenient API for most everyday work. Yet oftentimes one would like to get 1) multiple scores for both train and test, and 2) predictions for all samples under cross-validation, i.e. an aggregation of the test predictions from all folds.

The current API doesn't seem to allow both goals to be achieved in a single CV run. Would it be possible to add something like a return_predictions option to the model_selection.cross_validate() API, as in here, to simply allow access to these results (which I believe have already been computed inside the function)?
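For concreteness, a minimal sketch of the duplicated work the current API forces (toy data, illustrative only): to get both train/test scores and out-of-fold predictions, every estimator has to be fitted twice.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, cross_val_predict

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# First CV run: train and test scores for every fold.
scores = cross_validate(clf, X, y, cv=5, return_train_score=True)

# Second CV run: out-of-fold predictions for every sample,
# refitting the same kind of estimator on the same folds a second time.
predictions = cross_val_predict(clf, X, y, cv=5)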

Thank you!

jnothman commented 5 years ago

I think this is a feature many people would appreciate... It allows you to do more error analysis (although it's open to abuse and misinterpretation in that context). What do others think? I don't know if it's better to return the predictions for each split, or to combine them into a vector as cross_val_predict does.

okz12 commented 5 years ago

You can set the argument return_estimator=True, but that isn't very useful unless you know which subset of the data each estimator was trained on. So I agree it would be good to return the complete predictions.

If you keep the predictions separate, the output stays in line with return_estimator, which returns one estimator per CV split, rather than combining them into a single vector as cross_val_predict does. To recheck any result, you would have each estimator and its corresponding predictions.

On the other hand, for most use cases you will just need the predictions combined, as in cross_val_predict, but then you lose the split/index information. It could also be useful to have another flag to return the indices of each split, so you can work backwards if you need the predictions for a specific set.
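To illustrate that last point, a minimal sketch of "working backwards": given hypothetical per-split test predictions and test indices (the names below are illustrative, not an existing API), the combined cross_val_predict-style vector can be reassembled like this.

import numpy as np

def combine_predictions(test_predictions, test_indices, n_samples):
    # test_predictions[i] and test_indices[i] describe the same CV split
    combined = np.empty(n_samples, dtype=test_predictions[0].dtype)
    for preds, idx in zip(test_predictions, test_indices):
        combined[idx] = preds
    return combined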

If the update is approved, I'd like to take it up as my first issue (I worked on #13464 without realising it had already been resolved in #13465, so I don't think it counts)

jnothman commented 5 years ago

I think we need to be careful not to add too much complexity to the cross_validate API.

I'd like to hear from other core devs on this one.

adrinjalali commented 5 years ago

I also would rather not complicate the API too much. Do we have a strong preference against something like:

return_includes : string or list of strings (default='scores')
    - scores: ...
    - train_scores: ...
    - estimators: ...
    - train_predictions: ...
    - test_predictions: ...
    - cv_indices: ...
    - ...
    The output of the function would be a dictionary with the requested items as
    keys; each value would be a list of length n_splits, aligned split by split.
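A usage sketch of what a call under this proposal might look like (return_includes is only the proposed parameter; the key names and shapes are illustrative):

# Hypothetical usage -- not an existing scikit-learn option.
results = cross_validate(
    clf, X, y, cv=5,
    return_includes=['scores', 'test_predictions', 'cv_indices'],
)
# results would be a dict keyed by the requested items; each value a
# list of length n_splits, aligned split by split:
# results['test_predictions'][i] -> predictions on the i-th test fold
# results['cv_indices'][i]       -> indices defining the i-th split
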
jnothman commented 5 years ago

Interesting proposal. It's not really the sort of API we use here much.

One small problem I have with it is that I hadn't imagined the predictions being returned in the dict, because they wouldn't straightforwardly be used as an array of length n_splits.

thomasjpfan commented 5 years ago

If we want the smallest change possible, we would only need to add return_cv_indices. Together with return_estimator, a user can recreate the predictions. This also lets the user know which split an estimator was trained on.

adrinjalali commented 5 years ago

One small problem I have with it is that I hadn't imagined the predictions being returned in the dict, because they wouldn't straightforwardly be used as an array of length n_splits.

My idea was to have the predictions as a list of length n_splits of arrays of length split_size. That, with cv_indices included in the same way and shape, would give users all they need.

smarie commented 5 years ago

This issue was a critical one for one of our machine learning components in production, because we needed to compute several performance metrics for both train and test, as well as confidence intervals.

This is what I ended up doing: first, the details are created in a simple format that looks quite like @adrinjalali's proposal:

import logging

from sklearn.base import clone

default_logger = logging.getLogger(__name__)


def run_sklearn_cv_with_details(X, y, cv, model, logger=default_logger):
    """
    Runs a cross-validation manually, as suggested on the scikit-learn forums, in order to get
    details to be used for plots and custom performance evaluation.

    Indeed there is currently no way in scikit-learn to get the intermediate predictions in a
    simple way:

     - there is no way to remember the predictions of the best parameters in grid search today,
       so we have to perform an EXTRA cross-validation with the best parameters found.
       See https://github.com/scikit-learn/scikit-learn/issues/5030

     - same for a simple cv: either you use cross_validate or cross_val_score to get the scores,
       OR you use cross_val_predict to get the predictions. BUT the latter only returns test
       predictions AND only works if the cv is a partition, not a single split or a time series
       split!

    See http://scikit-learn.org/stable/modules/cross_validation.html#obtaining-predictions-by-cross-validation
    and https://stackoverflow.com/questions/43275506/scikit-learn-cross-val-predict-only-works-for-partitions

    :return: the list of split indices (train/test) and the list of predictions (on train/test).
    """
    # init variables
    idx_train_test = []
    y_pred_train_test = []

    # save the original model so we can clone it for each split
    orig_model = model

    for split_i, (idx_train, idx_test) in enumerate(cv.split(X, y)):
        logger.info(" -- Split {}".format(split_i + 1))

        # remember the train/test indices for that split
        idx_train_test.append((idx_train, idx_test))

        logger.info(" -- Training temporary model")
        model = clone(orig_model)  # create a copy first
        model.fit(X[idx_train], y[idx_train])

        logger.info(" -- Applying temporary model on Training set")
        y_train_predicted = model.predict(X[idx_train])

        logger.info(" -- Applying temporary model on Test set")
        y_test_predicted = model.predict(X[idx_test])

        # remember predictions
        y_pred_train_test.append((y_train_predicted, y_test_predicted))

    return idx_train_test, y_pred_train_test

Then I transform the output of this function into a single dataframe, to ease passing it around to the performance-metric and confidence-interval computation methods. In this second format, for each fold the err_df dataframe contains two columns: one holding the predictions for all samples in that fold (train and test), and one flagging whether each sample was in 'train' or 'test' for that fold.

Here is the associated code extract:

import numpy as np

# NOTE: err_df is assumed to be a pandas DataFrame with one row per sample of X,
# created beforehand; PREDICTED_VAL_COL and DATASETS_COL are column-name prefixes
# defined elsewhere in our code base.

# first run the above to get the details
idx_train_test, y_pred_train_test = run_sklearn_cv_with_details(X, y, cv, model, logger=logger)

folds_columns = [None] * len(idx_train_test)
for fold_ in range(len(idx_train_test)):
    # get the training/test sets for this fold and associated predictions
    idx_train, idx_test = idx_train_test[fold_]
    y_pred_train, y_pred_test = y_pred_train_test[fold_]

    fold_prediction_col = PREDICTED_VAL_COL + "_" + str(fold_)
    fold_datasets_col = DATASETS_COL + "_" + str(fold_)

    # create a prediction column containing all predictions for this fold (train and test)
    err_df[fold_prediction_col] = np.nan
    fold_prediction_col_idx = err_df.columns.get_loc(fold_prediction_col)
    err_df.iloc[idx_train, fold_prediction_col_idx] = y_pred_train
    err_df.iloc[idx_test, fold_prediction_col_idx] = y_pred_test

    # create a dataset column containing 'train' or 'test' whether a sample was one or the other in this fold
    err_df[fold_datasets_col] = ''
    fold_datasets_col_idx = err_df.columns.get_loc(fold_datasets_col)
    err_df.iloc[idx_train, fold_datasets_col_idx] = 'train'
    err_df.iloc[idx_test, fold_datasets_col_idx] = 'test'

    # save the column names for reference
    folds_columns[fold_] = (fold_prediction_col, fold_datasets_col)

I think the issue with the second format is that it does not naturally extend to overlapping splits, though I'm not sure we have any in sklearn as of today. In other words, for a single split or CV fold, a sample can only be picked once, either in 'train' or in 'test', and has only one prediction associated with it. The first format is more low-level: it could naturally extend to overlapping or resampled (bootstrap) splits.

My suggestion would therefore be to use the first format (= two lists of 2-tuples) for the output, as proposed by @adrinjalali, and to provide an optional helper function doing the transform to the second format (= one dataframe) for those users wishing to get a handy dataframe.
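A sketch of what such a helper could look like, assuming pandas and the two-lists-of-2-tuples format returned by run_sklearn_cv_with_details above (function and column names are illustrative):

import numpy as np
import pandas as pd

def cv_details_to_dataframe(idx_train_test, y_pred_train_test, n_samples):
    # one prediction column and one 'train'/'test' flag column per fold
    df = pd.DataFrame(index=np.arange(n_samples))
    folds = zip(idx_train_test, y_pred_train_test)
    for fold, ((idx_train, idx_test), (y_pred_train, y_pred_test)) in enumerate(folds):
        pred_col, set_col = "predicted_%d" % fold, "dataset_%d" % fold
        df[pred_col] = np.nan
        df.loc[idx_train, pred_col] = y_pred_train
        df.loc[idx_test, pred_col] = y_pred_test
        df[set_col] = ""
        df.loc[idx_train, set_col] = "train"
        df.loc[idx_test, set_col] = "test"
    return df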

NicolasHug commented 5 years ago

My 2 cents:

I agree with @thomasjpfan that an easy way to support this is simply to return the CV indices. It is more useful than just returning the predictions, for analysis purposes. As cross_validate is some sort of cross_val_score on steroids, it'd make sense to me to add it there.

okz12 commented 5 years ago

I've made a PR attempting the minimum possible changes to the API while still being able to retrieve CV splits/predictions. In the PR, I simply return the test indices for each CV split from the fitting-and-scoring function back through cross_validate. Keen to hear feedback/thoughts on the approach.

smarie commented 5 years ago

@okz12, can you provide sample code demonstrating how a user would retrieve the predictions for all folds and train/test sets using the proposed API? It does not seem so trivial to me. @thomasjpfan mentioned that users would be able to get them, but it seems that they would need to re-apply the predictors manually, correct?

thomasjpfan commented 5 years ago

but it seems that they would need to re-apply the predictors manually, correct?

@smarie Yes it is a minimal change, giving the user just enough information to recreate the predictions.

okz12 commented 5 years ago

The predictions can be obtained by returning the estimators along with the test indices and calling predict a second time; here's a test case I've added:

import numpy as np
from numpy.testing import assert_array_almost_equal

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

clf = SVC(kernel="linear", random_state=0)
iris = load_iris()
X, y = iris['data'], iris['target']
cv = 5

# return_test_indices is the option added in the PR
ret = cross_validate(clf, X, y, cv=cv, return_estimator=True,
                     return_test_indices=True)
split_test_scores = np.zeros(cv)

# reconstruct the predictions for each cv split using the fitted estimator
for idx, (clf, split_indices) in enumerate(zip(ret['estimator'], ret['test_indices'])):
    split_preds = clf.predict(X[split_indices])
    split_test_scores[idx] = accuracy_score(y[split_indices], split_preds)

assert_array_almost_equal(ret['test_score'], split_test_scores)

Training indices can be obtained by concatenating the remaining test-index arrays, leaving out the one belonging to the current split:

train_indices = []
for cv_split_idx in range(cv):
    train_indices.append(
        np.hstack([index_array for test_idx, index_array
                   in enumerate(ret['test_indices'])
                   if test_idx != cv_split_idx])
    )

smarie commented 5 years ago

OK. I personally think it is a bit sad that we have to apply the predictors again. But maybe there is a good reason? Parallel mode, perhaps?

adrinjalali commented 5 years ago

I know my suggestion is not the norm, but it's not a bad one. Do you really not like it, @thomasjpfan? :P

okz12 commented 5 years ago

So, is there a consensus on whether we should go ahead with a minimal change to the API or include a full set of options on what to return?

thomasjpfan commented 5 years ago

The idea in https://github.com/scikit-learn/scikit-learn/issues/13478#issuecomment-475205957 is okay. Are the train_predictions or test_predictions useful without cv_indices? Are we going to want to extend this to results from predict_proba in the future?

smarie commented 5 years ago

Are the train_predictions or test_predictions useful without cv_indices?

Yes, if the only thing you're interested in is the distribution of the prediction errors. You might even only be interested in the test_predictions.

@adrinjalali the cv_indices would be a list of length n_splits too, but I guess each entry would be a tuple of two arrays containing the indices, no? Just like what I proposed in the first part of https://github.com/scikit-learn/scikit-learn/issues/13478#issuecomment-475602608: for each split, cv_indices.append((idx_train, idx_test)). If not, what are the alternatives?

jnothman commented 5 years ago

I think wanting to get predictions as well as scores is a common use-case, albeit open to misuse. To some extent I'd rather have return_predictions, which would return test-set predictions as well as indices (under a separate key), than return_includes. One reason is that there's no harm in including CV indices when returning predictions, and it might help guide users towards reasonable usage patterns. Another is that some things should be returned without being asked for, and multi-metric scoring naturally expands the set of returned keys. Another is that having return_train_score available is consistent with *SearchCV.
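A sketch of what the result could look like under this return_predictions idea (the keys below are illustrative, not an existing API):

# Hypothetical -- return_predictions is the option proposed in this issue.
results = cross_validate(clf, X, y, cv=5, return_predictions=True)
# results['test_predictions'] -> list of n_splits arrays, one per test fold
# results['test_indices']     -> list of n_splits index arrays, same order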

mostafahadian commented 5 years ago

There's also a workaround I used in order to get the predictions out of cross_validate: I set return_estimator=True in cross_validate, and also passed my own KFold instance (cv=my_kfold) as input. Then I use my own function to predict and aggregate the predictions from the different folds:

import numpy as np


def my_cross_val_predict(X, Y, my_kfold, estimators):
    x_all = np.empty((0, X.shape[1]))
    y_all = np.empty(0)
    y_pred_all = np.empty(0)
    indices_all = np.empty(0, dtype=int)
    for i, (_, test_index) in enumerate(my_kfold.split(X, Y)):
        # estimators come from cross_validate(..., return_estimator=True),
        # in the same order as the splits of my_kfold
        estimator = estimators[i]
        x = X[test_index, :]
        y = Y[test_index]
        y_pred = estimator.predict(x)
        x_all = np.concatenate([x_all, x], axis=0)
        y_all = np.concatenate([y_all, y], axis=0)
        y_pred_all = np.concatenate([y_pred_all, y_pred], axis=0)
        indices_all = np.concatenate([indices_all, test_index], axis=0)

    return x_all, y_all, y_pred_all, indices_all

So by feeding a KFold instance to the cross_validate method, we are able to get all the predictions without re-training our models.
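A usage sketch for the helper above, assuming model, X and Y are already defined: an explicit KFold instance is passed to cross_validate so that the same (deterministic) splits can be replayed afterwards.

from sklearn.model_selection import KFold, cross_validate

my_kfold = KFold(n_splits=5)
ret = cross_validate(model, X, Y, cv=my_kfold, return_estimator=True)
# the returned estimators are in the same order as my_kfold.split(X, Y)
x_all, y_all, y_pred_all, indices_all = my_cross_val_predict(
    X, Y, my_kfold, ret['estimator'])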

amueller commented 5 years ago

I don't really like returning the indices and models. You're still requiring the user to reimplement quite a bit; why wouldn't they just completely reimplement it as @mostafahadian did?

I would favor adding a return_predictions (and possibly predict_method) option to cross_validate, which would provide per-fold predictions. I think the user can be trusted to hstack them.

[after reading the above more carefully: as usual I agree with @jnothman]

jnothman commented 5 years ago

return_predictions is okay by me. I think we can consider allowing it to be a string to avoid an extra predict_method param???
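For illustration, a sketch of what that string form might look like (purely hypothetical, not an existing option):

# Hypothetical: the string selects which prediction method is used.
results = cross_validate(clf, X, y, cv=5, return_predictions='predict_proba')
# results['test_predictions'][i] would then hold the predict_proba output
# for the i-th test fold.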

lucyleeow commented 4 years ago

For the return_predictions implementation suggested by @jnothman (https://github.com/scikit-learn/scikit-learn/issues/13478#issuecomment-494703178), are we okay with the predictions being computed twice (as in #18390)? Once when we calculate the score, and again to return the predictions. Avoiding this would involve complex changes.