scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Scorers might need to know about training and testing data #3

Open amueller opened 8 years ago

amueller commented 8 years ago

This is not a PR because I haven't written it yet. It's more of a very loose RFC.

I think scorers might need to be able to distinguish between training and test data. There were more cases, I think, but two obvious ones: R^2 is currently computed using the test-set mean, which seems really odd and breaks for LOO; and when doing cross-validation, the classes present in a fold can change, which can affect things like macro-F1 in weird ways and can also lead to errors with LOO (https://github.com/scikit-learn/scikit-learn/issues/4546).
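
For the R^2 case, a minimal sketch of what goes wrong (the toy data and LinearRegression are just placeholders I picked for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy regression data, only to show the mechanics.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(20)

# R^2 = 1 - SS_res / SS_tot, where SS_tot is taken around the *test-set* mean.
# With LeaveOneOut each test fold holds a single sample, so SS_tot == 0 and the
# per-fold R^2 is undefined (scikit-learn warns and returns a degenerate value,
# depending on the version).
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=LeaveOneOut())
print(scores)
```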

I'm not sure if this is a good enough case yet, but I wanted somewhere to take a note ;)

GaelVaroquaux commented 8 years ago

Thanks for the note!

LOO is really a bad cross-validation strategy [*]. I wonder whether we should design for it to work, or just push even harder for people not to use it.

[*] I had an insight yesterday on a simple reason why: the measurement error of the score on the test set shrinks as 1/sqrt(n_test), as for any unbiased statistic. sqrt climbs very fast at the beginning, so the first test samples reduce the error the most. In that part of the regime, you are better off depleting the train set to benefit from the steep rise.
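
A rough numerical illustration of the sqrt argument, using simulated 0/1 losses rather than a real model:

```python
import numpy as np

# Per-sample 0/1 losses of a hypothetical classifier with 80% accuracy.
rng = np.random.RandomState(0)
per_sample = rng.binomial(1, 0.8, size=(10000, 100))

# The spread of the mean test score shrinks like 1 / sqrt(n_test): the first
# few test samples buy the largest reduction in measurement error.
for n_test in (1, 5, 10, 25, 50, 100):
    fold_means = per_sample[:, :n_test].mean(axis=1)
    print(n_test, fold_means.std(), per_sample.std() / np.sqrt(n_test))
```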

jnothman commented 8 years ago

Is LOO more acceptable when used like some_score(cross_val_predict(X, y, cv=LOO()), y)?

GaelVaroquaux commented 8 years ago

> Is LOO more acceptable when used like some_score(cross_val_predict(X, y, cv=LOO()), y)?

No. I believe that that's actually wrong. You are no longer computing the expectation of the error of the predictive model.

One way of convincing you that you are not computing the same thing is to think of the correlation score: it's quite clear that it can be very different between the two approaches.

To convince you that it's the "wrong" thing, I think the right thought to have in mind is that the cross-val score is the expectation, over the test data, of the prediction error of the model (formula 1 in http://arxiv.org/pdf/1606.05201.pdf). It's actually a double expectation: if l_M is the expectation of the error of the model over test data, the score is E[l_M], where the outer expectation is taken over the data used to train the model. http://projecteuclid.org/download/pdfview_1/euclid.ssu/1268143839 has a good analysis of this, including the classic split of l_M into approximation error and estimation error.
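
Spelled out (my notation, not taken from the linked papers):

```latex
l_M = \mathbb{E}_{(x, y) \sim P}\left[\mathrm{loss}\big(M(x), y\big)\right],
\qquad
\text{CV score} \approx \mathbb{E}_{D_{\mathrm{train}}}\left[\, l_{M(D_{\mathrm{train}})} \,\right]
```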

Using score(cross_val_predict) is not computing that. It's computing the expectation of l_M jointly over the train and test data. Given that the two are not independent, it's not the same thing as the successive expectations.
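
For concreteness, a small sketch contrasting the two computations (the toy data, LinearRegression, KFold and R^2 are my own choices here; the same divergence shows up for correlation-style scores):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X[:, 0] + 0.5 * rng.randn(100)

est = LinearRegression()
cv = KFold(n_splits=5)

# Successive expectations: score each held-out fold, then average over folds.
per_fold = cross_val_score(est, X, y, scoring="r2", cv=cv)
print("mean of per-fold R^2:", per_fold.mean())

# score(cross_val_predict): pool the out-of-fold predictions, score them once.
pooled = cross_val_predict(est, X, y, cv=cv)
print("R^2 of pooled predictions:", r2_score(y, pooled))
```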

Actually, now that I realize it, "cross_val_predict" is probably used massively to compute things that shouldn't be computed.

jnothman commented 8 years ago

Thanks for the response.

Yes, the case of correlation (or ROC, or anything where the output over samples is compared) is convincing, but it's not immediately obvious that the issue extends to sample-wise measures.

I'm a bit weak on this theory, but I think I get the picture. I hope I find time to read Arlot and Celisse to solidify it.

And while the proposed intention of cross_val_predict was visualisation, you're probably right that it's licensing some invalid conclusions. :/

amueller commented 8 years ago

So the thing is that R^2, our default regression metric, is not a sample-wise measure. Also, for ROC curves (and AUC and average precision) there is an issue with interpolation, which should be done using the training set or a validation set. Actually, I'm currently not sure what the right way to compute AUC is.
