neurodatascience / main-edu-courses-ml


Missing train/test split in exercise 5 #1

Open kousu opened 1 year ago

kousu commented 1 year ago

https://github.com/neurodatascience/main-edu-courses-ml/blob/7d354ee0ccc3d65770e7bb26c782bc0b6097f2dd/ml_model_selection_and_validation/exercises/ex_05_feature_selection_solutions.py#L10-L14

asks "what is wrong with this code?"

The solution claims to use a training set and implies that was the big problem:

https://github.com/neurodatascience/main-edu-courses-ml/blob/7d354ee0ccc3d65770e7bb26c782bc0b6097f2dd/ml_model_selection_and_validation/exercises/ex_05_feature_selection_solutions.py#L16

but doesn't seem to do train-test splitting either:

https://github.com/neurodatascience/main-edu-courses-ml/blob/7d354ee0ccc3d65770e7bb26c782bc0b6097f2dd/ml_model_selection_and_validation/exercises/ex_05_feature_selection_solutions.py#L27-L30

The solution also claims that the plot uses only training data:

https://github.com/neurodatascience/main-edu-courses-ml/blob/7d354ee0ccc3d65770e7bb26c782bc0b6097f2dd/ml_model_selection_and_validation/exercises/ex_05_feature_selection_solutions.py#L32-L39

[Figure_1: the plot produced by the solution]

but @valosekj and I don't understand how that is true. What it's actually plotting is the CV scores generated on the reduced data vs the CV scores on the full set (but trained on the reduced set). So it's just demonstrating the effect of overfitting. That's not the same as a train/test split.
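To make that concrete, here is a rough sketch of the leaky pattern as we read it (our reconstruction, not the exercise's exact code; the make_regression call matches the one used in the exercise):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(noise=10, n_features=5000, random_state=0)

# Selector fitted on ALL samples: information from every sample leaks
# into the reduced matrix before any train/test split happens.
X_reduced = SelectKBest(f_regression).fit_transform(X, y)
# cross_validate splits X_reduced internally, but the leak has already
# happened, so these scores come out optimistic.
scores = cross_validate(Ridge(), X_reduced, y)["test_score"]
print(scores)

The selector is never refitted per fold, so every "test" fold has already influenced the feature choice.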

By the way, the sklearn docs show using train_test_split explicitly with Pipeline (or its make_pipeline shorthand):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

So is this solution wrong?

jeromedockes commented 1 year ago

Hi, thanks for looking at the exercise in more detail and asking your question here!

The data is split into training and testing sets; that is done by the cross_validate function.

For each fold of cross-validation, cross_validate will:

- split the data into a training set and a testing set
- fit the model (here, the whole pipeline) on the training set only
- compute the score on the held-out testing set

So it does the same thing as the example from the scikit-learn documentation you show (using train_test_split), but it does it 5 times (for 5 different splits) instead of once.
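In other words, the loop inside cross_validate looks roughly like this (a simplified sketch: the real function also handles scoring options, timing, and parallelism, and simplified_cross_validate is just an illustrative name):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def simplified_cross_validate(model, X, y, cv=None):
    cv = cv if cv is not None else KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        # fit a fresh clone of the model on the training fold only
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        # score it on the held-out testing fold
        scores.append(fitted.score(X[test_idx], y[test_idx]))
    return np.asarray(scores)

Passing the whole pipeline as model means the SelectKBest inside it is refitted on each training fold, which is exactly what avoids the leak.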

(From the train_test_split documentation, train_test_split is equivalent to:

next(ShuffleSplit().split(X, y))

ShuffleSplit is the same as the default cross-validation iterator used by cross_validate for regression problems, i.e. KFold, except that ShuffleSplit shuffles the samples before each split.)
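That equivalence is easy to check; here is a small illustrative snippet (pinning the same test_size and random_state on both sides so the shuffles match):

import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, test_idx = next(
    ShuffleSplit(test_size=0.25, random_state=0).split(X, y)
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(np.array_equal(X[train_idx], X_train))  # expected: True
print(np.array_equal(X[test_idx], X_test))    # expected: True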

The difference between the first and the second box in the exercise is that in the first case the feature selector has seen the whole data, whereas in the second case it is fitted on the training set of each cross-validation split.

Rewriting the exercise solution with one split instead of 5 (i.e. with train_test_split instead of cross_validate) would look like this:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)

# Question: what is the issue with the code below?

X_reduced = SelectKBest(f_regression).fit_transform(X, y)
X_reduced_train, X_reduced_test, y_train, y_test = train_test_split(
    X_reduced, y, shuffle=False, test_size=0.2
)
ridge = Ridge().fit(X_reduced_train, y_train)
predictions = ridge.predict(X_reduced_test)
score = r2_score(y_test, predictions)
print("feature selection in 'preprocessing':", score)

# Now fitting the whole pipeline on the training set only

# TODO:
# - use `make_pipeline` to create a pipeline chaining a `SelectKBest` and a
#   `Ridge`
# - use `train_test_split`, then fit and score the whole pipeline
#   treated as a single model
# See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
model = "make_pipeline(???)"
score_pipe = None
# TODO_BEGIN
model = make_pipeline(SelectKBest(f_regression), Ridge())
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score_pipe = r2_score(y_test, predictions)
# TODO_END
print("feature selection on train set:", score_pipe)

It prints:

feature selection in 'preprocessing': 0.8752332046109655
feature selection on train set: 0.2836725353159354

As you can see, these correspond to the last (5th) split in the cross_validate results: with shuffle=False and test_size=0.2, the held-out set is the last 20% of the samples, just like the test fold of the last KFold split. Indeed, the original solution (using cross_validate) prints:

feature selection in 'preprocessing': [0.81169757 0.63046326 0.54143034 0.72676923 0.8752332 ]
feature selection on train set: [ 0.11577751  0.0439333  -0.27625968  0.32327364  0.28367254]

(note the last value in each list)
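A quick illustrative check of that correspondence:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

y = np.arange(100)
# test indices of the last (5th) KFold fold
_, last_fold_test = list(KFold(n_splits=5).split(y))[-1]
# test set of an unshuffled 80/20 split
_, y_test = train_test_split(y, shuffle=False, test_size=0.2)
print(np.array_equal(y[last_fold_test], y_test))  # expected: True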

I hope this helps clarify things, but if not, don't hesitate to ask more questions here!

kousu commented 1 year ago

Hello! Thank you for taking the time to write up this answer!

I started working through it for myself and I have some questions and suggestions, but then I got busy with spine scans and whatnot. I'll try to get back to you this weekend though!