intro-stat-learning / ISLP_labs

Up-to-date version of labs for ISLP
BSD 2-Clause "Simplified" License

Lab Chapter 04 - standardization done before train/test split #21

Closed · Cookiee-monster closed this issue 9 months ago

Cookiee-monster commented 9 months ago

Hi @jonathan-taylor ,

I think any preprocessing such as standardization should be done after the train/test split to avoid data leakage. In Lab 04, in cells 53-57 the whole dataset is first standardized and then split. I suggest the train/test split should be the first step; then fit_transform() should be called on X_train only, and finally scaler.transform() on X_test. This approach avoids leaking information from the test set into the fitted scaler (a sketch of this ordering follows the snippet below).

https://github.com/intro-stat-learning/ISLP_labs/blob/dad0773f3f3bea77f190bb97e79bbfa7cbd52df6/Ch04-classification-lab.ipynb#L2889

(X_train,
 X_test,
 y_train,
 y_test) = train_test_split(np.asarray(feature_std),
                            Purchase,
                            test_size=1000,
                            random_state=0)
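
For concreteness, here is a minimal sketch of the suggested order of operations, using sklearn's StandardScaler on made-up data (the variable names are illustrative, not taken from the lab):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the lab's features and Purchase labels
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 4))
y = rng.integers(0, 2, 5000)

# Split first, so the test set never influences the scaling parameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on training data only
X_test_std = scaler.transform(X_test)        # reuse the training means / stds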

BR Grzegorz

jonathan-taylor commented 9 months ago

Yes, generally this is the right method, though it is unlikely to have a big effect here as (assuming samples are from the same population) the standardization transformations should be quite similar.

We may switch this later, but in this case, we have followed the calculations done in the R version.

The issue of validation is addressed in Ch5 in more detail.
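
A quick sanity check of that claim (a sketch on synthetic data, not from the lab): with i.i.d. samples, the means and scales estimated on the training fold alone are close to those estimated on the full dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = 1.0 + 3.0 * rng.standard_normal((5000, 4))  # nonzero mean and scale

X_train, X_test = train_test_split(X, test_size=1000, random_state=0)

full = StandardScaler().fit(X)              # fit on all rows (the lab's approach)
train_only = StandardScaler().fit(X_train)  # fit on the training fold only

# The two sets of scaling parameters differ only slightly
print(np.abs(full.mean_ - train_only.mean_).max())
print(np.abs(full.scale_ - train_only.scale_).max())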



jonathan-taylor commented 9 months ago

Apologies, my answer above is a little off. The correct way to do things (and the way sklearn does them) is to fit the scaling on the training data and then apply that same scaling to the test data. It's easy to see that this is how things should be done if the transformer were PCA instead of just standardization.

Here's a little example demonstrating this: we compute the test error of a transformer/estimator composition using cross_validate, then repeat the calculation by hand. The MSE is identical...

import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit, cross_validate

n, p = 100, 20

rng = np.random.default_rng(0)
Y = rng.standard_normal(n)
X = rng.standard_normal((n, p))

# A pipeline that projects onto the first 3 principal components,
# then fits a linear model on the scores

pca = PCA(n_components=3)
lm = LinearRegression(fit_intercept=False)
pipe = Pipeline([('pca', pca), ('lm', lm)])

test_split = ShuffleSplit(test_size=20, random_state=0, n_splits=1)

# Work out the test error of our pipeline via cross_validate

cv_results = cross_validate(pipe, X, Y, cv=test_split,
                            scoring='neg_mean_squared_error')
print(-cv_results['test_score'][0], 'using cross_validate')

# Show that cross_validate fits the transformer on train, freezes it,
# and applies it to test

train, test = next(test_split.split(np.arange(n)))

pipe_train = pipe.fit(X[train], Y[train])
Yhat_test = pipe_train.predict(X[test])
print(((Y[test] - Yhat_test)**2).mean(), 'using pipeline')

# Do the same steps by hand, outside the pipeline

pca_train = clone(pca)
pca_train.fit(X[train])
lm_train = clone(lm)
lm_train.fit(pca_train.transform(X[train]), Y[train])

Yhat_test_by_hand = lm_train.predict(pca_train.transform(X[test]))
print(((Y[test] - Yhat_test_by_hand)**2).mean(), 'doing steps by hand')
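
The same equivalence holds with standardization in place of PCA, which is the case from the original issue. Continuing the script above (reusing X, Y, test_split, train, and test, with StandardScaler swapped in as the transformer):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
pipe_std = Pipeline([('scaler', scaler),
                     ('lm', LinearRegression(fit_intercept=False))])

cv_std = cross_validate(pipe_std, X, Y, cv=test_split,
                        scoring='neg_mean_squared_error')
print(-cv_std['test_score'][0], 'using pipeline / cross_validate')

# By hand: fit the scaler on the training fold, freeze it, apply to test
scaler_train = clone(scaler).fit(X[train])
lm_std = LinearRegression(fit_intercept=False)
lm_std.fit(scaler_train.transform(X[train]), Y[train])
Yhat_std = lm_std.predict(scaler_train.transform(X[test]))
print(((Y[test] - Yhat_std)**2).mean(), 'doing steps by hand')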