Yes, in general that is the right approach, though it is unlikely to have a big effect here: assuming the train and test samples come from the same population, the two standardization transformations should be quite similar.
We may switch this later, but in this case we have followed the calculations done in the R version of the lab.
The issue of validation is addressed in more detail in Chapter 5.
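To illustrate the "quite similar" point, here is a small sketch (illustrative only, not the lab's code, with made-up data): when all samples come from one population, a scaler fit on the full data and one fit on a training subset estimate nearly identical means and scales.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 5))             # one population for train and test

scaler_full = StandardScaler().fit(X)          # fit on everything, as the lab currently does
scaler_train = StandardScaler().fit(X[:4000])  # fit on a training subset only

# Largest discrepancies in the estimated means and scales are small
print(np.abs(scaler_full.mean_ - scaler_train.mean_).max())
print(np.abs(scaler_full.scale_ - scaler_train.scale_).max())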
Apologies, my answer is a little off. The correct way to do things (and how sklearn does them) is to fit the scaling on the training data and then apply that same scaling to the test data. It's easy to see that this is how things should be done if the transformer were PCA instead of just standardization.

Here's a little example demonstrating this: we compute the test error of a transformer / estimator composition using cross_validate, then do the same steps by hand. The MSE is identical.
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit, cross_validate
n, p = 100, 20
rng = np.random.default_rng(0)
Y = rng.standard_normal(n)
X = rng.standard_normal((n, p))
# A pipeline that takes the first 3 principal components, then fits a linear model
pca = PCA(n_components=3)
lm = LinearRegression(fit_intercept=False)
pipe = Pipeline([('pca', pca), ('lm', lm)])
test_split = ShuffleSplit(test_size=20, random_state=0, n_splits=1)
# Work out test error of our pipeline via cross_validate
cv_res = cross_validate(pipe, X, Y, cv=test_split,
                        scoring='neg_mean_squared_error')
print(-cv_res['test_score'][0], 'using cross_validate')
# Show that cross_validate fits the transform on the training split,
# freezes it, and applies it to the test split
train, test = next(test_split.split(np.arange(n)))
pipe_train = pipe.fit(X[train], Y[train])
Yhat_test = pipe_train.predict(X[test])
print(((Y[test] - Yhat_test)**2).mean(), 'using pipeline')
# Do the pieces individually as well: fit PCA and the linear model
# on the training split only
pca_train = clone(pca)
pca_train.fit(X[train])
lm_train = clone(lm)
lm_train.fit(pca_train.transform(X[train]), Y[train])
# Apply the frozen training-set transform when predicting on the test split
Yhat_test_by_hand = lm_train.predict(pca_train.transform(X[test]))
print(((Y[test] - Yhat_test_by_hand)**2).mean(), 'doing steps by hand')
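The same check works with standardization in place of PCA. A short follow-on sketch (reusing X, Y, and test_split from the example above; the scale_pipe name is just for illustration):

from sklearn.preprocessing import StandardScaler

# Same pattern: the scaler is fit on the training fold only, then the
# frozen means/scales are applied to the test fold
scale_pipe = Pipeline([('scaler', StandardScaler()),
                       ('lm', LinearRegression())])
res = cross_validate(scale_pipe, X, Y, cv=test_split,
                     scoring='neg_mean_squared_error')
print(-res['test_score'][0], 'standardization pipeline, using cross_validate')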
Hi @jonathan-taylor,
I think any preprocessing like standardization should be done after the train/test split to avoid data leakage. In Lab 04, in cells 53-57 the whole dataset is first standardized and then split. I suggest the train/test split should be done as a first step. Then a standardization fit_transform() should take place on X_train only, and finally scaler.transform() on X_test. This approach avoids the data leakage.
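A minimal sketch of the suggested order (X and y here are hypothetical, standing in for the lab's raw features and response; StandardScaler is sklearn's standardization transformer):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split first, so the test set plays no role in choosing the scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit means/scales on training data only
X_test_std = scaler.transform(X_test)        # reuse the training means/scales on test data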
For reference, the cell in question:
https://github.com/intro-stat-learning/ISLP_labs/blob/dad0773f3f3bea77f190bb97e79bbfa7cbd52df6/Ch04-classification-lab.ipynb#L2889

(X_train, X_test, y_train, y_test) = train_test_split(np.asarray(feature_std), Purchase, test_size=1000, random_state=0)
BR Grzegorz