rasbt / python-machine-learning-book-3rd-edition

The "Python Machine Learning (3rd edition)" book code repository
https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1789955750/
MIT License
4.6k stars 1.98k forks

Where do the StandardScaler class's 𝜇 and 𝜎 come from: the training dataset or the whole dataset? #135

Closed point6013 closed 4 years ago

point6013 commented 4 years ago

Dear sir, this confuses me a lot. Do we use the training dataset's mean and standard deviation to standardize the test dataset? What if the training and test sets differ a lot? Why not use the whole dataset's 𝜇 and 𝜎 to standardize both the training and the test set? So my question is: does the StandardScaler class use the training set's 𝜇 (sample mean) and 𝜎 (standard deviation), or the whole dataset's 𝜇 and 𝜎?

rasbt commented 4 years ago

Hi there,

Do we use the training dataset's mean and standard deviation to standardize the test dataset?

Yes, this is correct.

What if the training and test sets differ a lot?

That's a good point. First, we assume that the training and test sets are sampled from the same population. This assumption underlies almost all machine learning concepts. In practice, however, it can still be violated. In that case, it becomes even more important to use the training set's mean and standard deviation to scale the test set: the model was fit on features scaled with those parameters, so new data has to be mapped onto the same scale.

Please have a look at this entry, where I tried to make this a bit clearer: https://sebastianraschka.com/faq/docs/scale-training-test.html
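To make the point about diverging training and test sets concrete, here is a minimal NumPy sketch. The feature values are hypothetical and chosen only for illustration; they are an assumption, not data from the book. It compares scaling a shifted test set with the training statistics against scaling it with its own statistics.

```python
import numpy as np

# Hypothetical 1-D feature values, chosen only for illustration.
x_train = np.array([10.0, 20.0, 30.0])
x_test = np.array([5.0, 6.0, 7.0])  # shifted relative to the training data

# Statistics estimated from the training set only (population std, ddof=0).
mu, sigma = x_train.mean(), x_train.std()
train_scaled = (x_train - mu) / sigma

# Test set scaled with the training statistics: the shift stays visible.
test_with_train_stats = (x_test - mu) / sigma

# Test set scaled with its own statistics: the shift is erased.
test_with_own_stats = (x_test - x_test.mean()) / x_test.std()

print("train scaled:           ", train_scaled)
print("test (train statistics):", test_with_train_stats)
print("test (own statistics):  ", test_with_own_stats)
```

With its own statistics, the test set ends up with exactly the same scaled values as the training set, so a model could no longer tell that the test samples are all smaller than anything it saw during training. With the training statistics, every test value falls below the smallest scaled training value, which reflects the raw data.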

so my question is that : the StandardScaler class use the train test's 𝜇 (sample mean) and 𝜎 (standard deviation), or the whole data sets's 𝜇 and 𝜎

It uses the mean and standard deviation of the dataset that was provided via the fit() method. For example, below it would be the training set's 𝜇 and 𝜎 because I call sc.fit(X_train):

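from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)  # estimates 𝜇 and 𝜎 from the training set only
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)  # reuses the training set's 𝜇 and 𝜎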
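One way to check this yourself is to inspect the fitted scaler's mean_ and scale_ attributes. The sketch below is self-contained and uses randomly generated arrays as stand-ins for X_train and X_test (an assumption for illustration, not the book's data). It compares the fitted parameters against statistics computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (assumed data for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

sc = StandardScaler()
sc.fit(X_train)

# Per-feature statistics computed by hand from the training set.
# StandardScaler uses the population standard deviation (ddof=0).
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

matches_train = (np.allclose(sc.mean_, train_mean)
                 and np.allclose(sc.scale_, train_std))

# The same comparison against the combined training + test data.
X_all = np.vstack([X_train, X_test])
matches_all = np.allclose(sc.mean_, X_all.mean(axis=0))

# transform() applies the stored training statistics to whatever it is given.
X_test_scaled = sc.transform(X_test)
manual_test_scaled = (X_test - train_mean) / train_std

print("fitted parameters match training set:", matches_train)
print("fitted parameters match whole dataset:", matches_all)
print("transform(X_test) uses training statistics:",
      np.allclose(X_test_scaled, manual_test_scaled))
```

The scaler never sees X_test during fit(), so its stored parameters can only reflect X_train. Calling fit() on the combined data instead would be the way to get whole-dataset statistics, which is what the answer above advises against.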