Closed ReaBx closed 8 years ago
I would say there is no right answer here. Scaling is just a linear transformation; you can scale your X any way you want. If transform() works better here, then go for it. In my experience fit_transform() usually works better, but it depends on the dataset.
Hm, but don't I have to scale X_test the same way I scaled X_train? I mean taking this example of house prices in Boston, when I look at the test set, don't I want to set it in relation to the training set? Why would I transform the test set to zero mean? What if by chance I have mostly larger houses in the test set? Thanks!
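To make the "mostly larger houses in the test set" concern concrete, here is a minimal sketch with a synthetic one-feature dataset (the numbers are made up for illustration): transform() keeps the test set on the training scale, while fit_transform() re-centers the test set to its own mean and erases the shift.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "house size" feature: test houses are larger on average.
X_train = np.array([[100.], [120.], [140.]])
X_test = np.array([[180.], [200.], [220.]])

scaler = StandardScaler().fit(X_train)

# transform() keeps the test set on the training scale:
# the larger test houses stay well above the training mean.
kept = scaler.transform(X_test)

# fit_transform() re-centers the test set to its own mean,
# hiding the fact that these houses are larger than the training ones.
erased = StandardScaler().fit_transform(X_test)

print(kept.mean())    # positive: test houses sit above the train mean
print(erased.mean())  # ~0 by construction
```

So yes: if the test set happens to contain mostly larger houses, fit_transform() throws that information away.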
It depends highly on your application. There's also batch normalization that you can look into and may suit your particular needs.
Fitting the scaler on evaluation data is a very subtle issue. If you do it, you will indeed get better results, but it will also hide whether your validation sample is different from your training sample. At inference time you don't have a sample-wide mean/standard deviation, which means you will use the one from the training data. And if you picked your model based on evaluation results that are skewed by scaling on its own mean/stddev, you may end up using a non-optimal model.
I would highly recommend using the same setup for evaluation and inference, and thinking of inference time as the moment the model is launched in some service. Then you can rely on your validation/evaluation results being representative.
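The recommended setup above can be sketched as follows: fit the scaler once on the training data, then reuse those stored statistics for evaluation and for single examples at inference time (the data here is synthetic, just to show the wiring).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for train/validation splits.
rng = np.random.default_rng(0)
X_train = rng.normal(50, 10, size=(100, 3))
X_val = rng.normal(55, 10, size=(20, 3))  # deliberately shifted

scaler = StandardScaler().fit(X_train)   # fit once, on training data only

X_train_s = scaler.transform(X_train)    # training
X_val_s = scaler.transform(X_val)        # evaluation: same statistics

# At inference time there is no batch-wide mean/std either, so a single
# incoming example is scaled with the stored training statistics.
x_new = np.array([[52.0, 48.0, 61.0]])
x_new_s = scaler.transform(x_new)
```

Because the same fitted scaler is used everywhere, the validation metrics reflect exactly what the deployed model will see.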
In the DNN Regression example, when scoring the prediction on the test set, shouldn't the scaling be done only with

scaler.transform(X_test)

instead of

scaler.fit_transform(X_test)

? I'm completely new to both sklearn and skflow and am just trying to understand this example. I understand why we scale the training set to zero mean and unit std dev, but wouldn't we want to scale the test set relative to this same mean? scaler.fit_transform(X_test) scales X_test to zero mean and unit std dev again, right? But I do want to put it on the same scale as X_train, don't I?

Also, when I change

score = metrics.mean_squared_error(regressor.predict(scaler.fit_transform(X_test)), y_test)

to

score = metrics.mean_squared_error(regressor.predict(scaler.transform(X_test)), y_test)

the MSE is roughly halved.
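That MSE gap can be reproduced in miniature. This is an illustrative sketch only (synthetic data and a plain LinearRegression in place of the DNN regressor from the example): when the test distribution is shifted, fit_transform(X_test) feeds the model inputs scaled with the wrong statistics, and the score suffers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; the test features are shifted relative to train.
rng = np.random.default_rng(1)
beta = np.array([0.5, -0.3])
X_train = rng.normal(100, 20, size=(200, 2))
y_train = X_train @ beta + rng.normal(0, 1, 200)
X_test = rng.normal(120, 20, size=(50, 2))
y_test = X_test @ beta + rng.normal(0, 1, 50)

scaler = StandardScaler()
model = LinearRegression().fit(scaler.fit_transform(X_train), y_train)

# Score with the training statistics (the correct setup) ...
mse_transform = mean_squared_error(
    y_test, model.predict(scaler.transform(X_test)))

# ... versus re-fitting the scaler on the test set.
mse_refit = mean_squared_error(
    y_test, model.predict(StandardScaler().fit_transform(X_test)))

# transform() gives the lower (and honest) MSE here, because the model's
# inputs are on the same scale it was trained on.
print(mse_transform, mse_refit)
```

The direction matches the observation in the issue: switching from fit_transform to transform on X_test lowers the MSE, because the test inputs now line up with what the regressor saw during training.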