hyperopt / hyperopt-sklearn

Hyper-parameter optimization for sklearn
hyperopt.github.io/hyperopt-sklearn

Prediction with preprocessing way off after optimization #153

Closed statcom closed 4 years ago

statcom commented 4 years ago

I am using hyperopt-sklearn to optimize hyperparameters for PCA+SVM. The optimization seemed to run fine, with validation error near zero. But when I scored the "validation" and "test" data afterwards, the accuracies were only about 0.1. I know this is wrong because the validation data was part of the training data. I didn't have this problem without preprocessing. What am I doing wrong?

# Imports assumed for this snippet
from hpsklearn import HyperoptEstimator, svc_rbf, pca
from hyperopt import tpe

# Create the estimator object
model = HyperoptEstimator(classifier=svc_rbf("my_svc"),
                          preprocessing=[pca("my_pca")],
                          algo=tpe.suggest, max_evals=10, trial_timeout=180)

# Start TPE optimization
model.fit(feature_train, y_train)

score_train = model.score(feature_train, y_train)
score_validation = model.score(feature_validation, y_validation)
score_test = model.score(feature_test, y_test)

print('---- final model result ----')
print('train acc:', score_train)
print('validation acc:', score_validation)
print('test acc:', score_test)
print('best model:', model.best_model())

bjkomer commented 4 years ago

That is definitely a bug. I was able to reproduce it on my end: calling score was refitting the preprocessing instead of just transforming the data, which you generally wouldn't want (on a smaller validation set this produces a bad/incorrect fit). I made a quick change to disable this behaviour by default (I think it was introduced accidentally during some previous changes), and it fixed the issue on my toy example.
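For anyone curious why refitting the preprocessing tanks the score, the failure mode can be sketched with plain scikit-learn (the data and names below are made up for illustration, not taken from hyperopt-sklearn's internals). If score refits PCA on the scoring data instead of reusing the PCA fitted during training, the classifier receives coordinates in a different, possibly sign-flipped or rotated, component space:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
X_train[:, 0] *= 5.0  # put most of the variance (and the label signal) on one axis
y_train = (X_train[:, 0] > 0).astype(int)
X_val, y_val = X_train[:50], y_train[:50]  # validation taken from the training data

# Fit PCA once on the training data, then train the SVM in that space
pca_fitted = PCA(n_components=2).fit(X_train)
clf = SVC().fit(pca_fitted.transform(X_train), y_train)

# Correct: reuse the PCA fitted on the training data
acc_transform = clf.score(pca_fitted.transform(X_val), y_val)

# Buggy behaviour: refit PCA on the validation set; its components can
# flip sign or rotate, so the SVM sees a different feature space
acc_refit = clf.score(PCA(n_components=2).fit_transform(X_val), y_val)

print('reuse fitted PCA:', acc_transform)
print('refit PCA on val:', acc_refit)
```

Since the validation rows here are a subset of the training rows, reusing the fitted PCA gives near-perfect accuracy, while the refit path can collapse, which matches the symptom in this issue.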

statcom commented 4 years ago

Fast & sweet! I confirmed it's working now. Thanks!