ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

TuneSearchCV and a Pipeline with feature selection #223

Closed: kienerj closed this issue 3 years ago

kienerj commented 3 years ago

I have a scikit-learn pipeline that I want to use for parameter optimization:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('low_variance', VarianceThreshold(threshold=0)),
    # Custom importance-based feature selector, defined elsewhere in my code.
    ('feature_importance',
     SelectMaxFeaturesFromModel(RandomForestClassifier(), threshold='0.25*median', max_features=100)),
    ('classification', xgb)
])

where xgb is an XGBoost classifier.

When I run this code I get an error:

Check failed: learner_model_param_.num_feature == p_fmat->Info().num_col_ (7 vs. 8) : Number of columns does not match number of features in booster.

Without the pipeline, just the model, it works fine.

It is clear that the pipeline may produce a different number of features for each CV split. I suspect that somewhere in the code the assumption is made that the feature count is constant? Or that, if a feature is removed for a fold in the training set, it stays removed when said fold becomes the test set?

I'm suggesting this because the exact same pipeline works perfectly fine when I run it with Optuna directly, without Ray. So it seems there is some form of optimization going on in ray/tune-sklearn that leads to this problem.
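
To illustrate what I mean, here is a rough sketch (on toy data, using scikit-learn's built-in SelectFromModel as a stand-in for my custom selector) of how an importance-threshold step can keep a different number of features on each training fold:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

for train_idx, _ in KFold(n_splits=5).split(X):
    # Fit the importance-based selector on this training fold only.
    selector = SelectFromModel(RandomForestClassifier(random_state=0),
                               threshold="0.25*median")
    selector.fit(X[train_idx], y[train_idx])
    # How many columns survive depends on this fold's importances,
    # so the count is not guaranteed to be the same for every split.
    print(selector.transform(X[train_idx]).shape[1])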

Any advice or ideas on how to solve this?

Yard1 commented 3 years ago

What arguments are you running TuneSearchCV with?

kienerj commented 3 years ago
self.tune_search = TuneSearchCV(
    clf,
    param_distributions=self.params,
    n_trials=iterations,
    early_stopping=True,  # uses Async HyperBand if set to True
    max_iters=10,
    search_optimization="optuna",
    cv=5,
    scoring=self.scorer,
    mode=self.metric_mode,
)

The scorer in this case is just "accuracy".

And the parameters are:

self.params = {
    "classification__n_estimators": tune.qrandint(self.min_rounds, self.max_rounds, 10),
    "classification__max_depth": tune.randint(self.min_max_depth, self.max_max_depth),
    "classification__min_child_weight": tune.randint(1, 4),
    "classification__subsample": tune.quniform(0.5, 1.0, 0.1),
    "classification__eta": tune.qloguniform(self.min_learning_rate, self.max_learning_rate, 1e-3),
}

Yard1 commented 3 years ago

Does it work with early stopping set to False?

kienerj commented 3 years ago

Actually, it now also works with the same settings as above. Of course, due to the restart and everything, the internal CV splits will be different, which in my opinion is the cause of the issue. Right now I can't reproduce the issue even with early_stopping=True.
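
For what it's worth, here is a rough sketch of one source of run-to-run variation on my side: the RandomForestClassifier inside the selection step is not seeded, so the set of selected columns can change between runs (again with SelectFromModel standing in for my custom selector):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

for run in range(3):
    # No random_state on the forest, same as in my pipeline above.
    selector = SelectFromModel(RandomForestClassifier(),
                               threshold="0.25*median").fit(X, y)
    # The boolean mask of selected columns can differ from run to run.
    print(run, selector.get_support())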

Yard1 commented 3 years ago

Can you set random seeds to some constant value?
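
Something along these lines (a rough sketch built from your snippets above, with XGBClassifier standing in for your xgb, and params / iterations / SelectMaxFeaturesFromModel taken from your own code) should at least pin the CV splits and the estimators; Optuna's sampler may still add some randomness of its own:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tune_sklearn import TuneSearchCV
from xgboost import XGBClassifier

clf = Pipeline([
    ('low_variance', VarianceThreshold(threshold=0)),
    # SelectMaxFeaturesFromModel is your own class, imported from your code.
    ('feature_importance',
     SelectMaxFeaturesFromModel(RandomForestClassifier(random_state=42),
                                threshold='0.25*median', max_features=100)),
    ('classification', XGBClassifier(random_state=42)),
])

tune_search = TuneSearchCV(
    clf,
    param_distributions=params,   # your distributions from above
    n_trials=iterations,
    early_stopping=True,
    max_iters=10,
    search_optimization="optuna",
    # Explicit splitter with a fixed seed instead of cv=5.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)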

kienerj commented 3 years ago

I can, but it seems like a chicken-and-egg problem: first I need a seed that triggers the issue.

Yard1 commented 3 years ago

I'll close this for now, @kienerj. Feel free to reopen if the issue persists.