Closed btmartin721 closed 2 years ago
Hi @btmartin721, thanks for your message; I'll try to break down the explanation:
In general, the AutoML models are meant to run for longer periods of time than more straightforward methods such as GridSearchCV, since they must both explore and exploit the hyperparameter space. Each iteration (generation) of GASearchCV fits several models (combinations of hyperparameters); the initial number of models per generation is controlled by the population_size parameter, whose default value in the current version is 80. That is, 80 different models are fit in generation 0.
The number of models per generation changes depending on some decisions that the algorithm makes; in this image, for example, we started with 15 models, at generation 1 we already had 29 models, and it stayed around that number afterwards.
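For intuition, this growth is roughly what you would expect from a (mu + lambda)-style evolutionary scheme, where generation 0 evaluates only the initial population and each later generation also evaluates the newly created offspring. A minimal sketch, assuming an offspring count close to the population size (the numbers are illustrative, not the library's exact internals):

```python
def models_per_generation(population_size, offspring, generations):
    # Generation 0 evaluates only the initial population; later
    # generations evaluate the population plus newly created
    # offspring, as in a (mu + lambda) scheme.
    counts = [population_size]
    for _ in range(generations):
        counts.append(population_size + offspring)
    return counts

# Illustrative numbers matching the example above: start with 15
# models, then ~15 + 14 = 29 models per generation once offspring
# are added.
print(models_per_generation(15, 14, 3))  # [15, 29, 29, 29]
```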
There are several algorithms available; some of them may add some extra runtime.
If you want to reduce the time it takes to run the models, you can try:

- Reducing population_size
- Changing the algorithm parameter from eaMuPlusLambda to eaSimple
- Increasing n_jobs if you have CPUs available

If you want to know more details about this, I wrote this article trying to explain some of those points in more depth.
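As a back-of-the-envelope illustration of how much those knobs matter (a sketch with made-up timings, not a benchmark), the total runtime scales roughly with the number of model fits times the cost of each fit, divided by the number of parallel workers:

```python
def estimated_runtime_s(population_size, generations, seconds_per_fit, n_jobs=1):
    # Rough model: every generation fits about population_size models,
    # plus the initial generation 0; fits are spread across n_jobs
    # workers with ideal speedup.
    total_fits = population_size * (generations + 1)
    return total_fits * seconds_per_fit / n_jobs

# Default-ish settings: 80 models/generation, 10 generations, 15 s per fit.
print(estimated_runtime_s(80, 10, 15))            # 13200.0 s (~3.7 h)
# Smaller population and 4 workers cut this dramatically.
print(estimated_runtime_s(20, 10, 15, n_jobs=4))  # 825.0 s (~14 min)
```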
I hope it helps!
Hi @rodrigo-arenas,
Ah ok. That makes sense now. Thank you for your explanation. I will try changing the parameters you mentioned to reduce the runtime.
I greatly appreciate your time!
-Bradley
System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10, in the Visual Studio Code IDE debugger with an Anaconda environment. The issue also occurs on Linux Ubuntu 20.04.
Sklearn-genetic-opt version: 0.8.0
Scikit-learn version: 1.0.2
Python version: 3.8.12
Describe the issue

Hi. I am having an issue with GASearchCV that I am hoping you could help me with. I have a custom Keras/TensorFlow classifier that is wrapped in scikeras's KerasClassifier to make a scikit-learn estimator. With GridSearchCV, it takes roughly 15-16 seconds per parameter permutation, but with GASearchCV it takes about 20.5 minutes per iteration, which is a huge difference. I have also tested it with RandomizedSearchCV and, like GridSearchCV, it also takes ~15 seconds per iteration. Do you have any insight into why GASearchCV might be taking so much longer per iteration? I realize that it might be difficult to pinpoint the issue given that my code spans multiple files, but hopefully I can give enough info below for it to make sense.
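A rough sanity check of these numbers, assuming each GASearchCV generation fits about population_size models (80 by default) with little parallel speedup:

```python
# Rough sanity check: if each fit takes ~15 s and a generation
# evaluates ~80 models (the default population_size) mostly
# sequentially, one generation costs about 20 minutes.
seconds_per_fit = 15.4  # assumed midpoint of the observed 15-16 s range
models_per_generation = 80

minutes_per_generation = models_per_generation * seconds_per_fit / 60
print(round(minutes_per_generation, 1))  # 20.5
```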
To Reproduce

Steps to reproduce the behavior:
The code where I call GASearchCV (modified for clarity):
    # Disable cross-validation due to unsupervised model.
    cross_val = DisabledCV()

    callback = [
        ConsecutiveStopping(generations=self.early_stop_gen, metric="fitness")
    ]

    # Option passed to __init__() in custom class.
    if not self.disable_progressbar:
        callback.append(ProgressBar())

    verbose = False if self.verbose == 0 else True

    # Custom, subclassed KerasClassifier (scikeras) estimator.
    # model_params and compile_params are set earlier in the code.
    clf = MLPClassifier(
        V,
        model_params.pop("y_train"),
        y_true,
        **model_params,
        optimizer=compile_params["optimizer"],
        optimizer__learning_rate=compile_params["learning_rate"],
        loss=compile_params["loss"],
        metrics=compile_params["metrics"],
        epochs=fit_params["epochs"],
        phase=None,
        callbacks=fit_params["callbacks"],
        validation_split=fit_params["validation_split"],
        verbose=0,
    )

    # Custom scoring metrics.
    all_scoring_metrics = [
        "precision_recall_macro",
        "precision_recall_micro",
        "auc_macro",
        "auc_micro",
        "accuracy",
    ]

    # Make multi-metric scorers from custom metrics.
    scoring = self.nn_.make_multimetric_scorer(
        all_scoring_metrics, self.sim_missingmask
    )

    # Set in pg_sui.py
    grid_params = {
        "learning_rate": Continuous(1e-6, 0.1, distribution="log-uniform"),
        "l2_penalty": Continuous(1e-6, 0.01, distribution="uniform"),
        "n_components": Integer(2, 3),
        "hidden_activation": Categorical(["elu", "relu"]),
    }

    # Code here has been modified for clarity.
    search = GASearchCV(
        estimator=clf,
        cv=cross_val,
        scoring=scoring,
        generations=80,
        param_grid=grid_params,
        n_jobs=4,
        refit="precision_recall_macro",
        verbose=0,
        error_score="raise",
    )

    # Input V contains small, randomly initialized values that get
    # trained and refined via backpropagation. V is a dictionary of
    # reduced-representation 2D arrays of shape (n_samples, n_components).
    # y_true is the actual data, using non-missing values as the target.
    # This trained model is then used to predict (i.e., impute)
    # missing values.
    search.fit(V[self.n_components], y_true, callbacks=callback)