Closed btmartin721 closed 2 years ago
Hi @btmartin721, thanks for your message; I'll try to break down the explanation:
In general, the AutoML models are meant to run for longer periods of time than more straightforward methods such as GridSearchCV, since they must both explore and exploit the hyperparameter space. Each iteration (generation) of GASearchCV fits several models (combinations of hyperparameters); the initial number of models per generation is controlled by the population_size parameter, whose default value in the current version is 80. That is, 80 different models are fit in generation 0.
The number of models per generation changes depending on some decisions that the algorithm makes; in this image, for example, we started with 15 models, at generation 1 we already had 29 models, and it stayed around that number afterwards.
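For intuition, this growth is roughly what you would expect from a (mu + lambda)-style evolutionary scheme, where generation 0 evaluates only the initial population and each later generation also evaluates the newly created offspring. A minimal sketch, assuming an offspring count close to the population size (the numbers are illustrative, not the library's exact internals):

```python
def models_per_generation(population_size, offspring, generations):
    # Generation 0 evaluates only the initial population; later
    # generations evaluate the population plus newly created
    # offspring, as in a (mu + lambda) scheme.
    counts = [population_size]
    for _ in range(generations):
        counts.append(population_size + offspring)
    return counts

# Illustrative numbers matching the example above: start with 15
# models, then ~15 + 14 = 29 models per generation once offspring
# are added.
print(models_per_generation(15, 14, 3))  # [15, 29, 29, 29]
```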
There are several algorithms available; some of them may add some extra runtime.
If you want to reduce the time it takes to run the models, you can try:

- Reducing population_size
- Changing the algorithm parameter from eaMuPlusLambda to eaSimple
- Increasing n_jobs if you have CPUs available

If you want to know more details about this, I wrote this article trying to explain some of those points in more depth.
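As a back-of-the-envelope illustration of how much those knobs matter (a sketch with made-up timings, not a benchmark), the total runtime scales roughly with the number of model fits times the cost of each fit, divided by the number of parallel workers:

```python
def estimated_runtime_s(population_size, generations, seconds_per_fit, n_jobs=1):
    # Rough model: every generation fits about population_size models,
    # plus the initial generation 0; fits are spread across n_jobs
    # workers with ideal speedup.
    total_fits = population_size * (generations + 1)
    return total_fits * seconds_per_fit / n_jobs

# Default-ish settings: 80 models/generation, 10 generations, 15 s per fit.
print(estimated_runtime_s(80, 10, 15))            # 13200.0 s (~3.7 h)
# Smaller population and 4 workers cut this dramatically.
print(estimated_runtime_s(20, 10, 15, n_jobs=4))  # 825.0 s (~14 min)
```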
I hope it helps!
Hi @rodrigo-arenas,
Ah ok. That makes sense now. Thank you for your explanation. I will try changing the parameters you mentioned to reduce the runtime.
I greatly appreciate your time!
-Bradley
System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10, in the Visual Studio Code IDE debugger with an Anaconda environment. The issue also occurs on Linux Ubuntu 20.04.
Sklearn-genetic-opt version: 0.8.0
Scikit-learn version: 1.0.2
Python version: 3.8.12
Describe the issue

Hi. I am having an issue with GASearchCV that I am hoping you could help me with. I have a custom Keras/TensorFlow classifier that is wrapped in scikeras's KerasClassifier to make a scikit-learn estimator. With GridSearchCV, it takes roughly 15-16 seconds per parameter permutation, but with GASearchCV it takes about 20.5 minutes per iteration, which is a huge difference. I have also tested it with RandomizedSearchCV and, like GridSearchCV, it also takes ~15 seconds per iteration. Do you have any insight into why GASearchCV might be taking so much longer per iteration? I realize that it might be difficult to pinpoint the issue given that my code spans multiple files, but hopefully I can give enough info below for it to make sense.
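A rough sanity check of these numbers, assuming each GASearchCV generation fits about population_size models (80 by default) with little parallel speedup:

```python
# Rough sanity check: if each fit takes ~15 s and a generation
# evaluates ~80 models (the default population_size) mostly
# sequentially, one generation costs about 20 minutes.
seconds_per_fit = 15.4  # assumed midpoint of the observed 15-16 s range
models_per_generation = 80

minutes_per_generation = models_per_generation * seconds_per_fit / 60
print(round(minutes_per_generation, 1))  # 20.5
```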
To Reproduce

Steps to reproduce the behavior:
The code where I call GASearchCV (modified for clarity):
    # Disable cross-validation due to unsupervised model.
    cross_val = DisabledCV()

    callback = [
        ConsecutiveStopping(generations=self.early_stop_gen, metric="fitness")
    ]

    # Option passed to __init__() in custom class.
    if not self.disable_progressbar:
        callback.append(ProgressBar())

    verbose = False if self.verbose == 0 else True

    # Custom, subclassed KerasClassifier (scikeras) estimator.
    # model_params and compile_params are set earlier in the code.
    clf = MLPClassifier(
        V,
        model_params.pop("y_train"),
        y_true,
        **model_params,
        optimizer=compile_params["optimizer"],
        optimizer__learning_rate=compile_params["learning_rate"],
        loss=compile_params["loss"],
        metrics=compile_params["metrics"],
        epochs=fit_params["epochs"],
        phase=None,
        callbacks=fit_params["callbacks"],
        validation_split=fit_params["validation_split"],
        verbose=0,
    )

    # Custom scoring metrics.
    all_scoring_metrics = [
        "precision_recall_macro",
        "precision_recall_micro",
        "auc_macro",
        "auc_micro",
        "accuracy",
    ]

    # Make multi-metric scorers from custom metrics.
    scoring = self.nn_.make_multimetric_scorer(
        all_scoring_metrics, self.sim_missingmask
    )

    # Set in pg_sui.py
    grid_params = {
        "learning_rate": Continuous(1e-6, 0.1, distribution="log-uniform"),
        "l2_penalty": Continuous(1e-6, 0.01, distribution="uniform"),
        "n_components": Integer(2, 3),
        "hidden_activation": Categorical(["elu", "relu"]),
    }

    # Code here has been modified for clarity.
    search = GASearchCV(
        estimator=clf,
        cv=cross_val,
        scoring=scoring,
        generations=80,
        param_grid=grid_params,
        n_jobs=4,
        refit="precision_recall_macro",
        verbose=0,
        error_score="raise",
    )

    # Input V contains small, randomly initialized values that get
    # trained and refined via backpropagation. V is a dictionary of
    # reduced-representation 2D arrays of shape (n_samples, n_components).
    # y_true is the actual data, using non-missing values as the target.
    # This trained model is then used to predict (i.e., impute)
    # missing values.
    search.fit(V[self.n_components], y_true, callbacks=callback)