openml-labs / gama

An automated machine learning tool aimed to facilitate AutoML research.
https://openml-labs.github.io/gama/master/
Apache License 2.0

Unable to warm-start AsyncEA #197

Closed leightonvg closed 1 year ago

leightonvg commented 1 year ago

I have tried to warm-start AsyncEA search, but it does not seem to work. Is there perhaps anything I am doing incorrectly?

I ran the following code to warm-start search with 25 individuals.

from gama import GamaClassifier
from gama.search_methods.async_ea import AsyncEA
from gama.postprocessing import EnsemblePostProcessing
from gama.genetic_programming.components.individual import Individual
from gama.configuration.classification import clf_config
from gama.configuration.parser import pset_from_config
from sklearn.pipeline import Pipeline

# function to convert an Individual to a sklearn Pipeline
def ind_to_pipeline(ind: Individual):
    steps = []
    # ind.primitives lists nodes from the estimator down to the data node,
    # so reverse to get the preprocessing steps first
    for i, primitive_node in reversed(list(enumerate(ind.primitives))):
        steps.append((str(i), primitive_node.str_nonrecursive))
    return Pipeline(steps)

primitive_set, _ = pset_from_config(clf_config)
pipelines = [
          "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)",
          "ExtraTreesClassifier(StandardScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.1, min_samples_leaf=1, min_samples_split=9, ExtraTreesClassifier.n_estimators=100)",
          ..........
          "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=8, min_samples_split=15, ExtraTreesClassifier.n_estimators=100)",
          "GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.8500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=6, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)"
]

# converting the individual string representations to actual individuals
warm_starters = []
for pipe in pipelines:
    ind = Individual.from_string(pipe, primitive_set, ind_to_pipeline)
    warm_starters.append(ind)

automl = GamaClassifier(
            scoring="neg_log_loss",
            max_total_time=120,
            store="logs",
            n_jobs=8,
            output_directory="test_store_warm_start2",
            search=AsyncEA(),
            post_processing=EnsemblePostProcessing(time_fraction=0.2, max_models=10000),
            verbosity=1,
        )
automl.fit_from_file("metadatabase_openml18cc/datasets/0.arff", warm_start=warm_starters)

This yields the following:

Error during auto ensemble: division by zero
Traceback (most recent call last):
  File "c:\Users\leigh\miniconda3\envs\thesis_gama\Lib\site-packages\gama\postprocessing\ensemble.py", line 520, in build_fit_ensemble
    ensemble.build_initial_ensemble(10)
  File "c:\Users\leigh\miniconda3\envs\thesis_gama\Lib\site-packages\gama\postprocessing\ensemble.py", line 261, in build_initial_ensemble
    self._ensemble_validation_score()
  File "c:\Users\leigh\miniconda3\envs\thesis_gama\Lib\site-packages\gama\postprocessing\ensemble.py", line 440, in _ensemble_validation_score
    prediction_to_validate = self._averaged_validation_predictions()
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\leigh\miniconda3\envs\thesis_gama\Lib\site-packages\gama\postprocessing\ensemble.py", line 236, in _averaged_validation_predictions
    return weighted_sum_predictions / self._total_model_weights()
           ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero

I also get no results in the evaluation library; the following yields an empty list:

automl._evaluation_library.n_best(n=100, with_pipelines=True)
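
In hindsight (see the resolution below), the likely culprit is the hand-rolled ind_to_pipeline above: PrimitiveNode.str_nonrecursive is a string representation, not an estimator object, so every step in the resulting Pipeline is a string, every evaluation fails, and the library stays empty, leaving the ensemble with zero total weight. A hypothetical check that would surface this:

# Hypothetical sanity check (not in the original report): every Pipeline step
# should be an estimator with a fit method; with ind_to_pipeline above, each
# step is a string, so this assertion fails for every warm-start individual.
for ind in warm_starters:
    for name, step in ind.pipeline.steps:
        assert hasattr(step, "fit"), f"step {name!r} is not an estimator: {step!r}"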

In contrast, when I do not provide any individuals for warm-starting, I do not get the error and I do get evaluations:

from gama import GamaClassifier
from gama.search_methods.async_ea import AsyncEA
from gama.postprocessing import EnsemblePostProcessing

automl = GamaClassifier(
            scoring="neg_log_loss",
            max_total_time=120,
            store="logs",
            n_jobs=8,
            output_directory="test_store_warm_start2",
            search=AsyncEA(),
            post_processing=EnsemblePostProcessing(time_fraction=0.2, max_models=10000),
            verbosity=1,
        )
automl.fit_from_file("metadatabase_openml18cc/datasets/0.arff", warm_start=[])

Is there anything I am doing incorrectly when warm-starting search?

prabhant commented 1 year ago

Looks like an error caused by the ensembling step. Can you try best fit here as well and see if that works?
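
Presumably this refers to GAMA's BestFitPostProcessing, which refits only the single best pipeline instead of building an ensemble. A minimal sketch of the swap, reusing the setup from above (the output directory name is hypothetical):

from gama import GamaClassifier
from gama.search_methods.async_ea import AsyncEA
from gama.postprocessing import BestFitPostProcessing

automl = GamaClassifier(
            scoring="neg_log_loss",
            max_total_time=120,
            store="logs",
            n_jobs=8,
            output_directory="test_store_warm_start_bestfit",
            search=AsyncEA(),
            post_processing=BestFitPostProcessing(),  # instead of EnsemblePostProcessing
            verbosity=1,
        )
automl.fit_from_file("metadatabase_openml18cc/datasets/0.arff", warm_start=warm_starters)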

PGijsbers commented 1 year ago

Thanks for opening the issue. Warm-starting is pretty roundabout right now. I added a quick change which makes it easier to warm-start directly from strings: https://github.com/openml-labs/gama/tree/fix_warm_start Could you please try this (make sure to work with a version of that branch)?

from gama import GamaClassifier
from gama.search_methods.async_ea import AsyncEA
from gama.postprocessing import EnsemblePostProcessing

warm_starters = [
          "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)",
          "ExtraTreesClassifier(StandardScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.1, min_samples_leaf=1, min_samples_split=9, ExtraTreesClassifier.n_estimators=100)",
          ..........  # please add a larger list of individuals again
          "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=8, min_samples_split=15, ExtraTreesClassifier.n_estimators=100)",
          "GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.8500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=6, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)"
]

automl = GamaClassifier(
            scoring="neg_log_loss",
            max_total_time=120,
            store="logs",
            n_jobs=8,
            output_directory="test_store_warm_start2",
            search=AsyncEA(),
            post_processing=EnsemblePostProcessing(time_fraction=0.2, max_models=10000),
            verbosity=1,
        )
automl.fit_from_file("metadatabase_openml18cc/datasets/0.arff", warm_start=warm_starters)

If that doesn't work, or you can't test it, would you mind sharing the full list of individuals as well as the dataset?

leightonvg commented 1 year ago

Hi, thank you for the quick responses and suggestions. I have tested both. @prabhant, unfortunately your suggestion did not seem to help. @PGijsbers, your suggestion fixes my issue. I suppose that in my initial code I created the individuals incorrectly, which caused the issue.

PGijsbers commented 1 year ago

Great to hear that worked! I'll make sure the change makes it into the next release :)

To be fair, it used to be pretty convoluted to re-instantiate individuals correctly, because you needed data from GAMA's object after it completed the preprocessing phase, so doing it externally was pretty hacky in the first place. You got very close, but doing it internally in GAMA makes it much less convoluted.
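
For context, a sketch of what that external route roughly looked like, assuming GAMA's own compiler (gama.genetic_programming.compilers.scikitlearn.compile_individual) is used instead of a hand-rolled conversion; the hacky part is that GAMA internally also binds the fitted preprocessing steps to this compiler after the preprocessing phase, which an external caller cannot easily replicate:

from gama.genetic_programming.components.individual import Individual
from gama.genetic_programming.compilers.scikitlearn import compile_individual
from gama.configuration.classification import clf_config
from gama.configuration.parser import pset_from_config

primitive_set, _ = pset_from_config(clf_config)
# `pipelines` is the list of individual string representations from the first post
warm_starters = [
    Individual.from_string(pipe, primitive_set, compile_individual)
    for pipe in pipelines
]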

leightonvg commented 1 year ago

I have noticed that GAMA uses what are, to me, odd string representations for the individuals. Take the following batch of individuals created by GAMA:

0:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)
1:  ExtraTreesClassifier(StandardScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.1, min_samples_leaf=1, min_samples_split=9, ExtraTreesClassifier.n_estimators=100)
2:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.55, min_samples_leaf=1, min_samples_split=11, ExtraTreesClassifier.n_estimators=100)
3:  ExtraTreesClassifier(RobustScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.3, min_samples_leaf=6, min_samples_split=17, ExtraTreesClassifier.n_estimators=100)
4:  ExtraTreesClassifier(RobustScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.6000000000000001, min_samples_leaf=3, min_samples_split=19, ExtraTreesClassifier.n_estimators=100)
5:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=6, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
6:  RandomForestClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), RandomForestClassifier.bootstrap=True, RandomForestClassifier.criterion='entropy', RandomForestClassifier.max_features=0.6500000000000001, RandomForestClassifier.min_samples_leaf=2, RandomForestClassifier.min_samples_split=11, RandomForestClassifier.n_estimators=100)
7:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.25, min_samples_leaf=6, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
8:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.6500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=2, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9000000000000001)
9:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=7, GradientBoostingClassifier.max_features=0.55, GradientBoostingClassifier.min_samples_leaf=13, GradientBoostingClassifier.min_samples_split=20, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9000000000000001)
10:  ExtraTreesClassifier(MaxAbsScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.45, min_samples_leaf=8, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)
11:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.25, min_samples_leaf=6, min_samples_split=17, ExtraTreesClassifier.n_estimators=100)
12:  ExtraTreesClassifier(PolynomialFeatures(MaxAbsScaler(data), PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.35000000000000003, min_samples_leaf=7, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
13:  GradientBoostingClassifier(MinMaxScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=6, GradientBoostingClassifier.max_features=0.6000000000000001, GradientBoostingClassifier.min_samples_leaf=15, GradientBoostingClassifier.min_samples_split=19, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9500000000000001)
14:  RandomForestClassifier(MaxAbsScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), RandomForestClassifier.bootstrap=False, RandomForestClassifier.criterion='gini', RandomForestClassifier.max_features=0.45, RandomForestClassifier.min_samples_leaf=5, RandomForestClassifier.min_samples_split=14, RandomForestClassifier.n_estimators=100)
15:  ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8500000000000001, min_samples_leaf=2, min_samples_split=5, ExtraTreesClassifier.n_estimators=100)
16:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.55, GradientBoostingClassifier.min_samples_leaf=4, GradientBoostingClassifier.min_samples_split=3, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7000000000000001)
17:  ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.7000000000000001, min_samples_leaf=2, min_samples_split=7, ExtraTreesClassifier.n_estimators=100)
18:  GradientBoostingClassifier(MinMaxScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=10, GradientBoostingClassifier.max_features=0.7000000000000001, GradientBoostingClassifier.min_samples_leaf=18, GradientBoostingClassifier.min_samples_split=10, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.8)
19:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.8, GradientBoostingClassifier.min_samples_leaf=2, GradientBoostingClassifier.min_samples_split=13, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)
20:  GradientBoostingClassifier(MaxAbsScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=6, GradientBoostingClassifier.max_features=0.5, GradientBoostingClassifier.min_samples_leaf=13, GradientBoostingClassifier.min_samples_split=3, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=1.0)
21:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.5, GradientBoostingClassifier.min_samples_leaf=6, GradientBoostingClassifier.min_samples_split=11, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9500000000000001)
22:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=5, min_samples_split=11, ExtraTreesClassifier.n_estimators=100)
23:  ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=8, min_samples_split=15, ExtraTreesClassifier.n_estimators=100)
24:  GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.8500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=6, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)

Note that for some sklearn algorithms the shared hyperparameters are prefixed by their algorithm name, while for others this is not the case. For instance, individual 5 uses no prefix (e.g. min_samples_leaf=6, min_samples_split=2), while individual 6 does (RandomForestClassifier.min_samples_leaf=2, RandomForestClassifier.min_samples_split=11). This difference seems consistent with the way the search space for those algorithms is defined in gama.configuration.classification.

The consequence is that, when warm-starting from individual strings, the string representations may come across as confusing, though of course this does not limit their use. I do not necessarily need a fix for this, but I was wondering whether this behavior is intended?

PGijsbers commented 1 year ago

The idea was that similar hyperparameters of related algorithms could be shared (i.e., they can be exchanged during a crossover operation), but it is an oversight that RandomForest does not make use of this. (Also, we still need to evaluate whether or not this sharing actually benefits optimization.)
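
A condensed sketch of that convention, assuming the config format used by clf_config (the actual config has more algorithms and hyperparameters): a hyperparameter set to an empty list falls back to a shared top-level entry and prints without a prefix, while inline values create an algorithm-specific, prefixed terminal:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

config_sketch = {
    # shared terminals: printed without a prefix, exchangeable during crossover
    "min_samples_split": range(2, 21),
    "min_samples_leaf": range(1, 21),
    ExtraTreesClassifier: {
        "n_estimators": [100],
        "min_samples_split": [],  # empty list: reuse the shared terminal above
        "min_samples_leaf": [],
    },
    RandomForestClassifier: {
        "n_estimators": [100],
        # inline values: algorithm-specific, printed with the class name prefix
        "min_samples_split": range(2, 21),
        "min_samples_leaf": range(1, 21),
    },
}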

leightonvg commented 1 year ago

Closing this Issue since my question has been answered properly.