Looks like an error caused by the ensembling; can you try best-fit post-processing here as well and see whether that works?
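For reference, a minimal sketch of that swap, assuming the rest of the setup from the snippet further below stays unchanged (BestFitPostProcessing keeps only the single best pipeline instead of building an ensemble):

from gama import GamaClassifier
from gama.postprocessing import BestFitPostProcessing

# Same configuration as below, but post-process with the single best
# pipeline rather than an ensemble.
automl = GamaClassifier(
    scoring="neg_log_loss",
    max_total_time=120,
    post_processing=BestFitPostProcessing(),
)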
Thanks for opening the issue. Warm-starting is pretty roundabout right now. I added a quick change here which makes it easier to warm-start directly from strings: https://github.com/openml-labs/gama/tree/fix_warm_start Could you please try this (make sure to install a version from that branch)?
from gama import GamaClassifier
from gama.search_methods.async_ea import AsyncEA
from gama.postprocessing import EnsemblePostProcessing

warm_starters = [
    "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)",
    "ExtraTreesClassifier(StandardScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.1, min_samples_leaf=1, min_samples_split=9, ExtraTreesClassifier.n_estimators=100)",
    # ... please add the larger list of individuals here again
    "ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=8, min_samples_split=15, ExtraTreesClassifier.n_estimators=100)",
    "GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.8500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=6, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)",
]

automl = GamaClassifier(
    scoring="neg_log_loss",
    max_total_time=120,
    store="logs",
    n_jobs=8,
    output_directory="test_store_warm_start2",
    search=AsyncEA(),
    post_processing=EnsemblePostProcessing(time_fraction=0.2, max_models=10000),
    verbosity=1,
)

# Warm-start directly from the individuals' string representations.
automl.fit_from_file("metadatabase_openml18cc/datasets/0.arff", warm_start=warm_starters)
If that doesn't work, or if you can't test it, would you mind sharing the full list of individuals as well as the datasets?
Hi, thank you for the quick responses and suggestions; I have tested both. @prabhant, unfortunately your suggestion did not seem to help. @PGijsbers, your suggestion fixes my issue. I suppose that in my initial code I created the individuals incorrectly, which caused the problem.
Great to hear that worked! I'll make sure the change makes it into the next release :)
To be fair, it used to be pretty convoluted to re-instantiate individuals correctly, because you needed data from GAMA's object after it had completed the preprocessing phase, so doing that externally was pretty hacky in the first place. You got very close, but doing it inside GAMA makes it much simpler.
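Roughly, that external route looked like the sketch below, based on the imports from the original snippet; the exact signatures are assumptions, not a reference:

# Rough sketch of the old, external way to rebuild individuals from strings.
# Signatures here are assumptions based on the imports in the original code;
# the warm_start-from-strings route above supersedes this.
from gama.genetic_programming.components.individual import Individual
from gama.configuration.classification import clf_config
from gama.configuration.parser import pset_from_config

pset, _ = pset_from_config(clf_config)  # primitive set built from the search space config
individual = Individual.from_string(
    "GaussianNB(data)",  # any individual string in GAMA's representation
    pset,
    None,  # GAMA normally supplies its own pipeline compiler here
)

The catch is that a primitive set built this way can differ from the one GAMA constructs internally after preprocessing, which is exactly why doing it inside GAMA is more robust.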
I have noticed that GAMA uses what are, to me, odd string representations for individuals. Take the following batch of individuals created by GAMA:
0: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)
1: ExtraTreesClassifier(StandardScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.1, min_samples_leaf=1, min_samples_split=9, ExtraTreesClassifier.n_estimators=100)
2: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.55, min_samples_leaf=1, min_samples_split=11, ExtraTreesClassifier.n_estimators=100)
3: ExtraTreesClassifier(RobustScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.3, min_samples_leaf=6, min_samples_split=17, ExtraTreesClassifier.n_estimators=100)
4: ExtraTreesClassifier(RobustScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.6000000000000001, min_samples_leaf=3, min_samples_split=19, ExtraTreesClassifier.n_estimators=100)
5: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.15000000000000002, min_samples_leaf=6, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
6: RandomForestClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), RandomForestClassifier.bootstrap=True, RandomForestClassifier.criterion='entropy', RandomForestClassifier.max_features=0.6500000000000001, RandomForestClassifier.min_samples_leaf=2, RandomForestClassifier.min_samples_split=11, RandomForestClassifier.n_estimators=100)
7: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.25, min_samples_leaf=6, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
8: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.6500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=2, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9000000000000001)
9: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=7, GradientBoostingClassifier.max_features=0.55, GradientBoostingClassifier.min_samples_leaf=13, GradientBoostingClassifier.min_samples_split=20, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9000000000000001)
10: ExtraTreesClassifier(MaxAbsScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.45, min_samples_leaf=8, min_samples_split=13, ExtraTreesClassifier.n_estimators=100)
11: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.25, min_samples_leaf=6, min_samples_split=17, ExtraTreesClassifier.n_estimators=100)
12: ExtraTreesClassifier(PolynomialFeatures(MaxAbsScaler(data), PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.35000000000000003, min_samples_leaf=7, min_samples_split=2, ExtraTreesClassifier.n_estimators=100)
13: GradientBoostingClassifier(MinMaxScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=6, GradientBoostingClassifier.max_features=0.6000000000000001, GradientBoostingClassifier.min_samples_leaf=15, GradientBoostingClassifier.min_samples_split=19, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9500000000000001)
14: RandomForestClassifier(MaxAbsScaler(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False)), RandomForestClassifier.bootstrap=False, RandomForestClassifier.criterion='gini', RandomForestClassifier.max_features=0.45, RandomForestClassifier.min_samples_leaf=5, RandomForestClassifier.min_samples_split=14, RandomForestClassifier.n_estimators=100)
15: ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8500000000000001, min_samples_leaf=2, min_samples_split=5, ExtraTreesClassifier.n_estimators=100)
16: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.55, GradientBoostingClassifier.min_samples_leaf=4, GradientBoostingClassifier.min_samples_split=3, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7000000000000001)
17: ExtraTreesClassifier(data, ExtraTreesClassifier.bootstrap=True, ExtraTreesClassifier.criterion='entropy', ExtraTreesClassifier.max_features=0.7000000000000001, min_samples_leaf=2, min_samples_split=7, ExtraTreesClassifier.n_estimators=100)
18: GradientBoostingClassifier(MinMaxScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=10, GradientBoostingClassifier.max_features=0.7000000000000001, GradientBoostingClassifier.min_samples_leaf=18, GradientBoostingClassifier.min_samples_split=10, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.8)
19: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.8, GradientBoostingClassifier.min_samples_leaf=2, GradientBoostingClassifier.min_samples_split=13, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)
20: GradientBoostingClassifier(MaxAbsScaler(data), GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=6, GradientBoostingClassifier.max_features=0.5, GradientBoostingClassifier.min_samples_leaf=13, GradientBoostingClassifier.min_samples_split=3, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=1.0)
21: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=8, GradientBoostingClassifier.max_features=0.5, GradientBoostingClassifier.min_samples_leaf=6, GradientBoostingClassifier.min_samples_split=11, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.9500000000000001)
22: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=5, min_samples_split=11, ExtraTreesClassifier.n_estimators=100)
23: ExtraTreesClassifier(PolynomialFeatures(data, PolynomialFeatures.degree=2, PolynomialFeatures.include_bias=False, PolynomialFeatures.interaction_only=False), ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.max_features=0.8, min_samples_leaf=8, min_samples_split=15, ExtraTreesClassifier.n_estimators=100)
24: GradientBoostingClassifier(data, GradientBoostingClassifier.learning_rate=0.1, GradientBoostingClassifier.max_depth=9, GradientBoostingClassifier.max_features=0.8500000000000001, GradientBoostingClassifier.min_samples_leaf=9, GradientBoostingClassifier.min_samples_split=6, GradientBoostingClassifier.n_estimators=100, GradientBoostingClassifier.subsample=0.7500000000000001)
Note that for some sklearn algorithms the shared hyperparameters are prefixed with the algorithm name, while for others they are not. For instance, individual 5 uses no prefix (e.g. min_samples_leaf=6, min_samples_split=2), while individual 6 does (RandomForestClassifier.min_samples_leaf=2, RandomForestClassifier.min_samples_split=11). This difference seems consistent with how the search space for those algorithms is defined in gama.configuration.classification.
The consequence is that, when warm-starting from individual strings, these string representations may come across as confusing, though of course this does not limit their use. I do not necessarily need a fix for this, but I was wondering whether this behavior is intended?
The idea was that similar hyperparameters of related algorithms may be shared (i.e., they can be exchanged during a crossover operation), but it is an oversight that RandomForest does not make use of this. (Also, we still need to evaluate whether this actually benefits optimization.)
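To illustrate the convention (a hedged sketch in the style of gama.configuration.classification, with illustrative values rather than the real search space): hyperparameters defined at the top level of the config become shared, unprefixed terminals that an algorithm can opt into with an empty list, while hyperparameters defined inside an algorithm's own entry get names prefixed with that algorithm:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

clf_config = {
    # Top-level entries become shared terminals with unprefixed names,
    # which is why individual 5 prints "min_samples_leaf=6".
    "min_samples_split": range(2, 21),
    "min_samples_leaf": range(1, 21),
    ExtraTreesClassifier: {
        "n_estimators": [100],
        "criterion": ["gini", "entropy"],
        "min_samples_split": [],  # empty list: opt into the shared terminal
        "min_samples_leaf": [],
        "bootstrap": [True, False],
    },
    RandomForestClassifier: {
        "n_estimators": [100],
        "criterion": ["gini", "entropy"],
        # Declared locally, so these print with a prefix, e.g.
        # "RandomForestClassifier.min_samples_split=11".
        "min_samples_split": range(2, 21),
        "min_samples_leaf": range(1, 21),
        "bootstrap": [True, False],
    },
}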
Closing this issue since my question has been answered.
I have tried to warm-start AsyncEA search, but it does not seem to work. Is there perhaps anything I am doing incorrectly?
I ran the following code to warm-start search with 25 individuals.
This yields the following:
I also get no results in the evaluation library, because the following yields an empty list:
In contrast, when I do not provide any individuals for warm-starting, I do not get the error, and I do get evaluations.
Is there anything I am doing incorrectly when warm-starting search?