h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.91k stars 2k forks source link

Same estimators are not modeled when training in AutoML #15850

Open feihongloveworld opened 1 year ago

feihongloveworld commented 1 year ago

I specified ["GLM", "DeepLearning", "DRF", "GBM", "XGBoost", "StackedEnsemble"] and expected every estimator to be modeled, but some estimators are missing from the AutoML result.


tomasfryda commented 1 year ago

@feihongloveworld I'm not sure I understand the issue. It looks like everything works correctly. One potentially confusing thing is the XRT_1_..., which is a type of DRF with a different split rule.

Would you be able to rephrase the question/issue with more details?

feihongloveworld commented 1 year ago

> @feihongloveworld I'm not sure I understand the issue. It looks like everything works correctly. One potentially confusing thing is the XRT_1_..., which is a type of DRF with a different split rule.
>
> Would you be able to rephrase the question/issue with more details?


My expectation was that with max_models=12 I would get two models for each of the five algorithms, plus another two ensembles. The result was not what I expected: some algorithms were modeled more than twice and some were not modeled at all. So I was wondering how to ensure that each algorithm gets modeled, and how does AutoML select the final algorithm?

Thank you for your time, and please forgive my English.

tomasfryda commented 1 year ago

Thank you @feihongloveworld ! Now I understand the question.

The AutoML has 2 basic modes that influence which models are run and in what order:

- max_models: train a fixed number of models.
- max_runtime_secs: train as many models as fit into a given time budget.

The max_models mode has the advantage that you know how many models will be trained and that the models won't be stopped prematurely (before converging). The max_runtime_secs mode tries to fill the available time the best way possible, but it can mean that training of some models is stopped sooner than would be ideal.
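The difference between the two constraints can be sketched as a toy loop (this is only an illustration of the stopping behavior, not the actual h2o-3 scheduling logic; all names here are hypothetical):

```python
import time

def run_automl(candidates, max_models=None, max_runtime_secs=None):
    """Toy sketch of AutoML's two stopping modes (not the real h2o-3 logic)."""
    trained, start = [], time.monotonic()
    for model in candidates:
        if max_models is not None and len(trained) >= max_models:
            break  # fixed budget of fully trained models
        if max_runtime_secs is not None and time.monotonic() - start >= max_runtime_secs:
            break  # time budget exhausted; remaining models are skipped
        trained.append(model)
    return trained

# With max_models, the number of trained models is predictable
# regardless of wall-clock time.
print(run_automl(["GLM", "DRF", "GBM", "XGBoost"], max_models=2))  # ['GLM', 'DRF']
```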

Since you use the max_models constraint, I will limit this explanation to that mode.

First we train some models with predefined hyper-parameters that tend to behave well on various problems, and then we run hyper-parameter search (I think you would have to set max_models to a higher number, like 15 or so, to get to the hyper-parameter-search step).
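The two-phase behavior described above can be illustrated with a small sketch, assuming 12 predefined base models as in the default plan (again, an illustration, not h2o-3 internals):

```python
def plan_steps(max_models, n_predefined):
    """Toy sketch: predefined models are trained first; hyper-parameter
    search only starts once the predefined steps are exhausted."""
    steps = []
    for i in range(max_models):
        steps.append("predefined" if i < n_predefined else "grid_search")
    return steps

# With 12 predefined base models, max_models=12 never reaches the
# search phase; a larger budget does.
print(plan_steps(12, 12).count("grid_search"))  # 0
print(plan_steps(15, 12).count("grid_search"))  # 3
```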

We call this a modeling plan. The plan refers to the individual algorithms that we support. The predefined hyper-parameters can be seen here:

If you use max_models=12, you should get all the different algorithms modeled. There are 3 XGBoosts, 1 GLM, 5 GBMs, 2 DRFs (DRF and XRT), 1 DeepLearning, and 2 StackedEnsembles. If you specify different parameters, you can end up with different numbers (e.g. monotonic constraints will create one more StackedEnsemble containing only the models with monotonic constraints).
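As a quick sanity check of those counts: the base models add up to exactly the max_models budget, and the StackedEnsembles are built on top of them:

```python
# Model counts for the default modeling plan with max_models=12,
# as listed above (StackedEnsembles are built on top of the base models).
base_models = {"XGBoost": 3, "GLM": 1, "GBM": 5, "DRF": 2, "DeepLearning": 1}
stacked_ensembles = 2

assert sum(base_models.values()) == 12  # exactly the max_models budget
print(sum(base_models.values()) + stacked_ensembles)  # 14 entries on the leaderboard
```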

You can also create your own modeling plan. It's a bit more involved but it enables you to specify exactly what you want. See the test for more details.
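A custom modeling plan is, roughly, a list where each entry names an algorithm and optionally picks specific steps; the sketch below shows a hypothetical shape of such a plan as plain Python data (check the linked test for the exact format h2o-3 accepts; the step ids here are illustrative):

```python
# Hypothetical sketch of a custom modeling plan -- entries can name an
# algorithm, pick specific predefined steps, or use the defaults.
modeling_plan = [
    dict(name="XGBoost", steps=[dict(id="def_1")]),  # one predefined XGBoost
    ("DRF", ["def_1", "XRT"]),                       # both DRF variants
    "GLM",                                           # all default GLM steps
]

# Extract just the algorithm names, whatever form each entry takes.
names = [e if isinstance(e, str)
         else e[0] if isinstance(e, tuple)
         else e["name"]
         for e in modeling_plan]
print(names)  # ['XGBoost', 'DRF', 'GLM']
```

The plan would then be passed to the AutoML constructor via its modeling_plan parameter.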

How do we select the final algorithm?

If you specify leaderboard_frame, we will use it to compute the leaderboard scores, and the algorithm winning on those scores becomes the final model. You can use sort_metric to select which metric is used for that selection (defaults are AUC for binary classification, mean_per_class_error for multinomial classification, and RMSE for regression).
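Conceptually, leader selection is just a sort of the leaderboard by the chosen metric, keeping in mind that some metrics are higher-is-better and others lower-is-better (a toy sketch with made-up scores, not the h2o-3 implementation):

```python
def pick_leader(scores, metric="auc"):
    """Toy sketch of leader selection: best model under the chosen metric.
    AUC is higher-is-better; RMSE and mean_per_class_error are lower-is-better."""
    higher_is_better = metric in ("auc", "aucpr")
    key = lambda m: m[metric]
    return max(scores, key=key) if higher_is_better else min(scores, key=key)

scores = [
    {"model": "GBM_1", "auc": 0.91, "rmse": 0.30},
    {"model": "XGBoost_1", "auc": 0.93, "rmse": 0.28},
]
print(pick_leader(scores, "auc")["model"])   # XGBoost_1
print(pick_leader(scores, "rmse")["model"])  # XGBoost_1
```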

If you don't specify leaderboard_frame, then AutoML uses either validation scores or cross-validation scores. When you use nfolds > 1, it will use cross-validation scores. If you use nfolds = 0, it will use validation scores (running without CV). If you keep the default (nfolds = -1), AutoML will try to estimate whether it's better to use CV or not: if you use max_models it will use CV; if you use max_runtime_secs it will decide using some heuristic, depending on the data size, max_runtime_secs, and the number of CPU cores.
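The decision above can be summarized in a small function (this mirrors the explanation, not the actual h2o-3 code; the real heuristic for the max_runtime_secs case is more involved):

```python
def scoring_source(nfolds, uses_max_models=False):
    """Toy sketch of which scores feed the leaderboard when no
    leaderboard_frame is given, per the explanation above."""
    if nfolds > 1:
        return "cross-validation scores"
    if nfolds == 0:
        return "validation scores"
    # nfolds == -1 (default): AutoML decides; with max_models it uses CV,
    # otherwise a heuristic based on data size, time budget, and CPU cores.
    return "cross-validation scores" if uses_max_models else "heuristic decision"

print(scoring_source(5))                         # cross-validation scores
print(scoring_source(0))                         # validation scores
print(scoring_source(-1, uses_max_models=True))  # cross-validation scores
```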

> Thank you for your time, and please forgive my English.

Don't worry about it, I'm also not a native English speaker.

feihongloveworld commented 1 year ago

@tomasfryda Thank you so much. Another question: why was SVM not included in AutoML?

tomasfryda commented 1 year ago

@feihongloveworld I'm not sure, but I think it's not included because of the lack of MOJO (model serialization) support, though that might be only part of the reason. I'm not aware of any benchmarks we ran with it to decide whether to include it.