h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.78k stars · 1.99k forks

GBM algorithm with H2O AutoML fails on small dataset #16127

Open magrenimish opened 3 months ago

magrenimish commented 3 months ago

With H2O==3.44.0.1, the GBM algorithm within the H2O AutoML function fails for a small dataset. Following is the error message I see:

```
AutoML progress: |                                                                                   |   0%
18:23:48.910: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.

AutoML progress: |███████████████████████████████████████████████████████████████████████████████████ (failed)| 100%

18:58:48.594: GBM_grid_1_AutoML_1_20240315_182348 [GBM Grid Search] failed: java.util.NoSuchElementException: No more elements to explore in hyper-space!

Traceback (most recent call last):
  File "gbm_trial.py", line 27, in <module>
    aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/automl/_estimator.py", line 682, in train
    self._job.poll(poll_updates=poll_updates)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 89, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
OSError: Job with key $03017f00000132d4ffffffff$_b3beac9d2c78e11fed761fec428942d3 failed with an exception: java.lang.NullPointerException
stacktrace:
java.lang.NullPointerException
    at ai.h2o.automl.AutoML.cleanUpModelsCVPreds(AutoML.java:910)
    at ai.h2o.automl.AutoML.stop(AutoML.java:525)
    at ai.h2o.automl.AutoML.run(AutoML.java:496)
    at ai.h2o.automl.H2OJob$1.compute2(H2OJob.java:33)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1689)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
```

Closing connection _sid_b08a at exit H2O session _sid_b08a closed.

Following is the code I used:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0,
                      has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
aml = H2OAutoML(max_runtime_secs=10000, include_algos=["GBM"])
aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
h2o.shutdown()
```

tomasfryda commented 3 months ago

@magrenimish Thank you for creating this issue and bringing it to our attention. AutoML should have failed with a nicer message, e.g., `No model was trained.` GBM requires more data to be trained, as mentioned in the warning: `The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.`
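The arithmetic behind that warning can be illustrated with a small sketch (`can_split` is a hypothetical helper for illustration, not an H2O API): with `min_rows=100`, each child of a tree split must keep at least 100 weighted rows, so the node being split needs at least 2 × 100 = 200 weighted rows, and a 111-row frame can never satisfy that.

```python
# Illustration only: `can_split` is a hypothetical helper, not part of H2O.
# With min_rows=100, each child of a split must retain at least 100 (weighted)
# rows, so the parent node needs at least 2 * min_rows rows to split at all.
def can_split(weighted_rows: float, min_rows: float = 100.0) -> bool:
    return weighted_rows >= 2 * min_rows

print(can_split(111.0, 100.0))  # the 111-row frame from this issue -> False
print(can_split(200.0, 100.0))  # the minimum the warning asks for  -> True
```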

magrenimish commented 3 months ago

@tomasfryda would it then be possible to skip or exclude the GBM algorithm in H2O AutoML without explicitly specifying it with the `exclude_algos` parameter? For example, with the following code:

```python
fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0,
                      has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
aml = H2OAutoML(max_runtime_secs=10000)
aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
h2o.shutdown()
```

The function fails with GBM, but would it be possible to skip GBM in this case?

tomasfryda commented 3 months ago

@magrenimish that's basically what should happen. AutoML doesn't want to know about the underlying constraints of individual models, so each model first runs its own parameter/training-data validation logic, and if that fails, the model won't train. The validation logic is also responsible for emitting the warning that tells the user what went wrong (e.g., `The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.`).

It's hard to automatically exclude a whole class of models, since each model in AutoML has different parameters and the failures often depend on those parameters.

magrenimish commented 3 months ago

@tomasfryda thank you! So if I want the AutoML function to continue without the GBM algorithm, I either have to explicitly exclude it with the `exclude_algos` parameter or catch the specific error and skip the algorithm?
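The catch-and-retry half of that question could be sketched as a small generic helper (hypothetical code, not an H2O API; `train_fn` stands in for a closure around `H2OAutoML(...).train(...)`, and per the traceback above the job failure surfaces in Python as an `OSError`):

```python
# Hypothetical sketch of "catch the error and retry without GBM".
# `train_fn` stands in for a closure around H2OAutoML(...).train(...);
# in this issue the job failure surfaced as an OSError (see traceback above).
def train_with_fallback(train_fn, fallback_excludes):
    try:
        # First attempt: all algorithms enabled (empty exclude list).
        return train_fn([])
    except OSError:
        # Retry, excluding the algorithms suspected of making the job fail.
        return train_fn(fallback_excludes)

# Stub demo: pretend training fails unless GBM is excluded.
def fake_train(exclude_algos):
    if "GBM" not in exclude_algos:
        raise OSError("Job failed with an exception")
    return "leaderboard"

print(train_with_fallback(fake_train, ["GBM"]))  # -> leaderboard
```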

tomasfryda commented 3 months ago

@magrenimish you can just ignore the warning.

When I run your code, I can still get the AutoML to train, and it looks like some GBMs have parameters that enable training with a low amount of data:

```
In [3]: fr = h2o.create_frame(rows=111, cols=29, real_fraction=1.0, categorical_fraction=0, has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)

In [6]: from h2o.automl import H2OAutoML

In [7]: aml = H2OAutoML(max_runtime_secs=100)

In [8]: aml.train(x=fr.columns[:-1], y=fr.columns[-1], training_frame=fr)
AutoML progress: |▉                                                                                  |   1%
16:41:20.27: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 111.0.

AutoML progress: |███████████████████████████████████████████████████████████████████████████████████ (done)| 100%

In [9]: aml.leaderboard
Out[9]:
model_id                                                                     rmse      mse      mae    rmsle    mean_residual_deviance
------------------------------------------------------------------------  -------  -------  -------  -------  ------------------------
GBM_grid_1_AutoML_1_20240318_164118_model_49                              56.1927  3157.62  49.7652      nan                   3157.62
GBM_grid_1_AutoML_1_20240318_164118_model_10                              56.364   3176.9   49.8405      nan                   3176.9
GBM_grid_1_AutoML_1_20240318_164118_model_8                               56.4429  3185.8   49.9569      nan                   3185.8
GBM_grid_1_AutoML_1_20240318_164118_model_17                              56.459   3187.62  50.0887      nan                   3187.62
GBM_grid_1_AutoML_1_20240318_164118_model_52                              56.472   3189.08  49.9436      nan                   3189.08
GBM_grid_1_AutoML_1_20240318_164118_model_21                              56.5403  3196.8   49.9926      nan                   3196.8
GBM_grid_1_AutoML_1_20240318_164118_model_46                              56.6061  3204.25  50.4635      nan                   3204.25
StackedEnsemble_BestOfFamily_5_AutoML_1_20240318_164118                   56.6463  3208.8   50.668       nan                   3208.8
GBM_grid_1_AutoML_1_20240318_164118_model_32                              56.8033  3226.62  50.2397      nan                   3226.62
XGBoost_lr_search_selection_AutoML_1_20240318_164118_select_grid_model_6  56.8386  3230.63  50.8662      nan                   3230.63
[166 rows x 6 columns]
```