h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

h2o.grid with GAM + CV: NPE #6562

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The following examples show that building a GAM with cross-validation via fold_column (and also via nfolds) results in a NullPointerException only when the model is built inside h2o.grid, not when it is built directly with h2o.gam.

{code:r}packageVersion("h2o")

[1] ‘3.36.1.4’

mt <- as.h2o(mtcars)

making mt a bigger dataset with triplicates

mt <- h2o.rbind(mt, mt, mt)

mt$fold <- h2o.kfold_column(data = mt, nfolds = 3, seed = 123)

regular non-CV grid search

mt_grid <- h2o.grid( algorithm = "gam", x = c("cyl", "am"), y = "mpg", training_frame = mt, lambda = 0, keep_gam_cols = TRUE, hyper_params = list( gam_columns = list("hp"), num_knots = list(3, 4), spline_orders = list(3), bs = list(0, 1), scale = list(0.01) ) )

mt_grid

H2O Grid Details

================

Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61

Used hyper parameters:

- bs

- gam_columns

- num_knots

- scale

- spline_orders

Number of models: 4

Number of failed models: 0

Hyper-Parameter Search Summary: ordered by increasing residual_deviance

bs gam_columns num_knots scale spline_orders model_ids residual_deviance

1 1 [Ljava.lang.String;@5e2f3042 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_4 515.44363

2 1 [Ljava.lang.String;@4d9dce 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_2 544.60905

3 0 [Ljava.lang.String;@67aed100 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_3 545.83712

4 0 [Ljava.lang.String;@1479990b 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_1 590.73138

same grid search, now with the "fold" column for CV models

mt_grid_w_fold <- h2o.grid( algorithm = "gam", x = c("cyl", "am"), y = "mpg", fold_column = "fold", # <- new arg training_frame = mt, lambda = 0, keep_gam_cols = TRUE, hyper_params = list( gam_columns = list("hp"), num_knots = list(3, 4), spline_orders = list(3), bs = list(0, 1), scale = list(0.01) ) )

failure with NPE

mt_grid_w_fold

H2O Grid Details

================

Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_68

Used hyper parameters:

- bs

- gam_columns

- num_knots

- scale

- spline_orders

Number of models: 0

Number of failed models: 4

NULL

Failed models

-------------

bs gam_columns num_knots scale spline_orders status_failed msgs_failed

[0] [[Ljava.lang.String;@3bc1cddc] [3] [0.01] [3] FAIL "NA"

[1] [[Ljava.lang.String;@743b2eef] [3] [0.01] [3] FAIL "NA"

[0] [[Ljava.lang.String;@39f5cc27] [4] [0.01] [3] FAIL "NA"

[1] [[Ljava.lang.String;@35f21357] [4] [0.01] [3] FAIL "NA"

all failed with same error message, copying just one

summary(mt_grid_w_fold, show_stack_traces = TRUE)

Number of failed models: 4

- NAjava.lang.NullPointerException

at hex.ModelBuilder.cv_AssignFold(ModelBuilder.java:693)

at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:613)

at hex.ModelBuilder.trainModelNested(ModelBuilder.java:432)

at hex.ModelBuilder$TrainModelNestedRunnable.run(ModelBuilder.java:467)

at water.H2O.runOnH2ONode(H2O.java:1565)

at water.H2O.runOnH2ONode(H2O.java:1554)

at hex.ModelBuilder.trainModelNested(ModelBuilder.java:447)

at hex.grid.GridSearch.buildModel(GridSearch.java:584)

at hex.grid.GridSearch.gridSearch(GridSearch.java:424)

at hex.grid.GridSearch.access$900(GridSearch.java:70)

at hex.grid.GridSearch$1.compute2(GridSearch.java:165)

at water.H2O$H2OCountedCompleter.compute(H2O.java:1677)

at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)

at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)

at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)

at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)

at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

switch to gam (non-grid search) and picking some hyperparams from above

mt_gam_w_fold <- h2o.gam( x = c("cyl", "am"), y = "mpg", fold_column = "fold", training_frame = mt, lambda = 0, keep_gam_cols = TRUE, gam_columns = "hp", num_knots = 3, spline_orders = 3, bs = 1, scale = 0.01 )

can build a CV-gam

mt_gam_w_fold

Model Details:

==============

H2ORegressionModel: gam

Model ID: GAM_model_R_1659972808725_69

glm model summary:

family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame

1 gaussian identity None 4 4 1 _95a94cf788150ab36f2652a2d33644a2

GAM Coefficients:

names coefficients standardized_coefficients

1 cyl -0.649154 -1.147070

2 am 3.438237 1.697494

3 hp_1_center_cs_0 8.604758 6.683910

4 hp_1_center_cs_1 -0.081489 -5.224166

5 Intercept 22.710483 20.090625

H2ORegressionMetrics: gam

Reported on training data.

MSE: 5.673011

RMSE: 2.381808

MAE: 1.832383

RMSLE: 0.1187604

Mean Residual Deviance : 5.673011

R^2 : 0.8387844

Null Deviance :3378.142

Null D.o.F. :95

Residual Deviance :544.6091

Residual D.o.F. :91

AIC :451.0653

H2ORegressionMetrics: gam

Reported on cross-validation data.

3-fold cross-validation on training data (Metrics computed for combined holdout predictions)

MSE: 6.703524

RMSE: 2.589116

MAE: 1.949735

RMSLE: 0.1262932

Mean Residual Deviance : 6.703524

R^2 : 0.8094993

Null Deviance :3392.581

Null D.o.F. :95

Residual Deviance :643.5383

Residual D.o.F. :91

AIC :467.089

GLM cross-validation metrics summary:

names mean sd cv_1_valid cv_2_valid cv_3_valid

1 mae 1.926825 0.318256 1.573309 2.016653 2.190513

2 mean_residual_deviance 6.459042 2.800091 3.892476 6.039368 9.445283

3 mse 6.459042 2.800091 3.892476 6.039368 9.445283

4 null_deviance 1130.860500 230.384840 1176.137700 881.198360 1335.245400

5 r2 0.810362 0.085509 0.903969 0.790765 0.736352

6 residual_deviance 214.512760 121.767790 112.881790 181.181030 349.475460

7 rmse 2.501255 0.551494 1.972936 2.457513 3.073318

8 rmsle 0.123607 0.019913 0.103134 0.124779 0.142908

{code}

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Paul: Thanks, will take a look. W

exalate-issue-sync[bot] commented 1 year ago

Paul Donnelly commented: Hi, I was wondering whether you were able to reproduce this issue on your system?

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8796
Assignee: New H2O Bugs
Reporter: Paul Donnelly
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

hutch3232 commented 1 year ago

I'm the original reporter on Jira. Reposting with better formatting.


The following examples show that building a GAM with cross-validation via fold_column (and also via nfolds) results in a NullPointerException only when the model is built inside h2o.grid, not when it is built directly with h2o.gam.

library(h2o)
h2o.init()  # start or connect to a local H2O cluster

packageVersion("h2o")

# [1] ‘3.36.1.4’
mt <- as.h2o(mtcars)

# making mt a bigger dataset with triplicates
mt <- h2o.rbind(mt, mt, mt)

mt$fold <- h2o.kfold_column(data = mt, nfolds = 3, seed = 123)

# regular non-CV grid search
mt_grid <- h2o.grid(
  algorithm = "gam",
  x = c("cyl", "am"),
  y = "mpg",
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  hyper_params = list(
    gam_columns = list("hp"),
    num_knots = list(3, 4),
    spline_orders = list(3),
    bs = list(0, 1),
    scale = list(0.01)
  )
)

mt_grid

# H2O Grid Details
# ================
# Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61
# Used hyper parameters:
# - bs
# - gam_columns
# - num_knots
# - scale
# - spline_orders
# Number of models: 4
# Number of failed models: 0
# Hyper-Parameter Search Summary: ordered by increasing residual_deviance
# bs gam_columns num_knots scale spline_orders model_ids residual_deviance
# 1 1 [Ljava.lang.String;@5e2f3042 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_4 515.44363
# 2 1 [Ljava.lang.String;@4d9dce 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_2 544.60905
# 3 0 [Ljava.lang.String;@67aed100 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_3 545.83712
# 4 0 [Ljava.lang.String;@1479990b 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_1 590.73138

# same grid search, now with the "fold" column for CV models
mt_grid_w_fold <- h2o.grid(
  algorithm = "gam",
  x = c("cyl", "am"),
  y = "mpg",
  fold_column = "fold", # <- new arg
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  hyper_params = list(
    gam_columns = list("hp"),
    num_knots = list(3, 4),
    spline_orders = list(3),
    bs = list(0, 1),
    scale = list(0.01)
  )
)

# failure with NPE
mt_grid_w_fold

# H2O Grid Details
# ================
# Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_68
# Used hyper parameters:
# - bs
# - gam_columns
# - num_knots
# - scale
# - spline_orders
# Number of models: 0
# Number of failed models: 4
# NULL
# Failed models
# -------------
# bs gam_columns num_knots scale spline_orders status_failed msgs_failed
# [0] [[Ljava.lang.String;@3bc1cddc] [3] [0.01] [3] FAIL "NA"
# [1] [[Ljava.lang.String;@743b2eef] [3] [0.01] [3] FAIL "NA"
# [0] [[Ljava.lang.String;@39f5cc27] [4] [0.01] [3] FAIL "NA"
# [1] [[Ljava.lang.String;@35f21357] [4] [0.01] [3] FAIL "NA"

# all failed with same error message, copying just one
summary(mt_grid_w_fold, show_stack_traces = TRUE)

# Number of failed models: 4
# - NAjava.lang.NullPointerException
# at hex.ModelBuilder.cv_AssignFold(ModelBuilder.java:693)
# at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:613)
# at hex.ModelBuilder.trainModelNested(ModelBuilder.java:432)
# at hex.ModelBuilder$TrainModelNestedRunnable.run(ModelBuilder.java:467)
# at water.H2O.runOnH2ONode(H2O.java:1565)
# at water.H2O.runOnH2ONode(H2O.java:1554)
# at hex.ModelBuilder.trainModelNested(ModelBuilder.java:447)
# at hex.grid.GridSearch.buildModel(GridSearch.java:584)
# at hex.grid.GridSearch.gridSearch(GridSearch.java:424)
# at hex.grid.GridSearch.access$900(GridSearch.java:70)
# at hex.grid.GridSearch$1.compute2(GridSearch.java:165)
# at water.H2O$H2OCountedCompleter.compute(H2O.java:1677)
# at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
# at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
# at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
# at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
# at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

# switch to gam (non-grid search) and picking some hyperparams from above
mt_gam_w_fold <- h2o.gam(
  x = c("cyl", "am"),
  y = "mpg",
  fold_column = "fold",
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  gam_columns = "hp",
  num_knots = 3,
  spline_orders = 3,
  bs = 1,
  scale = 0.01
)

# can build a CV-gam
mt_gam_w_fold

# Model Details:
# ==============
# H2ORegressionModel: gam
# Model ID: GAM_model_R_1659972808725_69
# glm model summary:
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 gaussian identity None 4 4 1 _95a94cf788150ab36f2652a2d33644a2
# GAM Coefficients:
# names coefficients standardized_coefficients
# 1 cyl -0.649154 -1.147070
# 2 am 3.438237 1.697494
# 3 hp_1_center_cs_0 8.604758 6.683910
# 4 hp_1_center_cs_1 -0.081489 -5.224166
# 5 Intercept 22.710483 20.090625
# H2ORegressionMetrics: gam
# ** Reported on training data. **
# MSE: 5.673011
# RMSE: 2.381808
# MAE: 1.832383
# RMSLE: 0.1187604
# Mean Residual Deviance : 5.673011
# R^2 : 0.8387844
# Null Deviance :3378.142
# Null D.o.F. :95
# Residual Deviance :544.6091
# Residual D.o.F. :91
# AIC :451.0653
# H2ORegressionMetrics: gam
# ** Reported on cross-validation data. **
# ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
# MSE: 6.703524
# RMSE: 2.589116
# MAE: 1.949735
# RMSLE: 0.1262932
# Mean Residual Deviance : 6.703524
# R^2 : 0.8094993
# Null Deviance :3392.581
# Null D.o.F. :95
# Residual Deviance :643.5383
# Residual D.o.F. :91
# AIC :467.089
# GLM cross-validation metrics summary:
# names mean sd cv_1_valid cv_2_valid cv_3_valid
# 1 mae 1.926825 0.318256 1.573309 2.016653 2.190513
# 2 mean_residual_deviance 6.459042 2.800091 3.892476 6.039368 9.445283
# 3 mse 6.459042 2.800091 3.892476 6.039368 9.445283
# 4 null_deviance 1130.860500 230.384840 1176.137700 881.198360 1335.245400
# 5 r2 0.810362 0.085509 0.903969 0.790765 0.736352
# 6 residual_deviance 214.512760 121.767790 112.881790 181.181030 349.475460
# 7 rmse 2.501255 0.551494 1.972936 2.457513 3.073318
# 8 rmsle 0.123607 0.019913 0.103134 0.124779 0.142908
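
Since h2o.gam builds the CV model while h2o.grid fails, one possible stopgap is to loop over the hyperparameter combinations by hand. The sketch below is only an illustration under assumptions: `gam_cv_manual_grid` is a hypothetical helper name (not part of the h2o package), it hard-codes the settings from the example above, it ranks by cross-validated RMSE rather than residual deviance, and only the bs = 1, num_knots = 3 combination has been shown to succeed in this report.

```r
# All combinations of the two hyperparameters that vary in the grid above.
combos <- expand.grid(num_knots = c(3, 4), bs = c(0, 1))

# Hypothetical helper (not part of the h2o package): fits one cross-validated
# GAM per row of `combos` with h2o.gam() and ranks the fits by CV RMSE.
# Calling it requires the h2o package and a running H2O cluster.
gam_cv_manual_grid <- function(frame, combos) {
  models <- lapply(seq_len(nrow(combos)), function(i) {
    h2o::h2o.gam(
      x = c("cyl", "am"),
      y = "mpg",
      fold_column = "fold",
      training_frame = frame,
      lambda = 0,
      keep_gam_cols = TRUE,
      gam_columns = "hp",
      num_knots = combos$num_knots[i],
      spline_orders = 3,
      bs = combos$bs[i],
      scale = 0.01
    )
  })
  # Collect model IDs and the cross-validation RMSE of each fit.
  combos$model_id <- vapply(models, function(m) m@model_id, character(1))
  combos$cv_rmse <- vapply(
    models,
    function(m) h2o::h2o.rmse(m, xval = TRUE),
    numeric(1)
  )
  # Best (lowest CV RMSE) combination first.
  combos[order(combos$cv_rmse), ]
}

# Usage, assuming `mt` is the H2OFrame with the "fold" column built above:
# gam_cv_manual_grid(mt, combos)
```

This gives up the bookkeeping that h2o.grid provides (grid ID, failed-model summary), so it is a workaround sketch rather than a replacement for a fix in the grid search itself.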