Open exalate-issue-sync[bot] opened 1 year ago
Wendy Wong commented: Paul: Thanks, will take a look. W
Paul Donnelly commented: Hi, I was wondering if you were able to reproduce this issue on your system?
JIRA Issue Details
Jira Issue: PUBDEV-8796
Assignee: New H2O Bugs
Reporter: Paul Donnelly
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
I'm the original reporter on Jira. Reposting with better formatting.
The following examples show that building a GAM with cross-validation via fold_column (and also nfolds) results in a NullPointerException only when used inside h2o.grid, but not h2o.gam.
library(h2o)
h2o.init()
packageVersion("h2o")
# [1] ‘3.36.1.4’
mt <- as.h2o(mtcars)
# making mt a bigger dataset with triplicates
mt <- h2o.rbind(mt, mt, mt)
mt$fold <- h2o.kfold_column(data = mt, nfolds = 3, seed = 123)
# regular non-CV grid search
mt_grid <- h2o.grid(
  algorithm = "gam",
  x = c("cyl", "am"),
  y = "mpg",
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  hyper_params = list(
    gam_columns = list("hp"),
    num_knots = list(3, 4),
    spline_orders = list(3),
    bs = list(0, 1),
    scale = list(0.01)
  )
)
mt_grid
# H2O Grid Details
# ================
# Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61
# Used hyper parameters:
# - bs
# - gam_columns
# - num_knots
# - scale
# - spline_orders
# Number of models: 4
# Number of failed models: 0
# Hyper-Parameter Search Summary: ordered by increasing residual_deviance
# bs gam_columns num_knots scale spline_orders model_ids residual_deviance
# 1 1 [Ljava.lang.String;@5e2f3042 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_4 515.44363
# 2 1 [Ljava.lang.String;@4d9dce 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_2 544.60905
# 3 0 [Ljava.lang.String;@67aed100 4 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_3 545.83712
# 4 0 [Ljava.lang.String;@1479990b 3 0.01 3 Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_61_model_1 590.73138
# same grid search, now with the "fold" column for CV models
mt_grid_w_fold <- h2o.grid(
  algorithm = "gam",
  x = c("cyl", "am"),
  y = "mpg",
  fold_column = "fold", # <- new arg
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  hyper_params = list(
    gam_columns = list("hp"),
    num_knots = list(3, 4),
    spline_orders = list(3),
    bs = list(0, 1),
    scale = list(0.01)
  )
)
# failure with NPE
mt_grid_w_fold
# H2O Grid Details
# ================
# Grid ID: Grid_GAM_RTMP_sid_91a4_37_model_R_1659972808725_68
# Used hyper parameters:
# - bs
# - gam_columns
# - num_knots
# - scale
# - spline_orders
# Number of models: 0
# Number of failed models: 4
# NULL
# Failed models
# -------------
# bs gam_columns num_knots scale spline_orders status_failed msgs_failed
# [0] [[Ljava.lang.String;@3bc1cddc] [3] [0.01] [3] FAIL "NA"
# [1] [[Ljava.lang.String;@743b2eef] [3] [0.01] [3] FAIL "NA"
# [0] [[Ljava.lang.String;@39f5cc27] [4] [0.01] [3] FAIL "NA"
# [1] [[Ljava.lang.String;@35f21357] [4] [0.01] [3] FAIL "NA"
# all failed with same error message, copying just one
summary(mt_grid_w_fold, show_stack_traces = TRUE)
# Number of failed models: 4
# - NAjava.lang.NullPointerException
# at hex.ModelBuilder.cv_AssignFold(ModelBuilder.java:693)
# at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:613)
# at hex.ModelBuilder.trainModelNested(ModelBuilder.java:432)
# at hex.ModelBuilder$TrainModelNestedRunnable.run(ModelBuilder.java:467)
# at water.H2O.runOnH2ONode(H2O.java:1565)
# at water.H2O.runOnH2ONode(H2O.java:1554)
# at hex.ModelBuilder.trainModelNested(ModelBuilder.java:447)
# at hex.grid.GridSearch.buildModel(GridSearch.java:584)
# at hex.grid.GridSearch.gridSearch(GridSearch.java:424)
# at hex.grid.GridSearch.access$900(GridSearch.java:70)
# at hex.grid.GridSearch$1.compute2(GridSearch.java:165)
# at water.H2O$H2OCountedCompleter.compute(H2O.java:1677)
# at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
# at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
# at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
# at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
# at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
# switch to h2o.gam (no grid search), picking some hyperparameters from above
mt_gam_w_fold <- h2o.gam(
  x = c("cyl", "am"),
  y = "mpg",
  fold_column = "fold",
  training_frame = mt,
  lambda = 0,
  keep_gam_cols = TRUE,
  gam_columns = "hp",
  num_knots = 3,
  spline_orders = 3,
  bs = 1,
  scale = 0.01
)
# can build a CV-gam
mt_gam_w_fold
# Model Details:
# ==============
# H2ORegressionModel: gam
# Model ID: GAM_model_R_1659972808725_69
# glm model summary:
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 gaussian identity None 4 4 1 _95a94cf788150ab36f2652a2d33644a2
# GAM Coefficients:
# names coefficients standardized_coefficients
# 1 cyl -0.649154 -1.147070
# 2 am 3.438237 1.697494
# 3 hp_1_center_cs_0 8.604758 6.683910
# 4 hp_1_center_cs_1 -0.081489 -5.224166
# 5 Intercept 22.710483 20.090625
# H2ORegressionMetrics: gam
# ** Reported on training data. **
# MSE: 5.673011
# RMSE: 2.381808
# MAE: 1.832383
# RMSLE: 0.1187604
# Mean Residual Deviance : 5.673011
# R^2 : 0.8387844
# Null Deviance :3378.142
# Null D.o.F. :95
# Residual Deviance :544.6091
# Residual D.o.F. :91
# AIC :451.0653
# H2ORegressionMetrics: gam
# ** Reported on cross-validation data. **
# ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
# MSE: 6.703524
# RMSE: 2.589116
# MAE: 1.949735
# RMSLE: 0.1262932
# Mean Residual Deviance : 6.703524
# R^2 : 0.8094993
# Null Deviance :3392.581
# Null D.o.F. :95
# Residual Deviance :643.5383
# Residual D.o.F. :91
# AIC :467.089
# GLM cross-validation metrics summary:
# names mean sd cv_1_valid cv_2_valid cv_3_valid
# 1 mae 1.926825 0.318256 1.573309 2.016653 2.190513
# 2 mean_residual_deviance 6.459042 2.800091 3.892476 6.039368 9.445283
# 3 mse 6.459042 2.800091 3.892476 6.039368 9.445283
# 4 null_deviance 1130.860500 230.384840 1176.137700 881.198360 1335.245400
# 5 r2 0.810362 0.085509 0.903969 0.790765 0.736352
# 6 residual_deviance 214.512760 121.767790 112.881790 181.181030 349.475460
# 7 rmse 2.501255 0.551494 1.972936 2.457513 3.073318
# 8 rmsle 0.123607 0.019913 0.103134 0.124779 0.142908
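Since h2o.gam with fold_column succeeded above while h2o.grid did not, one possible stopgap is to loop over the hyperparameter combinations by hand. The following is only a sketch under that assumption: fit_cv_gams and cv_rmse are hypothetical helper names (not part of h2o), and only the bs = 1, num_knots = 3 combination was actually confirmed to build in this report.

```r
# Hyperparameter combinations taken from the grid in the report.
combos <- expand.grid(num_knots = c(3, 4), bs = c(0, 1))

# Hypothetical helper (not part of h2o): fit one CV GAM per combination
# by calling h2o.gam() directly instead of going through h2o.grid().
fit_cv_gams <- function(frame, combos) {
  lapply(seq_len(nrow(combos)), function(i) {
    h2o::h2o.gam(
      x = c("cyl", "am"),
      y = "mpg",
      fold_column = "fold",
      training_frame = frame,
      lambda = 0,
      keep_gam_cols = TRUE,
      gam_columns = "hp",
      num_knots = combos$num_knots[i],
      spline_orders = 3,
      bs = combos$bs[i],
      scale = 0.01
    )
  })
}

# Hypothetical helper: collect the cross-validated RMSE of each fitted model.
cv_rmse <- function(models) {
  vapply(models, function(m) h2o::h2o.rmse(m, xval = TRUE), numeric(1))
}
```

With the mt frame from above, models <- fit_cv_gams(mt, combos) followed by combos$cv_rmse <- cv_rmse(models) would give a table comparable to the grid summary. This does not fix the underlying NPE in the grid code path.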