h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GLM fit fails if interactions present with lambda_search=TRUE #8876

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I have a problem where a glm fit is failing (from R) if I have lambda_search = TRUE and I include interactions. If I set the solver to COORDINATE_DESCENT_NAIVE then it works alright. The error I am getting is "water.exceptions.H2OConcurrentModificationException: Rollups not possible, because Vec was deleted". On another dataset I am instead getting an ArrayIndexOutOfBoundsException, but it is occurring under the same circumstances. I am not sure if the different error is due to it using a different solver.

I originally encountered it while using h2o.grid, but it happens with plain h2o.glm too.

I did already report this via the gitter chat but thought I should formally create an issue here.

Code that reproduces the issue for me is:

{code:R}
library(h2o)

set.seed(1234)

v1 <- rnorm(10000, 5, 10)
v2 <- rnorm(10000, 0, 1)
v3 <- rnorm(10000, 5, 11)
v4 <- factor(sample(c("A", "B", "C", "D", "E"), 10000, replace=TRUE, prob=c(0.4,0.3,0.1,0.1,0.1)))
v5 <- factor(sample(c("F","T"), 10000, replace=TRUE, prob=c(0.7,0.3)))

# The "*" operators and closing parentheses were lost in the issue export;
# they are restored here on the assumption that adjacent terms were multiplied.
y <- rpois(10000, exp(0.0001 * (5 * v1 + ifelse(v5=="T",6,0) * v2 + 7 * v2 + 3 * (v5=="F") +
  ((v4 == "A") * 3 + (v4 == "B") * 6 + (v4 == "C") * 1 + (v4 == "D") * 9 + (v4 == "E") * 20) * v3 * 19)))

h2o.init(nthreads=3, min_mem_size = "8G", enable_assertions=FALSE)

h2odat <- as.h2o(cbind(y, v1, v2, v3, v4, v5))

h2o.glm(y="y", training_frame = h2odat, family="poisson", nfolds=10,
        interaction_pairs=list(c("v2", "v5"), c("v4", "v3")),
        lambda_search=TRUE, solver="AUTO")
{code}
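To narrow down whether the failure depends on the solver, the same call can be repeated per solver with the error captured instead of aborting the script. The following is only a sketch: `compare_solvers` and `fit_glm` are hypothetical helper names (not part of the h2o API), and `fit_glm` assumes the `h2odat` frame from the reproduction already exists on a running cluster.

```r
# Run a fitting function once per solver and record whether it succeeded.
# `fit_fn` is a hypothetical one-argument wrapper, not part of the h2o API.
compare_solvers <- function(fit_fn, solvers) {
  vapply(solvers, function(s) {
    tryCatch({
      fit_fn(s)
      "ok"
    }, error = function(e) paste("failed:", conditionMessage(e)))
  }, character(1))
}

# Wrapper around the reproduction call; assumes `h2odat` already exists.
# Nothing is sent to the cluster until the wrapper is actually called.
fit_glm <- function(solver) {
  h2o::h2o.glm(y = "y", training_frame = h2odat, family = "poisson", nfolds = 10,
               interaction_pairs = list(c("v2", "v5"), c("v4", "v3")),
               lambda_search = TRUE, solver = solver)
}
```

Calling `compare_solvers(fit_glm, c("AUTO", "COORDINATE_DESCENT_NAIVE"))` would then return a named character vector with one entry per solver, which makes it easy to attach the exact error message for each solver to the report.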

exalate-issue-sync[bot] commented 1 year ago

Marc Burgess commented: I experimented with this a bit more. If I run a grid search over both alpha and lambda it doesn’t fail. However, if I turn on remove_collinear_columns, most of the grid search calculations complete but a minority fail with an ArrayIndexOutOfBounds error (I was also using 1 000 000 samples instead of 10 000).

{code}
h2ores <- h2o.grid(algorithm="glm", grid_id = "h2o_search", y="y", training_frame=h2odat,
                   family="poisson",
                   interaction_pairs=list(c("v2", "v5"), c("v4", "v3")),
                   nfolds=10, remove_collinear_columns=TRUE,
                   hyper_params=list(alpha=c(0,0.2,0.4,0.6,0.8,1.0),
                                     lambda=c(1.0,0.5,0.1,0.01,0.001,0.0001,0.00001,0)))
{code}
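For a sense of scale, the hyper-parameter lists in that grid call can be enumerated in base R. This sketch assumes the grid is searched exhaustively (every alpha paired with every lambda); the variable names are illustrative only.

```r
# Hyper-parameter values copied from the h2o.grid call.
hyper_params <- list(alpha = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
                     lambda = c(1.0, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0))

# Cartesian product: one row per alpha/lambda pair the grid would attempt.
combos <- expand.grid(hyper_params)
n_models <- nrow(combos)  # 6 alphas x 8 lambdas
```

Under that assumption the grid covers 48 models, so the failing minority can be reported as a count out of 48.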

Stack trace for one of the failed grid entries:

{code}
Hyper-parameter: alpha, 1
Hyper-parameter: lambda, 1e-04
[2019-08-08 12:09:32] failure_details: NA
[2019-08-08 12:09:32] failure_stack_traces: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at hex.DataInfo.coefNames(DataInfo.java:723)
    at hex.glm.GLM$GLMDriver.ADMM_solve(GLM.java:636)
    at hex.glm.GLM$GLMDriver.fitIRLSM(GLM.java:832)
    at hex.glm.GLM$GLMDriver.fitModel(GLM.java:1098)
    at hex.glm.GLM$GLMDriver.computeSubmodel(GLM.java:1191)
    at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1279)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:222)
    at hex.glm.GLM$GLMDriver.compute2(GLM.java:588)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1417)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
{code}

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6757
Assignee: New H2O Bugs
Reporter: Marc Burgess
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A