h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Only shows results for a small fraction of the iterations #9185

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I am using h2o.glm with coordinate descent. After fitting the model, the model summary says that number_of_iterations = 23. Same with the scoring history.

However, from FLOW I can see that there were actually thousands of iterations done, and I need the scoring history for these iterations as well.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Are you using lambda search?

Can you please provide an example h2o.glm invocation?

exalate-issue-sync[bot] commented 1 year ago

James Hirschorn commented: Yes, I am.

However, I'm not sure that this is actually a bug. I was using cross-validation to choose the best lambda.

I'm actually not even sure what the Scoring History represents in this case. Are all of the values calculated over all folds (even "predictors")? Here is the invocation:

model_h2o <- h2o.glm(predictors, label, training_frame = all_h2o, family = 'binomial', alpha = 1, ignore_const_cols = FALSE, lambda = NULL, lambda_search = TRUE, nlambda = 50, early_stopping = FALSE, nfolds = 6)

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Hi James:

If you enable lambda search, we will return to you the model with the lowest error.

If you enable CV, we will also return all 6 cross-validation models, plus the one final model, for you to examine if you go to Flow.

I tried to reproduce your problem with the following setup:

pros.hex <- h2o.importFile(path="http://s3.amazon.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
pros.hex[,2] <- as.factor(pros.hex[,2])
pros.hex[,4] <- as.factor(pros.hex[,4])
pros.hex[,5] <- as.factor(pros.hex[,5])
pros.hex[,6] <- as.factor(pros.hex[,6])
pros.hex[,9] <- as.factor(pros.hex[,9])

glmCV <- h2o.glm(x=3:9, y=2, training_frame=pros.hex, family='binomial', alpha=1, ignore_const_cols=FALSE, lambda=NULL, lambda_search=TRUE, nlambda=50, early_stopping=FALSE, nfolds=6)

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Going to my Flow, I was able to see all the models, the number of iterations, and the scoring history showing the actual number of iterations. However, I do notice that it doesn't show results for every single iteration: I see results only for odd-numbered iterations until the last 10 or so. You can go to R to see results for all iterations.

However, I did notice something strange with the results.

During lambda search, we train one model per lambda value and then move on to the next lambda value. In the end, we return the model with the best result to you. Not sure what is going on here. Let me spend a little more time and get back to you on this issue. Thank you for bringing it up to us.

Wendy
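The suggestion above to inspect the iterations from R rather than Flow could look roughly like the following. This is a minimal sketch, not a confirmed fix: `collect_glm_history` is a hypothetical helper name, and it assumes a running H2O cluster and a GLM fitted with lambda search and cross-validation, such as `glmCV` above. Whether the regularization path is available for every family is also an assumption.

```r
# Sketch only: `model` is assumed to be a GLM fitted with lambda_search = TRUE
# and nfolds > 0 on a running H2O cluster (e.g. `glmCV` from the example above).
collect_glm_history <- function(model) {
  list(
    # Scoring history of the final model, one row per recorded scoring event.
    scoring_history = as.data.frame(h2o::h2o.scoreHistory(model)),
    # Per-fold models from cross-validation, if they were kept.
    cv_models = h2o::h2o.cross_validation_models(model),
    # Lambdas and coefficients visited during the lambda search.
    regularization_path = h2o::h2o.getGLMFullRegularizationPath(model)
  )
}
```

With a live cluster, `collect_glm_history(glmCV)$scoring_history` would return the same table that `h2o.scoreHistory(glmCV)` prints, as a plain data frame that can be compared against the iteration count in the model summary.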

exalate-issue-sync[bot] commented 1 year ago

James Hirschorn commented: Hello Wendy,

I have returned to this project, and found that even after updating H2O (version 3.28.0.2) the issue I reported remains.

Your example works fine, but I give an MRE below where only a few iterations are shown:

genex <- h2o.importFile("https://f000.backblazeb2.com/file/host-jameshh/public-data/genex.csv")
genex[, 1] <- as.factor(genex[, 1])

glmCV <- h2o.glm(x = 2:ncol(genex), y = 'survival', training_frame = genex, family = 'multinomial', alpha = 1, lambda = NULL, lambda_search = TRUE, nlambda = 50, early_stopping = FALSE, nfolds = 6)

Then h2o.scoreHistory(glmCV) gives:

Scoring History:
            timestamp   duration iteration lambda predictors deviance_train deviance_test deviance_xval deviance_se
1 2020-02-24 00:23:37  0.000 sec         1  .24e0          3          2.128            NA         2.124       0.075
2 2020-02-24 00:23:38  1.015 sec         3  .22e0          6          2.076            NA         2.130       0.079
3 2020-02-24 00:23:39  2.113 sec         5   .2e0         10          1.995            NA         2.198       0.076
4 2020-02-24 00:23:41  4.434 sec         8  .18e0         13          1.850            NA         2.203       0.078

That is only a small number of the scores.

James
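To make the gap concrete, the recorded iterations can be compared against the full range. The sketch below uses a hypothetical helper, `missing_iterations` (not part of the h2o package), on a small data frame whose values are copied from the scoring history above. It only illustrates the check and says nothing about why H2O skips those rows.

```r
# Hypothetical helper (not part of the h2o package): list the iteration
# numbers in 1..total that have no row in a scoring-history data frame.
missing_iterations <- function(history, total = max(history$iteration)) {
  setdiff(seq_len(total), as.integer(history$iteration))
}

# Values copied from the scoring history shown above.
history <- data.frame(
  iteration = c(1L, 3L, 5L, 8L),
  lambda = c(0.24, 0.22, 0.20, 0.18),
  deviance_train = c(2.128, 2.076, 1.995, 1.850)
)

# Iterations up to the last recorded one that have no scoring row.
gaps <- missing_iterations(history)
print(gaps)
```

With a live cluster, the same helper could presumably be applied to `as.data.frame(h2o.scoreHistory(glmCV))`, passing the iteration count from the model summary as `total`.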

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: James:

I was working on this issue for a short time before being dragged off to other issues. Will return ASAP. Thanks, Wendy

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6441
Assignee: Wendy
Reporter: James Hirschorn
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A