h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O quasibinomial glm with 2 column response #8819

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I'm trying to set up a logistic regression in R with H2O GLM using two columns as the response. The two columns in my data frame correspond to the number of successes and failures, respectively. I can do this with R's built-in glm, but I'm having difficulty doing it with H2O's glm.

Based on the [documentation|http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/2/docs-website/h2o-docs/data-science/algo-params/interactions.html], I think this should be possible. It says:

{quote}For example, a typical predictor has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector, and ‘terms’ is a series of terms that specify a linear predictor for ‘response’. For ‘binomial’ and ‘quasibinomial’ families, the response can also be specified as a ‘factor’ (when the first level denotes failure and all other levels denote success) or as a two-column matrix with the columns giving the numbers of successes and failures.{quote}

Basically, what I'm trying to accomplish is the following operation, using R's built-in glm:

{code:java} fit1 <- glm(cbind(success, failure) ~ age + height, d.glm, family="quasibinomial") {code}
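For reference, here is a minimal self-contained sketch of what the two-column response means in base R. The data frame below is hypothetical toy data (the original d.glm is not shown in this thread), and the column names `total` and `prop` are introduced only for illustration. It shows that the `cbind(success, failure)` form gives the same coefficients as modelling the observed proportion with the number of trials as prior weights.

```r
# Hypothetical toy data; stands in for the d.glm used in the question.
d.glm <- data.frame(
  age     = c(25, 32, 41, 47, 53, 60),
  height  = c(170, 165, 180, 175, 160, 172),
  success = c(3, 5, 6, 8, 9, 12),
  failure = c(9, 8, 6, 5, 4, 2)
)

# Two-column response: successes and failures per row.
fit_cbind <- glm(cbind(success, failure) ~ age + height,
                 data = d.glm, family = quasibinomial)

# Single-column form: observed proportion, weighted by the trial count.
d.glm$total <- d.glm$success + d.glm$failure
d.glm$prop  <- d.glm$success / d.glm$total
fit_prop <- glm(prop ~ age + height, data = d.glm,
                family = quasibinomial, weights = total)

# Compare the fitted coefficients of the two formulations.
print(all.equal(coef(fit_cbind), coef(fit_prop)))
```

This only illustrates R's own glm; it does not by itself say anything about which response types h2o.glm accepts.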

In H2O, I've tried

{code:java}
predictors <- c("age", "height")
response <- c("success", "failure")
d.glm.h2o <- as.h2o(d.glm)

fit2 <- h2o.glm(family = "quasibinomial", x = predictors, y = response, training_frame = d.glm.h2o, remove_collinear_columns = TRUE, lambda = 0, compute_p_values = TRUE, standardize = TRUE)
{code}

The error I'm getting with the H2O code is:

{code:java} ERRR on field: _response_column: Response column 'c("success", "failure")' not found in the training frame

{code}

I would like to use H2O glm instead of R's glm in the future to make use of its regularization capabilities.

Thanks for helping.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: quasibinomial doesn’t support two columns as a response; none of the H2O algorithms support multivariate supervised learning.

[~accountid:5bd237b8dd3cc64b77e71676] please explain in detail what QB can be used for
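Since only a single response column is accepted, one possible workaround is to reshape the counts into a single 0/1 response with a weight column. The sketch below is not from the H2O team; it uses hypothetical toy data and hypothetical column names (`y`, `w`), and it only checks the idea in base R, where the weighted long format reproduces the coefficients of the two-column fit. Only the point estimates are compared here; the estimated dispersion, and therefore the standard errors, can differ between the two layouts.

```r
# Hypothetical toy data; stands in for the d.glm used in the question.
d.glm <- data.frame(
  age     = c(25, 32, 41, 47, 53, 60),
  height  = c(170, 165, 180, 175, 160, 172),
  success = c(3, 5, 6, 8, 9, 12),
  failure = c(9, 8, 6, 5, 4, 2)
)

# Long format: each original row becomes two rows, one for the outcome 1
# (weighted by the success count) and one for the outcome 0 (weighted by
# the failure count).
d.long <- rbind(
  data.frame(d.glm[c("age", "height")], y = 1, w = d.glm$success),
  data.frame(d.glm[c("age", "height")], y = 0, w = d.glm$failure)
)
d.long <- d.long[d.long$w > 0, ]  # drop rows that carry no observations

# Single-column response with weights.
fit_long <- glm(y ~ age + height, data = d.long,
                family = quasibinomial, weights = w)

# Reference fit with the two-column response.
fit_cbind <- glm(cbind(success, failure) ~ age + height,
                 data = d.glm, family = quasibinomial)

# Compare point estimates up to a small numerical tolerance.
print(all.equal(coef(fit_long), coef(fit_cbind), tolerance = 1e-6))
```

Assuming the same reshaping carries over, `d.long` could be uploaded with `as.h2o()` and passed to `h2o.glm()` with `y = "y"` and `weights_column = "w"`. That step is untested here, and whether the family should be "binomial" (with `y` converted to a factor) or "quasibinomial" is left open in this thread.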

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Also FYI, [~accountid:557058:6e44bc1a-dd50-499b-a331-2e049f28773b]: the documentation is incorrect. [~accountid:5bd237b8dd3cc64b77e71676] will share more details.

exalate-issue-sync[bot] commented 1 year ago

Veronika Maurerová commented: The text on the interactions parameter page ([http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/2/docs-website/h2o-docs/data-science/algo-params/interactions.html|http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/2/docs-website/h2o-docs/data-science/algo-params/interactions.html]) was copied from the R glm() documentation ([https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm|https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm]), and I don't think it applies to our GLM implementation. [~accountid:557058:24e3859e-abf7-4fba-bba9-b2c3b04ad5ed], you are the GLM expert, could you please take a look? There is also an associated open StackOverflow question: [https://stackoverflow.com/questions/57503263/h2o-quasibinomial-glm-with-2-column-response|https://stackoverflow.com/questions/57503263/h2o-quasibinomial-glm-with-2-column-response].

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6814
Assignee: Wendy Wong
Reporter: Jaaved Mohammed
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A