h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Verify GLM Binomial IRLSM implementation and p-value calculation, backward selection performance vs calling GLM directly #7080

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: https://stats.stackexchange.com/questions/89484/how-to-compute-the-standard-errors-of-a-logistic-regressions-coefficients

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Our model selection backward mode was able to generate the same coefficients and p-values of eliminated predictors as our competitor software. However, the same cannot be said for the binomial family. Hence, the job here is to check and make sure that, for the binomial family:

1. The GLM coefficients generated at each backward elimination stage match the competitor software.

2. The p-values of the eliminated predictors match the competitor software.

Instead of verifying both 1 and 2, we only need to verify 2, since the coefficients have to be the same to generate the same p-values.
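For readers who want to reproduce the comparison, here is a minimal sketch of a backward elimination loop that records the p-value of each eliminated predictor, which is the quantity item 2 compares. It assumes that each stage drops the predictor with the largest p-value; that rule, the function name `backward_eliminate`, and the `fit_p_values` callback are illustrative assumptions, not H2O's actual API. The fitting step is replaced by a stand-in with fixed p-values so the control flow can be run on its own.

```python
from typing import Callable, Dict, List, Tuple


def backward_eliminate(
    predictors: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    min_predictors: int = 1,
) -> List[Tuple[str, float]]:
    """Return (eliminated predictor, its p-value) for each elimination stage.

    fit_p_values is a hypothetical callback: it refits a GLM on the given
    predictors and returns one p-value per predictor.
    """
    active = list(predictors)
    eliminated = []
    while len(active) > min_predictors:
        p_values = fit_p_values(active)  # refit on the remaining predictors
        # Assumed rule: drop the least significant predictor at this stage.
        worst = max(active, key=lambda name: p_values[name])
        eliminated.append((worst, p_values[worst]))
        active.remove(worst)
    return eliminated


# Stand-in for a real GLM fit: fixed, made-up p-values that do not change
# as predictors are removed (a real refit would change them).
FAKE_P = {"x1": 0.001, "x2": 0.40, "x3": 0.03, "x4": 0.75}


def fake_fit(active: List[str]) -> Dict[str, float]:
    return {name: FAKE_P[name] for name in active}


stages = backward_eliminate(list(FAKE_P), fake_fit)
for name, p in stages:
    print(name, p)
```

Comparing the `stages` list from two tools stage by stage is enough to detect a mismatch: if the coefficients differ at any stage, the p-value of the eliminated predictor will differ as well.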

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Added a description of what is going on with H2O GLM Gaussian and Binomial family implementations.

Attachment: H2OGLMExplain (c6f1eb0b-d672-4d35-b099-1a354c46cbc3).pdf

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: There are two goals in this PR:

1. Make sure our p-value calculation matches the one in https://stats.stackexchange.com/questions/89484/how-to-compute-the-standard-errors-of-a-logistic-regressions-coefficients.

2. Make sure our binomial IRLSM implementation matches the one described in The Elements of Statistical Learning (ESL).

Regarding 1: I checked the covariance matrix calculation and found that we have implemented the same calculation in our p-value calculation. To prove that, I added a Java test that manually calculates the covariance matrix, the standard error, the z-value, and then the p-value. I then compared this calculation with the one in the GLM code, and they match well. I even used a different method to perform the equivalent of finding the inverse of the covariance matrix.

To do for 1:

Check that the p-value calculation still matches when standardize = false.
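As an illustration of the manual check described under "Regarding 1", the sketch below follows the calculation from the linked Stack Exchange post: the covariance matrix is the inverse of X^T W X with W = diag(p_i (1 - p_i)), the standard errors are the square roots of its diagonal, z = beta / se, and the two-sided p-value comes from the standard normal distribution. This is plain Python written for this thread, not the Java test from the PR; the function names and the toy data are assumptions.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wald_p_values(X, beta):
    """X: design rows including the intercept column; beta: fitted coefficients."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]  # accumulates X^T W X
    for row in X:
        eta = sum(b * x for b, x in zip(beta, row))
        p = 1.0 / (1.0 + math.exp(-eta))
        w = p * (1.0 - p)
        for i in range(k):
            for j in range(k):
                info[i][j] += w * row[i] * row[j]
    cov = invert(info)  # covariance matrix of the coefficients
    se = [math.sqrt(cov[i][i]) for i in range(k)]
    z = [b / s for b, s in zip(beta, se)]
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_values = [math.erfc(abs(v) / math.sqrt(2.0)) for v in z]
    return se, z, p_values


# Toy data (made up): intercept column plus one binary predictor.
X = [[1, 0], [1, 1], [1, 0], [1, 1]]
se, z, p_values = wald_p_values(X, [0.0, 0.0])
print(se, z, p_values)
```

With standardize = true the coefficients and the design matrix both live on the standardized scale, so a comparison against another tool needs both sides on the same scale before the standard errors are compared.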

Regarding 2: I have completed the manual implementation that derives the IRLSM coefficients using the formulae derived in ESL, for both standardize = false and standardize = true. The H2O implementation matches the one in the book well.

I went back and looked at the run process and discovered that line search was not enabled at any point. I gather that the only reason the models differ between H2O and the other tool must be the number of iterations run. If we can control the iterations, we may be able to get matching coefficients and p-values for the eliminated predictors.
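To make the point about iterations concrete, here is a minimal sketch of the binomial IRLS update as given in ESL (working response z = X beta + W^{-1}(y - p), then beta_new = (X^T W X)^{-1} X^T W z), with an explicit iteration cap. It assumes no regularization, no standardization, and no line search, and it starts from beta = 0. It is not H2O's implementation; the function names and the toy data are made up for illustration.

```python
import math


def solve(a, b):
    """Solve a x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def irls_binomial(X, y, max_iter):
    """Run exactly max_iter IRLS updates for logistic regression, starting from zero."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        xtwx = [[0.0] * k for _ in range(k)]
        xtwz = [0.0] * k
        for row, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            z = eta + (yi - p) / w  # working response
            for i in range(k):
                xtwz[i] += w * row[i] * z
                for j in range(k):
                    xtwx[i][j] += w * row[i] * row[j]
        beta = solve(xtwx, xtwz)  # weighted least squares step
    return beta


# Toy data (made up): intercept plus one binary predictor,
# 1 of 4 successes when x = 0 and 3 of 4 successes when x = 1.
X = [[1, 0]] * 4 + [[1, 1]] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0]

after_one = irls_binomial(X, y, max_iter=1)
converged = irls_binomial(X, y, max_iter=25)
print(after_one, converged)
```

On this toy data the coefficients after a single update differ visibly from the converged ones, so two tools that stop at different iteration counts would report different coefficients and therefore different p-values, even with identical update formulas.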

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: This is actually a great test to use! I wish I had come up with it.

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8585
Assignee: Wendy Wong
Reporter: Wendy Wong
State: Resolved
Fix Version: 3.36.1.1
Attachments: Available (Count: 1)
Development PRs: Available

h2o-ops commented 1 year ago

Attachments From Jira

Attachment Name: H2OGLMExplain (c6f1eb0b-d672-4d35-b099-1a354c46cbc3).pdf
Attached By: Wendy Wong
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-8585/H2OGLMExplain (c6f1eb0b-d672-4d35-b099-1a354c46cbc3).pdf

h2o-ops commented 1 year ago

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/6096