h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add likelihood ratio, score tests and wald tests for GLMs(logistic regression) #15703

Open karthikkannappan opened 10 months ago

karthikkannappan commented 10 months ago

Would like to see model statistics like the likelihood ratio test, score test, and Wald test reported for logistic regression as well. We already report these statistics for the CoxPH estimator (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/coxph.html#model-statistics), and there are customers for whom it would be useful to have them for logistic regression too.

[Alternatives] The likelihood ratio test is easy to compute after the fact: build a null model, then use the log-likelihood of that model together with the log-likelihood of the full model with the fitted coefficients. It would still be nicer to have these model statistics as part of the native GLM implementation.

wendycwong commented 8 months ago

Please break this into multiple issues.

wendycwong commented 4 months ago

Regarding the Wald test: it is basically the z-score, i.e. the value before we look up the p-value. I believe we already calculate this when a user sets compute_p_values = True. Hence, there is nothing that needs to be done in this case. However, please add documentation for it. You can derive the documentation change from this paragraph:

(two screenshots of the referenced paragraph)
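To make the "Wald test is basically the z-score" point concrete, here is a minimal NumPy/SciPy sketch of the per-coefficient Wald test (the helper name is hypothetical; this is not the H2O backend code, just the textbook computation from a coefficient and its standard error):

```python
import numpy as np
from scipy import stats

def wald_z_and_p(beta_j, se_j):
    """Per-coefficient Wald test: z = beta_j / se(beta_j),
    with a two-sided p-value from the standard normal distribution."""
    z = beta_j / se_j
    p = 2.0 * stats.norm.sf(abs(z))  # P(|Z| > |z|)
    return z, p
```

If I recall correctly, these are the same quantities H2O already reports in the GLM coefficients table when compute_p_values = True, so the work here is mostly documentation.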

wendycwong commented 4 months ago

These youtube videos are very useful:

- https://www.youtube.com/watch?v=TFKbyXAfr1M (Wald test)
- https://www.youtube.com/watch?v=Ck7EChMRQ9o (score test)
- https://www.youtube.com/watch?v=Tn5y2i_MqQ8 (likelihood ratio test)

wendycwong commented 4 months ago

Likelihood Ratio Test

H0: the coefficient of the GLM model is beta_H0.

The task is now to figure out if the beta_H0 in Hypothesis H0 is acceptable. One way to do this is to use the likelihood ratio test.

The likelihood ratio statistic is LR = 2*(loglikelihood(beta_ML) - loglikelihood(beta_H0)) ~ Chi-square with q degrees of freedom, where q is the number of predictors.

where beta_H0 is a coefficient vector the user is interested in. Basically, this test is used to determine whether the beta_H0 in hypothesis H0 is close enough to the maximum likelihood estimate beta_ML.

You can find loglikelihood(beta_ML) by running our GLM algorithm with calc_like = True. For the log-likelihood at beta_H0, you don't have to run the whole GLM model again; you just need to evaluate the log-likelihood at beta = beta_H0. Note that beta_H0 need not correspond to the null model; it can be anything the user wants it to be.
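For the binomial (logistic regression) family, the steps above can be sketched in plain NumPy/SciPy (function names are illustrative, not the H2O backend; it assumes you already have beta_ML from a fitted model, a user-supplied beta_H0, and a design matrix X that includes the intercept column):

```python
import numpy as np
from scipy import stats

def logistic_loglik(beta, X, y):
    """Log-likelihood of a logistic regression model at coefficient
    vector beta. X includes an intercept column; y is 0/1."""
    eta = X @ beta
    # log L = sum(y*eta - log(1 + exp(eta))), computed stably
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def likelihood_ratio_test(beta_ml, beta_h0, X, y):
    """LR = 2*(loglik(beta_ML) - loglik(beta_H0)) ~ chi-square(q);
    here q is taken as the length of the tested coefficient vector."""
    lr = 2.0 * (logistic_loglik(beta_ml, X, y)
                - logistic_loglik(beta_h0, X, y))
    p_value = stats.chi2.sf(lr, df=len(beta_ml))
    return lr, p_value
```

Note that only the second log-likelihood evaluation involves beta_H0; no refit is needed, which matches the point made above.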

wendycwong commented 4 months ago

Wald Test

Again, using the same setup as the likelihood ratio test, we want to know if we can accept the hypothesis H0.

In this case, we use the Wald statistic w = transpose(beta_ML - beta_H0) * inverse(Var(beta_ML)) * (beta_ML - beta_H0).

Note that if you set compute_p_values = True, you will have calculated inverse(Var(beta_ML)) by the end of the GLM run; I think it is related to the standard errors (but I am not 100% sure, please double-check). So this one should be easy to calculate once you have the inverse variance.
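A minimal NumPy sketch of the joint Wald statistic for logistic regression, under the standard assumption that inverse(Var(beta_ML)) is the Fisher information X^T W X with W = diag(p*(1-p)); names are illustrative and this is not the H2O backend:

```python
import numpy as np
from scipy import stats

def wald_test(beta_ml, beta_h0, X):
    """Joint Wald statistic w = (b_ML - b_H0)^T I(b_ML) (b_ML - b_H0),
    where I(b_ML) = X^T W X is the logistic-regression Fisher
    information, i.e. the inverse of the coefficient covariance."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_ml)))  # fitted probabilities
    W = p * (1.0 - p)                          # diagonal weights
    info = X.T @ (W[:, None] * X)              # Fisher information
    diff = beta_ml - beta_h0
    w = float(diff @ info @ diff)
    p_value = stats.chi2.sf(w, df=len(beta_ml))
    return w, p_value
```

As noted above, when compute_p_values = True this information matrix (or its inverse) should already be available at the end of the GLM run, so no extra model pass is required.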

wendycwong commented 4 months ago

Score Test

Using the same hypothesis as in the likelihood ratio test, we want to evaluate H0, this time with the score test.

The score statistic is S = transpose(U) * inverse(I) * U, where U is the gradient of the log-likelihood at beta_H0 and I is the Fisher information (the variance of the score) at beta_H0.

Even though we don't need beta_ML here, you may still need to run the GLM model to get the dispersion parameter estimate.

Next, you will need to estimate the standard errors with beta set to beta_H0 rather than beta_ML. You may have to extract the part of the Java backend code that estimates the squared standard error (i.e., the variance) in the compute_p_values = True path.
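For logistic regression the score and the information at beta_H0 have closed forms, so the evaluation above can be sketched as follows (plain NumPy/SciPy, illustrative names, not the H2O backend; the dispersion is fixed at 1 for the binomial family, so no extra fit is needed in this sketch):

```python
import numpy as np
from scipy import stats

def score_test(beta_h0, X, y):
    """Score statistic S = U^T I^{-1} U, evaluated at beta_H0:
    U = X^T (y - p0) is the log-likelihood gradient and
    I = X^T W X the Fisher information, both at beta_H0."""
    p0 = 1.0 / (1.0 + np.exp(-(X @ beta_h0)))  # probs under H0
    U = X.T @ (y - p0)                          # score (gradient)
    W = p0 * (1.0 - p0)
    info = X.T @ (W[:, None] * X)               # information at beta_H0
    s = float(U @ np.linalg.solve(info, U))
    p_value = stats.chi2.sf(s, df=len(beta_h0))
    return s, p_value
```

Note that, unlike the Wald test, everything here is evaluated at beta_H0, so the full model never needs to be fitted; this mirrors the point about reusing the variance-estimation code at beta_H0 instead of beta_ML.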

wendycwong commented 4 months ago

Since we are relying on maximum likelihood estimation, we will not allow any regularization in this case.