h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GLM std-error calculation, disable and enable ADMM and check effect on error calculation for Tweedie #9169

Closed: exalate-issue-sync[bot] closed this issue 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Two tasks:

  1. Figure out how std_error is calculated.
  2. Generate a synthetic dataset and get coefficients with and without ADMM.
exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Info from Nidhi: found this - http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_genmod_examples12.htm (see the dispersion parameter).

Do we/SAS set a categorical reference level? This would be good to check - https://stackoverflow.com/questions/44577998/standard-errors-discrepancies-between-sas-and-r-for-glm-gamma-distribution. Did you have this paper in mind? (edited)

https://stanford.edu/class/ee367/reading/admm_distr_stats.pdf
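
For reference, here is a minimal sketch in plain R (made-up gamma data, base glm, not H2O's code) of how GLM standard errors follow from the dispersion parameter discussed in the SAS documentation above: the coefficient covariance is the dispersion estimate times the inverse of the weighted Gram matrix X'WX.

```r
# Minimal sketch, assuming made-up gamma data; not H2O's implementation.
# Standard errors are sqrt(diag(phi * solve(X'WX))), where phi is the dispersion
# estimate and W holds the IRLS working weights at convergence.
set.seed(1)
d   <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- rgamma(500, shape = 2, rate = 2 / exp(0.3 * d$x1 - 0.2 * d$x2))

fit  <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = d)
phi  <- summary(fit)$dispersion                           # Pearson-based dispersion estimate
XtWX <- crossprod(sqrt(fit$weights) * model.matrix(fit))  # X'WX with the working weights
sqrt(phi * diag(solve(XtWX)))                             # matches summary(fit)'s "Std. Error" column
```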

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: In addition, we model our GLM with R GLMNet. We need to compare our results with theirs at the end.
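
A rough sketch of that kind of cross-check, assuming a gaussian response and made-up columns (all names below are illustrative; glmnet does not cover every family H2O supports):

```r
# Hedged sketch of an H2O-vs-glmnet coefficient comparison on synthetic data.
library(h2o)
library(glmnet)
h2o.init()

set.seed(42)
d   <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 1 + 0.5 * d$x1 - 0.25 * d$x2 + rnorm(300, sd = 0.1)

h2o_fit <- h2o.glm(x = c("x1", "x2"), y = "y", training_frame = as.h2o(d),
                   family = "gaussian", alpha = 1, lambda = 0.01)
r_fit   <- glmnet(as.matrix(d[, c("x1", "x2")]), d$y, alpha = 1, lambda = 0.01)

cbind(h2o    = h2o.coef(h2o_fit)[c("Intercept", "x1", "x2")],
      glmnet = as.numeric(coef(r_fit, s = 0.01)))   # estimates should be comparable
```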

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: When compute_p_values is enabled, ADMM is disabled. So, we are fine here.
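
For reference, this is the mode in question from the R API; column names below are illustrative. Requesting p-values forces lambda = 0 (no regularization) and, per the note above, the IRLSM solve then runs without ADMM.

```r
# Fit an H2O GLM with std_error / z_value / p_value reported.
library(h2o)
h2o.init()

d   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rgamma(200, shape = 2, rate = 2 / exp(0.4 * d$x1))
hdf <- as.h2o(d)

fit <- h2o.glm(x = c("x1", "x2"), y = "y", training_frame = hdf,
               family = "gamma",
               lambda = 0,                      # no regularization, required for p-values
               compute_p_values = TRUE,         # adds std_error, z_value, p_value
               remove_collinear_columns = TRUE) # keep the Gram matrix invertible
fit@model$coefficients_table
```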

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Found a bug in qrCholesky: random numeric columns are deemed correlated for some reason.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Found the bug: zjj*rs_tot can be negative. Added a Math.abs equivalent to prevent that.
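
A small R sketch of the numerical issue being described (illustrative only, not H2O's qrCholesky code): in a Cholesky-style factorization the diagonal pivot can come out slightly negative from round-off, and guarding it with an absolute value, the Math.abs equivalent mentioned above, keeps the square root defined.

```r
# Plain lower-triangular Cholesky with an abs() guard on the pivot.
cholesky_lower <- function(A, eps = 1e-12) {
  n <- nrow(A)
  L <- matrix(0, n, n)
  for (j in seq_len(n)) {
    pivot <- A[j, j] - sum(L[j, seq_len(j - 1)]^2)
    L[j, j] <- sqrt(max(abs(pivot), eps))   # guard: round-off can push the pivot below zero
    if (j < n) {
      for (i in (j + 1):n) {
        L[i, j] <- (A[i, j] - sum(L[i, seq_len(j - 1)] * L[j, seq_len(j - 1)])) / L[j, j]
      }
    }
  }
  L
}

A <- crossprod(matrix(rnorm(20), 5, 4))                   # a positive semi-definite test matrix
max(abs(cholesky_lower(A) %*% t(cholesky_lower(A)) - A))  # ~0
```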

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Figured out how QR-Cholesky works now.

For this gram:

gram = [ 1            0.502460093    0.76758847   -0.131612968  -0.449094117
         0.502460093  3301.476074  -33.24230345    39.31732464  -54.48851366
         0.76758847   -33.24230345  3324.55848     26.02082518   20.55799099
        -0.131612968   39.31732464   26.02082518  3322.824007    11.50847351
        -0.449094117  -54.48851366   20.55799099    11.50847351  3365.52507 ];

Our QR-Cholesky was able to come up with a correct solution where r*r' = gram:

r = [  1            0             0            0            0
       0.502460093  57.45627562   0            0            0
       0.76758847   -0.5852796    57.65090403  0            0
      -0.131612968   0.685450884  0.460062694  57.63787977  0
      -0.449094117  -0.944420104  0.352985976  0.207056971  58.00227567 ];

I have verified this calculation with Octave.

Basically, if we let X = QR, then X^T X = R^T Q^T Q R = R^T R. We need to find R.
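
The same check can be reproduced in base R, mirroring the Octave verification above: chol() returns the upper-triangular factor U with U'U = gram, so t(U) is the lower-triangular r listed above.

```r
# Verify the factorization of the gram matrix above with base R's chol().
gram <- matrix(c( 1,             0.502460093,    0.76758847,   -0.131612968,  -0.449094117,
                  0.502460093, 3301.476074,    -33.24230345,   39.31732464,  -54.48851366,
                  0.76758847,   -33.24230345, 3324.55848,      26.02082518,   20.55799099,
                 -0.131612968,   39.31732464,   26.02082518, 3322.824007,     11.50847351,
                 -0.449094117,  -54.48851366,   20.55799099,   11.50847351, 3365.52507),
               nrow = 5, byrow = TRUE)

U <- chol(gram)                # upper triangular, gram = t(U) %*% U
max(abs(t(U) %*% U - gram))    # ~0, the factor reproduces gram
round(t(U), 6)                 # lower-triangular factor; matches r above
```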

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Here is where I am with this JIRA. After my fixes, the R and H2O model coefficients agree on the prostate dataset. Here are the Runit test results:

[1] "Compare H2O, R GLM model coefficients and standard error for var_power=1, link_power=0" [1] "Define formula for R" |======================================================================| 100% [1] "H2O GLM model...." Model Details:

H2ORegressionModel: glm Model ID: GLM_model_R_1558383867244_1 GLM Model: summary family link regularization number_of_predictors_total 1 tweedie tweedie None 7 number_of_active_predictors number_of_iterations training_frame 1 7 5 hdf

Coefficients: glm coefficients names coefficients std_error z_value p_value 1 Intercept -3.695685 0.035076 -105.360874 0.000000 2 AGE -0.007275 0.000420 -17.309447 0.000000 3 RACE -0.181603 0.009310 -19.505239 0.000000 4 DPROS 0.230095 0.002814 81.773116 0.000000 5 DCAPS 0.051499 0.007321 7.034340 0.000000 6 PSA 0.003128 0.000104 29.953570 0.000000 7 VOL -0.007237 0.000167 -43.386492 0.000000 8 GLEASON 0.431405 0.002795 154.344552 0.000000 standardized_coefficients 1 -1.109834 2 -0.046729 3 -0.053435 4 0.229443 5 0.015879 6 0.062125 7 -0.133039 8 0.470299

H2ORegressionMetrics: glm Reported on training data.

MSE:  0.1864312
RMSE:  0.4317767
MAE:  0.3769202
RMSLE:  0.2951394
Mean Residual Deviance:  0.5722311
R^2:  0.2242269
Null Deviance:  141064.9
Null D.o.F.:  192511
Residual Deviance:  110161.3
Residual D.o.F.:  192504
AIC:  NaN

[1] "R GLM model...."

Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)

Deviance Residuals: Min 1Q Median 3Q Max
-1.6289 -0.7556 -0.5620 0.5510 1.5293

Coefficients:
              Estimate  Std. Error  t value              Pr(>|t|)
(Intercept) -3.6956847   0.0350766 -105.360  < 0.0000000000000002 ***
AGE         -0.0072755   0.0004203  -17.309  < 0.0000000000000002 ***
RACE        -0.1816025   0.0093105  -19.505  < 0.0000000000000002 ***
DCAPS        0.0514992   0.0073212    7.034      0.00000000000201 ***
PSA          0.0031275   0.0001044   29.953  < 0.0000000000000002 ***
VOL         -0.0072373   0.0001668  -43.386  < 0.0000000000000002 ***
DPROS        0.2300954   0.0028138   81.773  < 0.0000000000000002 ***
GLEASON      0.4314049   0.0027951  154.344  < 0.0000000000000002 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Tweedie family taken to be 0.5347984)

Null deviance: 141065  on 192511  degrees of freedom

Residual deviance: 110161 on 192504 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: When var_power = 1 and link_power = 0, the R and H2O models agree on the coefficients and the standard error calculation.
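
A condensed sketch of how this comparison can be set up (not the exact Runit test; `df`/`hdf` stand for the same data as an R data.frame and an H2O frame, and `x`, `formula`, `"response"` are placeholders mirroring the test script):

```r
# Hedged sketch: fit the same Tweedie GLM in H2O and in R (statmod's tweedie
# family, as in the Call shown above) and compare coefficients and std errors.
library(h2o)
library(statmod)
vpower <- 1
lpower <- 0

h2o_fit <- h2o.glm(x = x, y = "response", training_frame = hdf, family = "tweedie",
                   tweedie_variance_power = vpower, tweedie_link_power = lpower,
                   lambda = 0, compute_p_values = TRUE, remove_collinear_columns = TRUE)
r_fit   <- glm(formula, data = df[, x], na.action = na.omit,
               family = tweedie(var.power = vpower, link.power = lpower))

h2o_fit@model$coefficients_table[, c("names", "coefficients", "std_error")]
summary(r_fit)$coefficients[, c("Estimate", "Std. Error")]
```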

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Here is another test result:

[1] "Compare H2O, R GLM model coefficients and standard error for var_power=0, link_power=1" [1] "Define formula for R" |======================================================================| 100% [1] "H2O GLM model...." Model Details:

H2ORegressionModel: glm Model ID: GLM_model_R_1558383867244_2 GLM Model: summary family link regularization number_of_predictors_total 1 tweedie tweedie None 7 number_of_active_predictors number_of_iterations training_frame 1 7 1 hdf

Coefficients: glm coefficients names coefficients std_error z_value p_value 1 Intercept -0.603554 0.006927 -87.132661 0.000000 2 AGE -0.002527 0.000087 -29.131713 0.000000 3 RACE -0.092453 0.002272 -40.694922 0.000000 4 DPROS 0.091915 0.000548 167.619524 0.000000 5 DCAPS 0.096884 0.001444 67.082910 0.000000 6 PSA 0.003675 0.000024 152.808749 0.000000 7 VOL -0.001958 0.000034 -57.911737 0.000000 8 GLEASON 0.146109 0.000506 288.974366 0.000000 standardized_coefficients 1 0.401596 2 -0.016229 3 -0.027203 4 0.091654 5 0.029872 6 0.072997 7 -0.035991 8 0.159281

H2ORegressionMetrics: glm Reported on training data.

MSE:  0.1733769
RMSE:  0.4163855
MAE:  0.3640696
RMSLE:  0.3003287
Mean Residual Deviance:  0.1733769
R^2:  0.278548
Null Deviance:  46263.83
Null D.o.F.:  192511
Residual Deviance:  33377.13
Residual D.o.F.:  192504
AIC:  NaN

[1] "R GLM model...."

Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)

Deviance Residuals: Min 1Q Median 3Q Max
-0.9060 -0.3159 -0.1299 0.4117 0.9540

Coefficients:
               Estimate   Std. Error  t value             Pr(>|t|)
(Intercept) -0.60355374   0.01217832   -49.56  <0.0000000000000002 ***
AGE         -0.00252677   0.00014929   -16.93  <0.0000000000000002 ***
RACE        -0.09245281   0.00328654   -28.13  <0.0000000000000002 ***
DCAPS        0.09688415   0.00333927    29.01  <0.0000000000000002 ***
PSA          0.00367488   0.00005377    68.34  <0.0000000000000002 ***
VOL         -0.00195788   0.00005264   -37.20  <0.0000000000000002 ***
DPROS        0.09191481   0.00101050    90.96  <0.0000000000000002 ***
GLEASON      0.14610869   0.00097625   149.66  <0.0000000000000002 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Tweedie family taken to be 0.1733841)

Null deviance: 46264  on 192511  degrees of freedom

Residual deviance: 33377 on 192504 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 2

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: When var_power = 0 and link_power = 1, the coefficients agree but the standard errors do not.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Here is the last case:

[1] "Compare H2O, R GLM model coefficients and standard error for var_power=2, link_power=0" [1] "Define formula for R" |======================================================================| 100% [1] "H2O GLM model...." Model Details:

H2ORegressionModel: glm Model ID: GLM_model_R_1558383867244_3 GLM Model: summary family link regularization number_of_predictors_total 1 tweedie tweedie None 7 number_of_active_predictors number_of_iterations training_frame 1 7 3 RTMP_sid_a849_2

Coefficients: glm coefficients names coefficients std_error z_value p_value 1 Intercept -0.388849 0.008735 -44.514699 0.000000 2 AGE -0.001598 0.000107 -14.944006 0.000000 3 RACE -0.065185 0.002372 -27.481087 0.000000 4 DPROS 0.069614 0.000731 95.278378 0.000000 5 DCAPS 0.051713 0.002387 21.660020 0.000000 6 PSA 0.002391 0.000039 60.824821 0.000000 7 VOL -0.001447 0.000038 -38.169044 0.000000 8 GLEASON 0.103109 0.000670 153.933204 0.000000 standardized_coefficients 1 0.321019 2 -0.010262 3 -0.019180 4 0.069417 5 0.015944 6 0.047497 7 -0.026600 8 0.112405

H2ORegressionMetrics: glm Reported on training data.

MSE:  0.1750212
RMSE:  0.4183554
MAE:  0.3661164
RMSLE:  0.1693754
Mean Residual Deviance:  0.08530753
R^2:  0.2717056
Null Deviance:  22811.3
Null D.o.F.:  192511
Residual Deviance:  16422.72
Residual D.o.F.:  192504
AIC:  NaN

[1] "R GLM model...."

Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)

Deviance Residuals: Min 1Q Median 3Q Max
-0.6156 -0.2486 -0.1244 0.2492 0.7001

Coefficients:
               Estimate   Std. Error  t value             Pr(>|t|)
(Intercept) -0.38886083   0.00877417   -44.32  <0.0000000000000002 ***
AGE         -0.00159777   0.00010756   -14.85  <0.0000000000000002 ***
RACE        -0.06518412   0.00236787   -27.53  <0.0000000000000002 ***
DCAPS        0.05171215   0.00240586    21.49  <0.0000000000000002 ***
PSA          0.00239114   0.00003874    61.72  <0.0000000000000002 ***
VOL         -0.00144703   0.00003792   -38.16  <0.0000000000000002 ***
DPROS        0.06961443   0.00072804    95.62  <0.0000000000000002 ***
GLEASON      0.10311169   0.00070336   146.60  <0.0000000000000002 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Tweedie family taken to be 0.0900009)

Null deviance: 22811  on 192511  degrees of freedom

Residual deviance: 16423 on 192504 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: When var_power = 2 and link_power = 0, the coefficients agree but the standard errors do not. However, they are closer than in the var_power = 0 case.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: I noticed that when the relationship link_power = 1 - var_power holds, my computation and R's agree. However, when that relationship no longer holds, the coefficients found by my algorithm and R start to diverge. I implemented this from first principles without any approximation, while R may use some form of approximation to the Hessian and gradient. As we move away from the relationship link_power = 1 - var_power, that approximation diverges from the true Hessian and gradient calculations.

There are places where, to avoid 1/0, I choose to do 1/1e-6 instead. These kinds of arbitrary substitutions make algorithm comparisons difficult.
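
For illustration, the kind of substitution being described (a hypothetical helper, not H2O's code):

```r
# Replace a zero denominator with a small epsilon before inverting.
safe_reciprocal <- function(x, eps = 1e-6) 1 / ifelse(x == 0, eps, x)

safe_reciprocal(c(2, 0, -4))   # 0.5  1e+06  -0.25
```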

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6457
Assignee: Wendy
Reporter: Wendy
State: Resolved
Fix Version: 3.24.0.5
Attachments: Available (Count: 1)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/3515
https://github.com/h2oai/h2o-3/pull/3516

Attachments From Jira

Attachment Name: GLMStdErrorCal.pdf
Attached By: Wendy
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6457/GLMStdErrorCal.pdf