Wendy commented: Info from Nidhi: found this - http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_genmod_examples12.htm (see the dispersion parameter).
Do we/SAS set a categorical reference level? This will be good to check: https://stackoverflow.com/questions/44577998/standard-errors-discrepancies-between-sas-and-r-for-glm-gamma-distribution. Did you have this paper in mind?
https://stanford.edu/class/ee367/reading/admm_distr_stats.pdf
Wendy commented: In addition, we fit our GLM with R's glmnet. We need to compare our results with theirs at the end.
Wendy commented: When compute_p_values is enabled, ADMM is disabled. So, we are fine here.
Wendy commented: Found bug in qrCholesky: random numeric columns are deemed correlated for some reason.
Wendy commented: Found the bug: zjj*rs_tot can be negative. Added Math.abs equivalent to prevent that.
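The failure mode can be illustrated with a plain Cholesky sketch (hypothetical code, not the actual H2O qrCholesky; zjj and rs_tot presumably play the roles of the diagonal entry and the accumulated sum below). In exact arithmetic the quantity under the square root is non-negative, but floating-point cancellation can push it slightly below zero and produce NaN, which would make independent columns look degenerate; an abs() guard like the Math.abs fix above prevents that.

```python
import math

def cholesky_lower(a):
    """Textbook Cholesky factorization returning lower-triangular L with
    L * L' = a. The argument of sqrt (diagonal entry minus accumulated
    squares) is >= 0 in exact arithmetic but can dip below zero from
    round-off; abs() guards it, mirroring the Math.abs fix described above."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(abs(diag))  # abs() guard against round-off
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L
```

For [[4, 2], [2, 3]] this returns L = [[2, 0], [1, sqrt(2)]]; without the guard, a nearly singular Gram matrix can make `diag` a tiny negative number and poison the whole factorization with NaN.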
Wendy commented: Figure out how qr-cholesky works now
for this gram matrix:
gram = [ 1            0.502460093   0.76758847  -0.131612968  -0.449094117
         0.502460093  3301.476074  -33.24230345  39.31732464  -54.48851366
         0.76758847  -33.24230345   3324.55848   26.02082518   20.55799099
        -0.131612968  39.31732464   26.02082518  3322.824007   11.50847351
        -0.449094117 -54.48851366   20.55799099  11.50847351   3365.52507 ];
our qr cholesky was able to come up with a correct solution where r*r'=gram.
r = [1 0 0 0 0
0.502460093 57.45627562 0 0 0
0.76758847 -0.5852796 57.65090403 0 0
-0.131612968 0.685450884 0.460062694 57.63787977 0
-0.449094117 -0.944420104 0.352985976 0.207056971 58.00227567];
I have verified this calculation with Octave.
Basically, if we let X = QR with Q orthogonal, then X'X = R'Q'QR = R'R since Q'Q = I. We need to find R; the lower-triangular r above is R', so r*r' = gram.
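The factorization above can be checked independently of the H2O code; a quick NumPy sketch reproducing the Octave verification:

```python
import numpy as np

# Gram matrix (X'X) from the comment above
gram = np.array([
    [ 1.0,          0.502460093,   0.76758847,  -0.131612968, -0.449094117],
    [ 0.502460093,  3301.476074,  -33.24230345,  39.31732464, -54.48851366],
    [ 0.76758847,  -33.24230345,   3324.55848,   26.02082518,  20.55799099],
    [-0.131612968,  39.31732464,   26.02082518,  3322.824007,  11.50847351],
    [-0.449094117, -54.48851366,   20.55799099,  11.50847351,  3365.52507 ],
])

# Lower-triangular factor produced by qr-cholesky (this is R')
r = np.array([
    [ 1.0,          0.0,          0.0,         0.0,         0.0        ],
    [ 0.502460093,  57.45627562,  0.0,         0.0,         0.0        ],
    [ 0.76758847,  -0.5852796,    57.65090403, 0.0,         0.0        ],
    [-0.131612968,  0.685450884,  0.460062694, 57.63787977, 0.0        ],
    [-0.449094117, -0.944420104,  0.352985976, 0.207056971, 58.00227567],
])

# r * r' should reproduce the Gram matrix; the tolerance reflects the
# printed precision of the entries above
print(np.allclose(r @ r.T, gram, atol=1e-3))  # → True
```

Since the Cholesky factor with positive diagonal is unique, `np.linalg.cholesky(gram)` recovers the same r to within printing precision.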
Wendy commented: Here is where I am with this JIRA. After my fixes, the coefficients from the R and H2O models agree on the prostate dataset. Here are the Runit test results:
H2ORegressionModel: glm
Model ID: GLM_model_R_1558383867244_1
GLM Model: summary
    family   link     regularization  number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
1   tweedie  tweedie  None            7                           7                            5                     hdf
Coefficients: glm coefficients
    names      coefficients  std_error  z_value      p_value   standardized_coefficients
1   Intercept  -3.695685     0.035076   -105.360874  0.000000  -1.109834
2   AGE        -0.007275     0.000420   -17.309447   0.000000  -0.046729
3   RACE       -0.181603     0.009310   -19.505239   0.000000  -0.053435
4   DPROS       0.230095     0.002814    81.773116   0.000000   0.229443
5   DCAPS       0.051499     0.007321     7.034340   0.000000   0.015879
6   PSA         0.003128     0.000104    29.953570   0.000000   0.062125
7   VOL        -0.007237     0.000167   -43.386492   0.000000  -0.133039
8   GLEASON     0.431405     0.002795   154.344552   0.000000   0.470299
H2ORegressionMetrics: glm Reported on training data.
MSE: 0.1864312  RMSE: 0.4317767  MAE: 0.3769202  RMSLE: 0.2951394
Mean Residual Deviance: 0.5722311  R^2: 0.2242269
Null Deviance: 141064.9  Null D.o.F.: 192511
Residual Deviance: 110161.3  Residual D.o.F.: 192504  AIC: NaN
[1] "R GLM model...."
Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6289 -0.7556 -0.5620 0.5510 1.5293
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 0.5347984)
Null deviance: 141065 on 192511 degrees of freedom
Residual deviance: 110161 on 192504 degrees of freedom AIC: NA
Number of Fisher Scoring iterations: 5
Wendy commented: When var_power = 1 and link_power = 0, the R and H2O models agree on the coefficients and the standard error calculation.
Wendy commented: Here is another test result:
H2ORegressionModel: glm
Model ID: GLM_model_R_1558383867244_2
GLM Model: summary
    family   link     regularization  number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
1   tweedie  tweedie  None            7                           7                            1                     hdf
Coefficients: glm coefficients
    names      coefficients  std_error  z_value      p_value   standardized_coefficients
1   Intercept  -0.603554     0.006927   -87.132661   0.000000   0.401596
2   AGE        -0.002527     0.000087   -29.131713   0.000000  -0.016229
3   RACE       -0.092453     0.002272   -40.694922   0.000000  -0.027203
4   DPROS       0.091915     0.000548   167.619524   0.000000   0.091654
5   DCAPS       0.096884     0.001444    67.082910   0.000000   0.029872
6   PSA         0.003675     0.000024   152.808749   0.000000   0.072997
7   VOL        -0.001958     0.000034   -57.911737   0.000000  -0.035991
8   GLEASON     0.146109     0.000506   288.974366   0.000000   0.159281
H2ORegressionMetrics: glm Reported on training data.
MSE: 0.1733769  RMSE: 0.4163855  MAE: 0.3640696  RMSLE: 0.3003287
Mean Residual Deviance: 0.1733769  R^2: 0.278548
Null Deviance: 46263.83  Null D.o.F.: 192511
Residual Deviance: 33377.13  Residual D.o.F.: 192504  AIC: NaN
[1] "R GLM model...."
Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9060 -0.3159 -0.1299 0.4117 0.9540
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 0.1733841)
Null deviance: 46264 on 192511 degrees of freedom
Residual deviance: 33377 on 192504 degrees of freedom AIC: NA
Number of Fisher Scoring iterations: 2
Wendy commented: When var_power = 0 and link_power = 1, the coefficients agree but the standard errors do not.
Wendy commented: Here is the last case:
H2ORegressionModel: glm
Model ID: GLM_model_R_1558383867244_3
GLM Model: summary
    family   link     regularization  number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
1   tweedie  tweedie  None            7                           7                            3                     RTMP_sid_a849_2
Coefficients: glm coefficients
    names      coefficients  std_error  z_value      p_value   standardized_coefficients
1   Intercept  -0.388849     0.008735   -44.514699   0.000000   0.321019
2   AGE        -0.001598     0.000107   -14.944006   0.000000  -0.010262
3   RACE       -0.065185     0.002372   -27.481087   0.000000  -0.019180
4   DPROS       0.069614     0.000731    95.278378   0.000000   0.069417
5   DCAPS       0.051713     0.002387    21.660020   0.000000   0.015944
6   PSA         0.002391     0.000039    60.824821   0.000000   0.047497
7   VOL        -0.001447     0.000038   -38.169044   0.000000  -0.026600
8   GLEASON     0.103109     0.000670   153.933204   0.000000   0.112405
H2ORegressionMetrics: glm Reported on training data.
MSE: 0.1750212  RMSE: 0.4183554  MAE: 0.3661164  RMSLE: 0.1693754
Mean Residual Deviance: 0.08530753  R^2: 0.2717056
Null Deviance: 22811.3  Null D.o.F.: 192511
Residual Deviance: 16422.72  Residual D.o.F.: 192504  AIC: NaN
[1] "R GLM model...."
Call: glm(formula = formula, family = tweedie(var.power = vpower, link.power = lpower), data = df[, x], na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6156 -0.2486 -0.1244 0.2492 0.7001
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 0.0900009)
Null deviance: 22811 on 192511 degrees of freedom
Residual deviance: 16423 on 192504 degrees of freedom AIC: NA
Number of Fisher Scoring iterations: 4
Wendy commented: When var_power = 2 and link_power = 1, the coefficients agree but the standard errors do not. However, they are closer than in the var_power = 0 case.
Wendy commented: I noticed that when the relationship link_power = 1 - var_power holds, my computation and R's agree. When that relationship no longer holds, the standard errors from my algorithm and R's start to diverge. I implemented this from first principles without any approximation, while R may use some form of approximation to the Hessian and gradient. As we move away from the relationship link_power = 1 - var_power, that approximation diverges from the true Hessian and gradient calculations.
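For context, here is a hedged sketch (not the H2O or R internals; the function name is mine) of the standard-error formula both sides ultimately compute, cov(beta) = phi * (X'WX)^{-1} with IRLS weights W and dispersion phi. The disagreement described above comes from how W (i.e. the Hessian) is approximated, not from this final step:

```python
import numpy as np

def glm_std_errors(X, w, dispersion):
    """Standard errors of GLM coefficients from the information matrix:
    cov(beta) = dispersion * (X' W X)^{-1}, with W = diag(w).
    The IRLS weights w depend on the variance function (var_power) and
    the link (link_power); approximating them differently changes the
    standard errors even when the fitted coefficients agree."""
    XtWX = X.T @ (w[:, None] * X)        # X' W X without forming diag(W)
    cov = dispersion * np.linalg.inv(XtWX)
    return np.sqrt(np.diag(cov))
```

With unit weights and identity link this reduces to the familiar OLS formula sqrt(diag(sigma^2 * (X'X)^{-1})).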
There are places where, to avoid 1/0, I chose to compute 1/1e-6 instead. These kinds of arbitrary substitutions make algorithm comparisons difficult.
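The substitution can be written as a small guard (illustrative sketch only; the 1e-6 floor matches the comment above, the function name is mine):

```python
EPS = 1e-6  # floor used in place of exact zero, per the comment above

def safe_reciprocal(x, eps=EPS):
    """Return 1/x, clamping |x| away from zero so 1/0 becomes 1/1e-6.
    This keeps the iteration finite, but near the boundary the result
    is arbitrary, which makes cross-library comparisons difficult."""
    if abs(x) < eps:
        x = eps if x >= 0 else -eps
    return 1.0 / x
```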
JIRA Issue Migration Info
Jira Issue: PUBDEV-6457
Assignee: Wendy
Reporter: Wendy
State: Resolved
Fix Version: 3.24.0.5
Attachments: Available (Count: 1)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/3515 https://github.com/h2oai/h2o-3/pull/3516
Attachments From Jira
Attachment Name: GLMStdErrorCal.pdf Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6457/GLMStdErrorCal.pdf
Wendy commented: Two tasks: